├── LICENSE.md ├── README.md ├── base_model.py ├── config.py ├── dataset.py ├── eval.sh ├── examples ├── COCO_val2014_000000018295_result.jpg ├── COCO_val2014_000000072776_result.jpg ├── COCO_val2014_000000153130_result.jpg ├── COCO_val2014_000000214274_result.jpg ├── COCO_val2014_000000222261_result.jpg ├── COCO_val2014_000000261185_result.jpg ├── COCO_val2014_000000370315_result.jpg ├── COCO_val2014_000000535467_result.jpg └── examples.jpg ├── main.py ├── model.py ├── models ├── readme └── trim_model.py ├── summary └── readme ├── test ├── images │ ├── 1.jpg │ ├── 2.jpg │ └── 3.jpg └── results │ ├── 1_result.jpg │ ├── 2_result.jpg │ └── 3_result.jpg ├── train ├── images │ └── readme └── readme ├── utils ├── __init__.py ├── coco │ ├── __init__.py │ ├── coco.py │ ├── license.txt │ ├── pycocoevalcap │ │ ├── __init__.py │ │ ├── bleu │ │ │ ├── LICENSE │ │ │ ├── __init__.py │ │ │ ├── bleu.py │ │ │ └── bleu_scorer.py │ │ ├── cider │ │ │ ├── __init__.py │ │ │ ├── cider.py │ │ │ └── cider_scorer.py │ │ ├── eval.py │ │ ├── meteor │ │ │ ├── __init__.py │ │ │ ├── data │ │ │ │ └── paraphrase-en.gz │ │ │ ├── meteor-1.5.jar │ │ │ └── meteor.py │ │ ├── readme.md │ │ ├── rouge │ │ │ ├── __init__.py │ │ │ └── rouge.py │ │ └── tokenizer │ │ │ ├── __init__.py │ │ │ ├── ptbtokenizer.py │ │ │ └── stanford-corenlp-3.4.1.jar │ └── readme.md ├── ilsvrc_2012_mean.npy ├── misc.py ├── nn.py └── vocabulary.py └── val ├── images └── readme └── readme /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Guoming Wang & Wenhua Guan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Introduction 2 | This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A convolutional neural network extracts visual features from the image, and an LSTM recurrent neural network decodes these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the captions. The project is implemented with the TensorFlow library and allows end-to-end training of both the CNN and RNN parts.
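For orientation, the soft-attention step at the core of the decoder can be sketched as below. This is a minimal NumPy sketch, not the actual implementation (that lives in `model.py`); the shapes correspond to the VGG16 setting (196 spatial locations of 512-dim conv5_3 features), and the parameter names (`w_att`, `h`) are purely illustrative.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_ctx, dim_ctx, num_lstm_units = 196, 512, 1024   # VGG16 conv5_3 gives a 14x14x512 feature map
features = np.random.rand(num_ctx, dim_ctx)          # stand-in for the CNN feature map
h = np.zeros(num_lstm_units)                         # previous LSTM output
w_att = np.random.rand(dim_ctx + num_lstm_units)     # toy attention parameters

# Score every spatial location against the current LSTM state, normalize the
# scores into attention weights alpha, and form the attended context vector.
# The context is concatenated with the last word's embedding and fed into the
# LSTM, whose output is then decoded into the next word.
scores = np.array([w_att @ np.concatenate([f, h]) for f in features])
alpha = softmax(scores)                               # sums to 1 over the 196 locations
context = (alpha[:, None] * features).sum(axis=0)     # weighted average of the features
```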
3 | 4 | ### Prerequisites 5 | * **TensorFlow** ([instructions](https://www.tensorflow.org/install/)) 6 | * **NumPy** ([instructions](https://scipy.org/install.html)) 7 | * **OpenCV** ([instructions](https://pypi.python.org/pypi/opencv-python)) 8 | * **Natural Language Toolkit (NLTK)** ([instructions](http://www.nltk.org/install.html)) 9 | * **Pandas** ([instructions](https://scipy.org/install.html)) 10 | * **Matplotlib** ([instructions](https://scipy.org/install.html)) 11 | * **tqdm** ([instructions](https://pypi.python.org/pypi/tqdm)) 12 | 13 | ### Usage 14 | * **Preparation:** Download the COCO train2014 and val2014 data [here](http://cocodataset.org/#download). Put the COCO train2014 images in the folder `train/images`, and put the file `captions_train2014.json` in the folder `train`. Similarly, put the COCO val2014 images in the folder `val/images`, and put the file `captions_val2014.json` in the folder `val`. Furthermore, download the pretrained VGG16 net [here](https://app.box.com/s/idt5khauxsamcg3y69jz13w6sc6122ph) or ResNet50 net [here](https://app.box.com/s/17vthb1zl0zeh340m4gaw0luuf2vscne) if you want to use it to initialize the CNN part. 15 | 16 | * **Training:** 17 | To train a model using the COCO train2014 data, first set up the parameters in the file `config.py` and then run a command like this: 18 | ```shell 19 | python main.py --phase=train \ 20 | --load_cnn \ 21 | --cnn_model_file='./vgg16_no_fc.npy' \ 22 | [--train_cnn] 23 | ``` 24 | Turn on `--train_cnn` if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder `models`. If you want to resume training from a checkpoint, run a command like this: 25 | ```shell 26 | python main.py --phase=train \ 27 | --load \ 28 | --model_file='./models/xxxxxx.npy' \ 29 | [--train_cnn] 30 | ``` 31 | To monitor the progress of training, run the following command: 32 | ```shell 33 | tensorboard --logdir='./summary/' 34 | ``` 35 | 36 | * **Evaluation:** 37 | To evaluate a trained model using the COCO val2014 data, run a command like this: 38 | ```shell 39 | python main.py --phase=eval \ 40 | --model_file='./models/xxxxxx.npy' \ 41 | --beam_size=3 42 | ``` 43 | The results will be printed to stdout. Furthermore, the generated captions will be saved in the file `val/results.json`. 44 | 45 | * **Inference:** 46 | You can use the trained model to generate captions for any JPEG images! Put such images in the folder `test/images`, and run a command like this: 47 | ```shell 48 | python main.py --phase=test \ 49 | --model_file='./models/xxxxxx.npy' \ 50 | --beam_size=3 51 | ``` 52 | The generated captions will be saved in the folder `test/results`. 53 | 54 | ### Results 55 | A pretrained model with the default configuration can be downloaded [here](https://app.box.com/s/xuigzzaqfbpnf76t295h109ey9po5t8p). This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with `beam_size=3`): 56 | * **BLEU-1 = 70.3%** 57 | * **BLEU-2 = 53.6%** 58 | * **BLEU-3 = 39.8%** 59 | * **BLEU-4 = 29.5%** 60 | 61 | Here are some captions generated by this model: 62 | ![examples](examples/examples.jpg) 63 | 64 | ### References 65 | * [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
66 | * [The original implementation in Theano](https://github.com/kelvinxu/arctic-captions) 67 | * [An earlier implementation in Tensorflow](https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow) 68 | * [Microsoft COCO dataset](http://mscoco.org/) 69 | -------------------------------------------------------------------------------- /base_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import tensorflow as tf 5 | import matplotlib.pyplot as plt 6 | import cPickle as pickle 7 | import copy 8 | import json 9 | from tqdm import tqdm 10 | 11 | from utils.nn import NN 12 | from utils.coco.coco import COCO 13 | from utils.coco.pycocoevalcap.eval import COCOEvalCap 14 | from utils.misc import ImageLoader, CaptionData, TopN 15 | 16 | class BaseModel(object): 17 | def __init__(self, config): 18 | self.config = config 19 | self.is_train = True if config.phase == 'train' else False 20 | self.train_cnn = self.is_train and config.train_cnn 21 | self.image_loader = ImageLoader('./utils/ilsvrc_2012_mean.npy') 22 | self.image_shape = [224, 224, 3] 23 | self.nn = NN(config) 24 | self.global_step = tf.Variable(0, 25 | name = 'global_step', 26 | trainable = False) 27 | self.build() 28 | 29 | def build(self): 30 | raise NotImplementedError() 31 | 32 | def train(self, sess, train_data): 33 | """ Train the model using the COCO train2014 data. """ 34 | print("Training the model...") 35 | config = self.config 36 | 37 | if not os.path.exists(config.summary_dir): 38 | os.mkdir(config.summary_dir) 39 | train_writer = tf.summary.FileWriter(config.summary_dir, 40 | sess.graph) 41 | 42 | for _ in tqdm(list(range(config.num_epochs)), desc='epoch'): 43 | for _ in tqdm(list(range(train_data.num_batches)), desc='batch'): 44 | batch = train_data.next_batch() 45 | image_files, sentences, masks = batch 46 | images = self.image_loader.load_images(image_files) 47 | feed_dict = {self.images: images, 48 | self.sentences: sentences, 49 | self.masks: masks} 50 | _, summary, global_step = sess.run([self.opt_op, 51 | self.summary, 52 | self.global_step], 53 | feed_dict=feed_dict) 54 | if (global_step + 1) % config.save_period == 0: 55 | self.save() 56 | train_writer.add_summary(summary, global_step) 57 | train_data.reset() 58 | 59 | self.save() 60 | train_writer.close() 61 | print("Training complete.") 62 | 63 | def eval(self, sess, eval_gt_coco, eval_data, vocabulary): 64 | """ Evaluate the model using the COCO val2014 data. 
""" 65 | print("Evaluating the model ...") 66 | config = self.config 67 | 68 | results = [] 69 | if not os.path.exists(config.eval_result_dir): 70 | os.mkdir(config.eval_result_dir) 71 | 72 | # Generate the captions for the images 73 | idx = 0 74 | for k in tqdm(list(range(eval_data.num_batches)), desc='batch'): 75 | batch = eval_data.next_batch() 76 | caption_data = self.beam_search(sess, batch, vocabulary) 77 | 78 | fake_cnt = 0 if k "${filename}.txt" 8 | done 9 | exit 0 10 | -------------------------------------------------------------------------------- /examples/COCO_val2014_000000018295_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000018295_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000072776_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000072776_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000153130_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000153130_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000214274_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000214274_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000222261_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000222261_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000261185_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000261185_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000370315_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000370315_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000535467_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000535467_result.jpg -------------------------------------------------------------------------------- /examples/examples.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/examples.jpg -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import tensorflow as tf 3 | 4 | from config import Config 5 | from model import CaptionGenerator 6 | from dataset import prepare_train_data, prepare_eval_data, prepare_test_data 7 | 8 | FLAGS = tf.app.flags.FLAGS 9 | 10 | tf.flags.DEFINE_string('phase', 'train', 11 | 'The phase can be train, eval or test') 12 | 13 | tf.flags.DEFINE_boolean('load', False, 14 | 'Turn on to load a pretrained model from either \ 15 | the latest checkpoint or a specified file') 16 | 17 | tf.flags.DEFINE_string('model_file', None, 18 | 'If sepcified, load a pretrained model from this file') 19 | 20 | tf.flags.DEFINE_boolean('load_cnn', False, 21 | 'Turn on to load a pretrained CNN model') 22 | 23 | tf.flags.DEFINE_string('cnn_model_file', './vgg16_no_fc.npy', 24 | 'The file containing a pretrained CNN model') 25 | 26 | tf.flags.DEFINE_boolean('train_cnn', False, 27 | 'Turn on to train both CNN and RNN. \ 28 | Otherwise, only RNN is trained') 29 | 30 | tf.flags.DEFINE_integer('beam_size', 3, 31 | 'The size of beam search for caption generation') 32 | 33 | def main(argv): 34 | config = Config() 35 | config.phase = FLAGS.phase 36 | config.train_cnn = FLAGS.train_cnn 37 | config.beam_size = FLAGS.beam_size 38 | 39 | with tf.Session() as sess: 40 | if FLAGS.phase == 'train': 41 | # training phase 42 | data = prepare_train_data(config) 43 | model = CaptionGenerator(config) 44 | sess.run(tf.global_variables_initializer()) 45 | if FLAGS.load: 46 | model.load(sess, FLAGS.model_file) 47 | if FLAGS.load_cnn: 48 | model.load_cnn(sess, FLAGS.cnn_model_file) 49 | tf.get_default_graph().finalize() 50 | model.train(sess, data) 51 | 52 | elif FLAGS.phase == 'eval': 53 | # evaluation phase 54 | coco, data, vocabulary = prepare_eval_data(config) 55 | model = CaptionGenerator(config) 56 | model.load(sess, FLAGS.model_file) 57 | tf.get_default_graph().finalize() 58 | model.eval(sess, coco, data, vocabulary) 59 | 60 | else: 61 | # testing phase 62 | data, vocabulary = prepare_test_data(config) 63 | model = CaptionGenerator(config) 64 | model.load(sess, FLAGS.model_file) 65 | tf.get_default_graph().finalize() 66 | model.test(sess, data, vocabulary) 67 | 68 | if __name__ == '__main__': 69 | tf.app.run() 70 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | from base_model import BaseModel 5 | 6 | class CaptionGenerator(BaseModel): 7 | def build(self): 8 | """ Build the model. """ 9 | self.build_cnn() 10 | self.build_rnn() 11 | if self.is_train: 12 | self.build_optimizer() 13 | self.build_summary() 14 | 15 | def build_cnn(self): 16 | """ Build the CNN. """ 17 | print("Building the CNN...") 18 | if self.config.cnn == 'vgg16': 19 | self.build_vgg16() 20 | else: 21 | self.build_resnet50() 22 | print("CNN built.") 23 | 24 | def build_vgg16(self): 25 | """ Build the VGG16 net. 
""" 26 | config = self.config 27 | 28 | images = tf.placeholder( 29 | dtype = tf.float32, 30 | shape = [config.batch_size] + self.image_shape) 31 | 32 | conv1_1_feats = self.nn.conv2d(images, 64, name = 'conv1_1') 33 | conv1_2_feats = self.nn.conv2d(conv1_1_feats, 64, name = 'conv1_2') 34 | pool1_feats = self.nn.max_pool2d(conv1_2_feats, name = 'pool1') 35 | 36 | conv2_1_feats = self.nn.conv2d(pool1_feats, 128, name = 'conv2_1') 37 | conv2_2_feats = self.nn.conv2d(conv2_1_feats, 128, name = 'conv2_2') 38 | pool2_feats = self.nn.max_pool2d(conv2_2_feats, name = 'pool2') 39 | 40 | conv3_1_feats = self.nn.conv2d(pool2_feats, 256, name = 'conv3_1') 41 | conv3_2_feats = self.nn.conv2d(conv3_1_feats, 256, name = 'conv3_2') 42 | conv3_3_feats = self.nn.conv2d(conv3_2_feats, 256, name = 'conv3_3') 43 | pool3_feats = self.nn.max_pool2d(conv3_3_feats, name = 'pool3') 44 | 45 | conv4_1_feats = self.nn.conv2d(pool3_feats, 512, name = 'conv4_1') 46 | conv4_2_feats = self.nn.conv2d(conv4_1_feats, 512, name = 'conv4_2') 47 | conv4_3_feats = self.nn.conv2d(conv4_2_feats, 512, name = 'conv4_3') 48 | pool4_feats = self.nn.max_pool2d(conv4_3_feats, name = 'pool4') 49 | 50 | conv5_1_feats = self.nn.conv2d(pool4_feats, 512, name = 'conv5_1') 51 | conv5_2_feats = self.nn.conv2d(conv5_1_feats, 512, name = 'conv5_2') 52 | conv5_3_feats = self.nn.conv2d(conv5_2_feats, 512, name = 'conv5_3') 53 | 54 | reshaped_conv5_3_feats = tf.reshape(conv5_3_feats, 55 | [config.batch_size, 196, 512]) 56 | 57 | self.conv_feats = reshaped_conv5_3_feats 58 | self.num_ctx = 196 59 | self.dim_ctx = 512 60 | self.images = images 61 | 62 | def build_resnet50(self): 63 | """ Build the ResNet50. """ 64 | config = self.config 65 | 66 | images = tf.placeholder( 67 | dtype = tf.float32, 68 | shape = [config.batch_size] + self.image_shape) 69 | 70 | conv1_feats = self.nn.conv2d(images, 71 | filters = 64, 72 | kernel_size = (7, 7), 73 | strides = (2, 2), 74 | activation = None, 75 | name = 'conv1') 76 | conv1_feats = self.nn.batch_norm(conv1_feats, 'bn_conv1') 77 | conv1_feats = tf.nn.relu(conv1_feats) 78 | pool1_feats = self.nn.max_pool2d(conv1_feats, 79 | pool_size = (3, 3), 80 | strides = (2, 2), 81 | name = 'pool1') 82 | 83 | res2a_feats = self.resnet_block(pool1_feats, 'res2a', 'bn2a', 64, 1) 84 | res2b_feats = self.resnet_block2(res2a_feats, 'res2b', 'bn2b', 64) 85 | res2c_feats = self.resnet_block2(res2b_feats, 'res2c', 'bn2c', 64) 86 | 87 | res3a_feats = self.resnet_block(res2c_feats, 'res3a', 'bn3a', 128) 88 | res3b_feats = self.resnet_block2(res3a_feats, 'res3b', 'bn3b', 128) 89 | res3c_feats = self.resnet_block2(res3b_feats, 'res3c', 'bn3c', 128) 90 | res3d_feats = self.resnet_block2(res3c_feats, 'res3d', 'bn3d', 128) 91 | 92 | res4a_feats = self.resnet_block(res3d_feats, 'res4a', 'bn4a', 256) 93 | res4b_feats = self.resnet_block2(res4a_feats, 'res4b', 'bn4b', 256) 94 | res4c_feats = self.resnet_block2(res4b_feats, 'res4c', 'bn4c', 256) 95 | res4d_feats = self.resnet_block2(res4c_feats, 'res4d', 'bn4d', 256) 96 | res4e_feats = self.resnet_block2(res4d_feats, 'res4e', 'bn4e', 256) 97 | res4f_feats = self.resnet_block2(res4e_feats, 'res4f', 'bn4f', 256) 98 | 99 | res5a_feats = self.resnet_block(res4f_feats, 'res5a', 'bn5a', 512) 100 | res5b_feats = self.resnet_block2(res5a_feats, 'res5b', 'bn5b', 512) 101 | res5c_feats = self.resnet_block2(res5b_feats, 'res5c', 'bn5c', 512) 102 | 103 | reshaped_res5c_feats = tf.reshape(res5c_feats, 104 | [config.batch_size, 49, 2048]) 105 | 106 | self.conv_feats = reshaped_res5c_feats 107 | 
self.num_ctx = 49 108 | self.dim_ctx = 2048 109 | self.images = images 110 | 111 | def resnet_block(self, inputs, name1, name2, c, s=2): 112 | """ A basic block of ResNet. """ 113 | branch1_feats = self.nn.conv2d(inputs, 114 | filters = 4*c, 115 | kernel_size = (1, 1), 116 | strides = (s, s), 117 | activation = None, 118 | use_bias = False, 119 | name = name1+'_branch1') 120 | branch1_feats = self.nn.batch_norm(branch1_feats, name2+'_branch1') 121 | 122 | branch2a_feats = self.nn.conv2d(inputs, 123 | filters = c, 124 | kernel_size = (1, 1), 125 | strides = (s, s), 126 | activation = None, 127 | use_bias = False, 128 | name = name1+'_branch2a') 129 | branch2a_feats = self.nn.batch_norm(branch2a_feats, name2+'_branch2a') 130 | branch2a_feats = tf.nn.relu(branch2a_feats) 131 | 132 | branch2b_feats = self.nn.conv2d(branch2a_feats, 133 | filters = c, 134 | kernel_size = (3, 3), 135 | strides = (1, 1), 136 | activation = None, 137 | use_bias = False, 138 | name = name1+'_branch2b') 139 | branch2b_feats = self.nn.batch_norm(branch2b_feats, name2+'_branch2b') 140 | branch2b_feats = tf.nn.relu(branch2b_feats) 141 | 142 | branch2c_feats = self.nn.conv2d(branch2b_feats, 143 | filters = 4*c, 144 | kernel_size = (1, 1), 145 | strides = (1, 1), 146 | activation = None, 147 | use_bias = False, 148 | name = name1+'_branch2c') 149 | branch2c_feats = self.nn.batch_norm(branch2c_feats, name2+'_branch2c') 150 | 151 | outputs = branch1_feats + branch2c_feats 152 | outputs = tf.nn.relu(outputs) 153 | return outputs 154 | 155 | def resnet_block2(self, inputs, name1, name2, c): 156 | """ Another basic block of ResNet. """ 157 | branch2a_feats = self.nn.conv2d(inputs, 158 | filters = c, 159 | kernel_size = (1, 1), 160 | strides = (1, 1), 161 | activation = None, 162 | use_bias = False, 163 | name = name1+'_branch2a') 164 | branch2a_feats = self.nn.batch_norm(branch2a_feats, name2+'_branch2a') 165 | branch2a_feats = tf.nn.relu(branch2a_feats) 166 | 167 | branch2b_feats = self.nn.conv2d(branch2a_feats, 168 | filters = c, 169 | kernel_size = (3, 3), 170 | strides = (1, 1), 171 | activation = None, 172 | use_bias = False, 173 | name = name1+'_branch2b') 174 | branch2b_feats = self.nn.batch_norm(branch2b_feats, name2+'_branch2b') 175 | branch2b_feats = tf.nn.relu(branch2b_feats) 176 | 177 | branch2c_feats = self.nn.conv2d(branch2b_feats, 178 | filters = 4*c, 179 | kernel_size = (1, 1), 180 | strides = (1, 1), 181 | activation = None, 182 | use_bias = False, 183 | name = name1+'_branch2c') 184 | branch2c_feats = self.nn.batch_norm(branch2c_feats, name2+'_branch2c') 185 | 186 | outputs = inputs + branch2c_feats 187 | outputs = tf.nn.relu(outputs) 188 | return outputs 189 | 190 | def build_rnn(self): 191 | """ Build the RNN. 
""" 192 | print("Building the RNN...") 193 | config = self.config 194 | 195 | # Setup the placeholders 196 | if self.is_train: 197 | contexts = self.conv_feats 198 | sentences = tf.placeholder( 199 | dtype = tf.int32, 200 | shape = [config.batch_size, config.max_caption_length]) 201 | masks = tf.placeholder( 202 | dtype = tf.float32, 203 | shape = [config.batch_size, config.max_caption_length]) 204 | else: 205 | contexts = tf.placeholder( 206 | dtype = tf.float32, 207 | shape = [config.batch_size, self.num_ctx, self.dim_ctx]) 208 | last_memory = tf.placeholder( 209 | dtype = tf.float32, 210 | shape = [config.batch_size, config.num_lstm_units]) 211 | last_output = tf.placeholder( 212 | dtype = tf.float32, 213 | shape = [config.batch_size, config.num_lstm_units]) 214 | last_word = tf.placeholder( 215 | dtype = tf.int32, 216 | shape = [config.batch_size]) 217 | 218 | # Setup the word embedding 219 | with tf.variable_scope("word_embedding"): 220 | embedding_matrix = tf.get_variable( 221 | name = 'weights', 222 | shape = [config.vocabulary_size, config.dim_embedding], 223 | initializer = self.nn.fc_kernel_initializer, 224 | regularizer = self.nn.fc_kernel_regularizer, 225 | trainable = self.is_train) 226 | 227 | # Setup the LSTM 228 | lstm = tf.nn.rnn_cell.LSTMCell( 229 | config.num_lstm_units, 230 | initializer = self.nn.fc_kernel_initializer) 231 | if self.is_train: 232 | lstm = tf.nn.rnn_cell.DropoutWrapper( 233 | lstm, 234 | input_keep_prob = 1.0-config.lstm_drop_rate, 235 | output_keep_prob = 1.0-config.lstm_drop_rate, 236 | state_keep_prob = 1.0-config.lstm_drop_rate) 237 | 238 | # Initialize the LSTM using the mean context 239 | with tf.variable_scope("initialize"): 240 | context_mean = tf.reduce_mean(self.conv_feats, axis = 1) 241 | initial_memory, initial_output = self.initialize(context_mean) 242 | initial_state = initial_memory, initial_output 243 | 244 | # Prepare to run 245 | predictions = [] 246 | if self.is_train: 247 | alphas = [] 248 | cross_entropies = [] 249 | predictions_correct = [] 250 | num_steps = config.max_caption_length 251 | last_output = initial_output 252 | last_memory = initial_memory 253 | last_word = tf.zeros([config.batch_size], tf.int32) 254 | else: 255 | num_steps = 1 256 | last_state = last_memory, last_output 257 | 258 | # Generate the words one by one 259 | for idx in range(num_steps): 260 | # Attention mechanism 261 | with tf.variable_scope("attend"): 262 | alpha = self.attend(contexts, last_output) 263 | context = tf.reduce_sum(contexts*tf.expand_dims(alpha, 2), 264 | axis = 1) 265 | if self.is_train: 266 | tiled_masks = tf.tile(tf.expand_dims(masks[:, idx], 1), 267 | [1, self.num_ctx]) 268 | masked_alpha = alpha * tiled_masks 269 | alphas.append(tf.reshape(masked_alpha, [-1])) 270 | 271 | # Embed the last word 272 | with tf.variable_scope("word_embedding"): 273 | word_embed = tf.nn.embedding_lookup(embedding_matrix, 274 | last_word) 275 | # Apply the LSTM 276 | with tf.variable_scope("lstm"): 277 | current_input = tf.concat([context, word_embed], 1) 278 | output, state = lstm(current_input, last_state) 279 | memory, _ = state 280 | 281 | # Decode the expanded output of LSTM into a word 282 | with tf.variable_scope("decode"): 283 | expanded_output = tf.concat([output, 284 | context, 285 | word_embed], 286 | axis = 1) 287 | logits = self.decode(expanded_output) 288 | probs = tf.nn.softmax(logits) 289 | prediction = tf.argmax(logits, 1) 290 | predictions.append(prediction) 291 | 292 | # Compute the loss for this step, if necessary 293 | if self.is_train: 
294 | cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits( 295 | labels = sentences[:, idx], 296 | logits = logits) 297 | masked_cross_entropy = cross_entropy * masks[:, idx] 298 | cross_entropies.append(masked_cross_entropy) 299 | 300 | ground_truth = tf.cast(sentences[:, idx], tf.int64) 301 | prediction_correct = tf.where( 302 | tf.equal(prediction, ground_truth), 303 | tf.cast(masks[:, idx], tf.float32), 304 | tf.cast(tf.zeros_like(prediction), tf.float32)) 305 | predictions_correct.append(prediction_correct) 306 | 307 | last_output = output 308 | last_memory = memory 309 | last_state = state 310 | last_word = sentences[:, idx] 311 | 312 | tf.get_variable_scope().reuse_variables() 313 | 314 | # Compute the final loss, if necessary 315 | if self.is_train: 316 | cross_entropies = tf.stack(cross_entropies, axis = 1) 317 | cross_entropy_loss = tf.reduce_sum(cross_entropies) \ 318 | / tf.reduce_sum(masks) 319 | 320 | alphas = tf.stack(alphas, axis = 1) 321 | alphas = tf.reshape(alphas, [config.batch_size, self.num_ctx, -1]) 322 | attentions = tf.reduce_sum(alphas, axis = 2) 323 | diffs = tf.ones_like(attentions) - attentions 324 | attention_loss = config.attention_loss_factor \ 325 | * tf.nn.l2_loss(diffs) \ 326 | / (config.batch_size * self.num_ctx) 327 | 328 | reg_loss = tf.losses.get_regularization_loss() 329 | 330 | total_loss = cross_entropy_loss + attention_loss + reg_loss 331 | 332 | predictions_correct = tf.stack(predictions_correct, axis = 1) 333 | accuracy = tf.reduce_sum(predictions_correct) \ 334 | / tf.reduce_sum(masks) 335 | 336 | self.contexts = contexts 337 | if self.is_train: 338 | self.sentences = sentences 339 | self.masks = masks 340 | self.total_loss = total_loss 341 | self.cross_entropy_loss = cross_entropy_loss 342 | self.attention_loss = attention_loss 343 | self.reg_loss = reg_loss 344 | self.accuracy = accuracy 345 | self.attentions = attentions 346 | else: 347 | self.initial_memory = initial_memory 348 | self.initial_output = initial_output 349 | self.last_memory = last_memory 350 | self.last_output = last_output 351 | self.last_word = last_word 352 | self.memory = memory 353 | self.output = output 354 | self.probs = probs 355 | 356 | print("RNN built.") 357 | 358 | def initialize(self, context_mean): 359 | """ Initialize the LSTM using the mean context. """ 360 | config = self.config 361 | context_mean = self.nn.dropout(context_mean) 362 | if config.num_initalize_layers == 1: 363 | # use 1 fc layer to initialize 364 | memory = self.nn.dense(context_mean, 365 | units = config.num_lstm_units, 366 | activation = None, 367 | name = 'fc_a') 368 | output = self.nn.dense(context_mean, 369 | units = config.num_lstm_units, 370 | activation = None, 371 | name = 'fc_b') 372 | else: 373 | # use 2 fc layers to initialize 374 | temp1 = self.nn.dense(context_mean, 375 | units = config.dim_initalize_layer, 376 | activation = tf.tanh, 377 | name = 'fc_a1') 378 | temp1 = self.nn.dropout(temp1) 379 | memory = self.nn.dense(temp1, 380 | units = config.num_lstm_units, 381 | activation = None, 382 | name = 'fc_a2') 383 | 384 | temp2 = self.nn.dense(context_mean, 385 | units = config.dim_initalize_layer, 386 | activation = tf.tanh, 387 | name = 'fc_b1') 388 | temp2 = self.nn.dropout(temp2) 389 | output = self.nn.dense(temp2, 390 | units = config.num_lstm_units, 391 | activation = None, 392 | name = 'fc_b2') 393 | return memory, output 394 | 395 | def attend(self, contexts, output): 396 | """ Attention Mechanism. 
""" 397 | config = self.config 398 | reshaped_contexts = tf.reshape(contexts, [-1, self.dim_ctx]) 399 | reshaped_contexts = self.nn.dropout(reshaped_contexts) 400 | output = self.nn.dropout(output) 401 | if config.num_attend_layers == 1: 402 | # use 1 fc layer to attend 403 | logits1 = self.nn.dense(reshaped_contexts, 404 | units = 1, 405 | activation = None, 406 | use_bias = False, 407 | name = 'fc_a') 408 | logits1 = tf.reshape(logits1, [-1, self.num_ctx]) 409 | logits2 = self.nn.dense(output, 410 | units = self.num_ctx, 411 | activation = None, 412 | use_bias = False, 413 | name = 'fc_b') 414 | logits = logits1 + logits2 415 | else: 416 | # use 2 fc layers to attend 417 | temp1 = self.nn.dense(reshaped_contexts, 418 | units = config.dim_attend_layer, 419 | activation = tf.tanh, 420 | name = 'fc_1a') 421 | temp2 = self.nn.dense(output, 422 | units = config.dim_attend_layer, 423 | activation = tf.tanh, 424 | name = 'fc_1b') 425 | temp2 = tf.tile(tf.expand_dims(temp2, 1), [1, self.num_ctx, 1]) 426 | temp2 = tf.reshape(temp2, [-1, config.dim_attend_layer]) 427 | temp = temp1 + temp2 428 | temp = self.nn.dropout(temp) 429 | logits = self.nn.dense(temp, 430 | units = 1, 431 | activation = None, 432 | use_bias = False, 433 | name = 'fc_2') 434 | logits = tf.reshape(logits, [-1, self.num_ctx]) 435 | alpha = tf.nn.softmax(logits) 436 | return alpha 437 | 438 | def decode(self, expanded_output): 439 | """ Decode the expanded output of the LSTM into a word. """ 440 | config = self.config 441 | expanded_output = self.nn.dropout(expanded_output) 442 | if config.num_decode_layers == 1: 443 | # use 1 fc layer to decode 444 | logits = self.nn.dense(expanded_output, 445 | units = config.vocabulary_size, 446 | activation = None, 447 | name = 'fc') 448 | else: 449 | # use 2 fc layers to decode 450 | temp = self.nn.dense(expanded_output, 451 | units = config.dim_decode_layer, 452 | activation = tf.tanh, 453 | name = 'fc_1') 454 | temp = self.nn.dropout(temp) 455 | logits = self.nn.dense(temp, 456 | units = config.vocabulary_size, 457 | activation = None, 458 | name = 'fc_2') 459 | return logits 460 | 461 | def build_optimizer(self): 462 | """ Setup the optimizer and training operation. 
""" 463 | config = self.config 464 | 465 | learning_rate = tf.constant(config.initial_learning_rate) 466 | if config.learning_rate_decay_factor < 1.0: 467 | def _learning_rate_decay_fn(learning_rate, global_step): 468 | return tf.train.exponential_decay( 469 | learning_rate, 470 | global_step, 471 | decay_steps = config.num_steps_per_decay, 472 | decay_rate = config.learning_rate_decay_factor, 473 | staircase = True) 474 | learning_rate_decay_fn = _learning_rate_decay_fn 475 | else: 476 | learning_rate_decay_fn = None 477 | 478 | with tf.variable_scope('optimizer', reuse = tf.AUTO_REUSE): 479 | if config.optimizer == 'Adam': 480 | optimizer = tf.train.AdamOptimizer( 481 | learning_rate = config.initial_learning_rate, 482 | beta1 = config.beta1, 483 | beta2 = config.beta2, 484 | epsilon = config.epsilon 485 | ) 486 | elif config.optimizer == 'RMSProp': 487 | optimizer = tf.train.RMSPropOptimizer( 488 | learning_rate = config.initial_learning_rate, 489 | decay = config.decay, 490 | momentum = config.momentum, 491 | centered = config.centered, 492 | epsilon = config.epsilon 493 | ) 494 | elif config.optimizer == 'Momentum': 495 | optimizer = tf.train.MomentumOptimizer( 496 | learning_rate = config.initial_learning_rate, 497 | momentum = config.momentum, 498 | use_nesterov = config.use_nesterov 499 | ) 500 | else: 501 | optimizer = tf.train.GradientDescentOptimizer( 502 | learning_rate = config.initial_learning_rate 503 | ) 504 | 505 | opt_op = tf.contrib.layers.optimize_loss( 506 | loss = self.total_loss, 507 | global_step = self.global_step, 508 | learning_rate = learning_rate, 509 | optimizer = optimizer, 510 | clip_gradients = config.clip_gradients, 511 | learning_rate_decay_fn = learning_rate_decay_fn) 512 | 513 | self.opt_op = opt_op 514 | 515 | def build_summary(self): 516 | """ Build the summary (for TensorBoard visualization). """ 517 | with tf.name_scope("variables"): 518 | for var in tf.trainable_variables(): 519 | with tf.name_scope(var.name[:var.name.find(":")]): 520 | self.variable_summary(var) 521 | 522 | with tf.name_scope("metrics"): 523 | tf.summary.scalar("cross_entropy_loss", self.cross_entropy_loss) 524 | tf.summary.scalar("attention_loss", self.attention_loss) 525 | tf.summary.scalar("reg_loss", self.reg_loss) 526 | tf.summary.scalar("total_loss", self.total_loss) 527 | tf.summary.scalar("accuracy", self.accuracy) 528 | 529 | with tf.name_scope("attentions"): 530 | self.variable_summary(self.attentions) 531 | 532 | self.summary = tf.summary.merge_all() 533 | 534 | def variable_summary(self, var): 535 | """ Build the summary for a variable. """ 536 | mean = tf.reduce_mean(var) 537 | tf.summary.scalar('mean', mean) 538 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 539 | tf.summary.scalar('stddev', stddev) 540 | tf.summary.scalar('max', tf.reduce_max(var)) 541 | tf.summary.scalar('min', tf.reduce_min(var)) 542 | tf.summary.histogram('histogram', var) 543 | -------------------------------------------------------------------------------- /models/readme: -------------------------------------------------------------------------------- 1 | The trained models will be saved here. 2 | -------------------------------------------------------------------------------- /models/trim_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Run this script to remove the data that are only useful for training 4 | # from your model files in order to make them more compact. 
5 | 6 | import os 7 | import numpy as np 8 | 9 | if __name__=='__main__': 10 | files = os.listdir('.') 11 | model_files = [f for f in files if f.endswith('.npy')] 12 | 13 | for model_file in model_files: 14 | model = np.load(model_file).item() 15 | trimmed_model = {var_name: model[var_name] for var_name in model.keys() 16 | if 'optimizer' not in var_name} 17 | os.rename(model_file, model_file[:-4]+'_old.npy') 18 | np.save(model_file, trimmed_model) 19 | -------------------------------------------------------------------------------- /summary/readme: -------------------------------------------------------------------------------- 1 | The summary (for TensorBoard visualization) will be saved here. 2 | -------------------------------------------------------------------------------- /test/images/1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/1.jpg -------------------------------------------------------------------------------- /test/images/2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/2.jpg -------------------------------------------------------------------------------- /test/images/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/3.jpg -------------------------------------------------------------------------------- /test/results/1_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/1_result.jpg -------------------------------------------------------------------------------- /test/results/2_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/2_result.jpg -------------------------------------------------------------------------------- /test/results/3_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/3_result.jpg -------------------------------------------------------------------------------- /train/images/readme: -------------------------------------------------------------------------------- 1 | Put the COCO train2014 images here. 2 | -------------------------------------------------------------------------------- /train/readme: -------------------------------------------------------------------------------- 1 | Put the file captions_train2014.json here. 
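A quick sanity check of this layout before training (a minimal sketch; the paths are those given in the README, and the key names are the standard COCO captions annotation format):
```python
import os, json

assert os.path.isdir('./train/images'), 'put the COCO train2014 images in train/images'
with open('./train/captions_train2014.json') as f:
    anns = json.load(f)
print('%d captions for %d images' % (len(anns['annotations']), len(anns['images'])))
```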
2 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/__init__.py -------------------------------------------------------------------------------- /utils/coco/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/coco.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | __version__ = '2.0' 3 | # Interface for accessing the Microsoft COCO dataset. 4 | 5 | # Microsoft COCO is a large image dataset designed for object detection, 6 | # segmentation, and caption generation. pycocotools is a Python API that 7 | # assists in loading, parsing and visualizing the annotations in COCO. 8 | # Please visit http://mscoco.org/ for more information on COCO, including 9 | # for the data, paper, and tutorials. The exact format of the annotations 10 | # is also described on the COCO website. For example usage of the pycocotools 11 | # please see pycocotools_demo.ipynb. In addition to this API, please download both 12 | # the COCO images and annotations in order to run the demo. 13 | 14 | # An alternative to using the API is to load the annotations directly 15 | # into Python dictionary 16 | # Using the API provides additional utility functions. Note that this API 17 | # supports both *instance* and *caption* annotations. In the case of 18 | # captions not all functions are defined (e.g. categories are undefined). 19 | 20 | # The following API functions are defined: 21 | # COCO - COCO api class that loads COCO annotation file and prepare data structures. 22 | # decodeMask - Decode binary mask M encoded via run-length encoding. 23 | # encodeMask - Encode binary mask M using run-length encoding. 24 | # getAnnIds - Get ann ids that satisfy given filter conditions. 25 | # getCatIds - Get cat ids that satisfy given filter conditions. 26 | # getImgIds - Get img ids that satisfy given filter conditions. 27 | # loadAnns - Load anns with the specified ids. 28 | # loadCats - Load cats with the specified ids. 29 | # loadImgs - Load imgs with the specified ids. 30 | # segToMask - Convert polygon segmentation to binary mask. 31 | # showAnns - Display the specified annotations. 32 | # loadRes - Load algorithm results and create API for accessing them. 33 | # download - Download COCO images from mscoco.org server. 34 | # Throughout the API "ann"=annotation, "cat"=category, and "img"=image. 35 | # Help on each functions can be accessed by: "help COCO>function". 36 | 37 | # See also COCO>decodeMask, 38 | # COCO>encodeMask, COCO>getAnnIds, COCO>getCatIds, 39 | # COCO>getImgIds, COCO>loadAnns, COCO>loadCats, 40 | # COCO>loadImgs, COCO>segToMask, COCO>showAnns 41 | 42 | # Microsoft COCO Toolbox. version 2.0 43 | # Data, paper, and tutorials available at: http://mscoco.org/ 44 | # Code written by Piotr Dollar and Tsung-Yi Lin, 2014. 
45 | # Licensed under the Simplified BSD License [see bsd.txt] 46 | 47 | import json 48 | import datetime 49 | import time 50 | import matplotlib.pyplot as plt 51 | from matplotlib.collections import PatchCollection 52 | from matplotlib.patches import Polygon 53 | import numpy as np 54 | from skimage.draw import polygon 55 | import urllib 56 | import copy 57 | import itertools 58 | import os 59 | import string 60 | from tqdm import tqdm 61 | from nltk.tokenize import word_tokenize 62 | 63 | class COCO: 64 | def __init__(self, annotation_file=None): 65 | """ 66 | Constructor of Microsoft COCO helper class for reading and visualizing annotations. 67 | :param annotation_file (str): location of annotation file 68 | :param image_folder (str): location to the folder that hosts images. 69 | :return: 70 | """ 71 | # load dataset 72 | self.dataset = {} 73 | self.anns = [] 74 | self.imgToAnns = {} 75 | self.catToImgs = {} 76 | self.imgs = {} 77 | self.cats = {} 78 | self.img_name_to_id = {} 79 | 80 | if not annotation_file == None: 81 | print 'loading annotations into memory...' 82 | tic = time.time() 83 | dataset = json.load(open(annotation_file, 'r')) 84 | print 'Done (t=%0.2fs)'%(time.time()- tic) 85 | self.dataset = dataset 86 | self.process_dataset() 87 | self.createIndex() 88 | 89 | def createIndex(self): 90 | # create index 91 | print 'creating index...' 92 | anns = {} 93 | imgToAnns = {} 94 | catToImgs = {} 95 | cats = {} 96 | imgs = {} 97 | img_name_to_id = {} 98 | 99 | if 'annotations' in self.dataset: 100 | imgToAnns = {ann['image_id']: [] for ann in self.dataset['annotations']} 101 | anns = {ann['id']: [] for ann in self.dataset['annotations']} 102 | for ann in self.dataset['annotations']: 103 | imgToAnns[ann['image_id']] += [ann] 104 | anns[ann['id']] = ann 105 | 106 | if 'images' in self.dataset: 107 | imgs = {im['id']: {} for im in self.dataset['images']} 108 | for img in self.dataset['images']: 109 | imgs[img['id']] = img 110 | img_name_to_id[img['file_name']] = img['id'] 111 | 112 | if 'categories' in self.dataset: 113 | cats = {cat['id']: [] for cat in self.dataset['categories']} 114 | for cat in self.dataset['categories']: 115 | cats[cat['id']] = cat 116 | catToImgs = {cat['id']: [] for cat in self.dataset['categories']} 117 | for ann in self.dataset['annotations']: 118 | catToImgs[ann['category_id']] += [ann['image_id']] 119 | 120 | print 'index created!' 121 | 122 | # create class members 123 | self.anns = anns 124 | self.imgToAnns = imgToAnns 125 | self.catToImgs = catToImgs 126 | self.imgs = imgs 127 | self.cats = cats 128 | self.img_name_to_id = img_name_to_id 129 | 130 | def info(self): 131 | """ 132 | Print information about the annotation file. 133 | :return: 134 | """ 135 | for key, value in self.dataset['info'].items(): 136 | print '%s: %s'%(key, value) 137 | 138 | def getAnnIds(self, imgIds=[], catIds=[], areaRng=[], iscrowd=None): 139 | """ 140 | Get ann ids that satisfy given filter conditions. default skips that filter 141 | :param imgIds (int array) : get anns for given imgs 142 | catIds (int array) : get anns for given cats 143 | areaRng (float array) : get anns for given area range (e.g. 
[0 inf]) 144 | iscrowd (boolean) : get anns for given crowd label (False or True) 145 | :return: ids (int array) : integer array of ann ids 146 | """ 147 | imgIds = imgIds if type(imgIds) == list else [imgIds] 148 | catIds = catIds if type(catIds) == list else [catIds] 149 | 150 | if len(imgIds) == len(catIds) == len(areaRng) == 0: 151 | anns = self.dataset['annotations'] 152 | else: 153 | if not len(imgIds) == 0: 154 | # this can be changed by defaultdict 155 | lists = [self.imgToAnns[imgId] for imgId in imgIds if imgId in self.imgToAnns] 156 | anns = list(itertools.chain.from_iterable(lists)) 157 | else: 158 | anns = self.dataset['annotations'] 159 | anns = anns if len(catIds) == 0 else [ann for ann in anns if ann['category_id'] in catIds] 160 | anns = anns if len(areaRng) == 0 else [ann for ann in anns if ann['area'] > areaRng[0] and ann['area'] < areaRng[1]] 161 | if not iscrowd == None: 162 | ids = [ann['id'] for ann in anns if ann['iscrowd'] == iscrowd] 163 | else: 164 | ids = [ann['id'] for ann in anns] 165 | return ids 166 | 167 | def getCatIds(self, catNms=[], supNms=[], catIds=[]): 168 | """ 169 | filtering parameters. default skips that filter. 170 | :param catNms (str array) : get cats for given cat names 171 | :param supNms (str array) : get cats for given supercategory names 172 | :param catIds (int array) : get cats for given cat ids 173 | :return: ids (int array) : integer array of cat ids 174 | """ 175 | catNms = catNms if type(catNms) == list else [catNms] 176 | supNms = supNms if type(supNms) == list else [supNms] 177 | catIds = catIds if type(catIds) == list else [catIds] 178 | 179 | if len(catNms) == len(supNms) == len(catIds) == 0: 180 | cats = self.dataset['categories'] 181 | else: 182 | cats = self.dataset['categories'] 183 | cats = cats if len(catNms) == 0 else [cat for cat in cats if cat['name'] in catNms] 184 | cats = cats if len(supNms) == 0 else [cat for cat in cats if cat['supercategory'] in supNms] 185 | cats = cats if len(catIds) == 0 else [cat for cat in cats if cat['id'] in catIds] 186 | ids = [cat['id'] for cat in cats] 187 | return ids 188 | 189 | def getImgIds(self, imgIds=[], catIds=[]): 190 | ''' 191 | Get img ids that satisfy given filter conditions. 192 | :param imgIds (int array) : get imgs for given ids 193 | :param catIds (int array) : get imgs with all given cats 194 | :return: ids (int array) : integer array of img ids 195 | ''' 196 | imgIds = imgIds if type(imgIds) == list else [imgIds] 197 | catIds = catIds if type(catIds) == list else [catIds] 198 | 199 | if len(imgIds) == len(catIds) == 0: 200 | ids = self.imgs.keys() 201 | else: 202 | ids = set(imgIds) 203 | for i, catId in enumerate(catIds): 204 | if i == 0 and len(ids) == 0: 205 | ids = set(self.catToImgs[catId]) 206 | else: 207 | ids &= set(self.catToImgs[catId]) 208 | return list(ids) 209 | 210 | def loadAnns(self, ids=[]): 211 | """ 212 | Load anns with the specified ids. 213 | :param ids (int array) : integer ids specifying anns 214 | :return: anns (object array) : loaded ann objects 215 | """ 216 | if type(ids) == list: 217 | return [self.anns[id] for id in ids] 218 | elif type(ids) == int: 219 | return [self.anns[ids]] 220 | 221 | def loadCats(self, ids=[]): 222 | """ 223 | Load cats with the specified ids. 
224 | :param ids (int array) : integer ids specifying cats 225 | :return: cats (object array) : loaded cat objects 226 | """ 227 | if type(ids) == list: 228 | return [self.cats[id] for id in ids] 229 | elif type(ids) == int: 230 | return [self.cats[ids]] 231 | 232 | def loadImgs(self, ids=[]): 233 | """ 234 | Load anns with the specified ids. 235 | :param ids (int array) : integer ids specifying img 236 | :return: imgs (object array) : loaded img objects 237 | """ 238 | if type(ids) == list: 239 | return [self.imgs[id] for id in ids] 240 | elif type(ids) == int: 241 | return [self.imgs[ids]] 242 | 243 | def loadRes(self, resFile): 244 | """ 245 | Load result file and return a result api object. 246 | :param resFile (str) : file name of result file 247 | :return: res (obj) : result api object 248 | """ 249 | res = COCO() 250 | res.dataset['images'] = [img for img in self.dataset['images']] 251 | # res.dataset['info'] = copy.deepcopy(self.dataset['info']) 252 | # res.dataset['licenses'] = copy.deepcopy(self.dataset['licenses']) 253 | 254 | print 'Loading and preparing results... ' 255 | tic = time.time() 256 | anns = json.load(open(resFile)) 257 | assert type(anns) == list, 'results in not an array of objects' 258 | annsImgIds = [ann['image_id'] for ann in anns] 259 | assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \ 260 | 'Results do not correspond to current coco set' 261 | assert 'caption' in anns[0] 262 | imgIds = set([img['id'] for img in res.dataset['images']]) & set([ann['image_id'] for ann in anns]) 263 | res.dataset['images'] = [img for img in res.dataset['images'] if img['id'] in imgIds] 264 | for id, ann in enumerate(anns): 265 | ann['id'] = id+1 266 | print 'DONE (t=%0.2fs)'%(time.time()- tic) 267 | 268 | res.dataset['annotations'] = anns 269 | res.createIndex() 270 | return res 271 | 272 | def download( self, tarDir = None, imgIds = [] ): 273 | ''' 274 | Download COCO images from mscoco.org server. 275 | :param tarDir (str): COCO results directory name 276 | imgIds (list): images to be downloaded 277 | :return: 278 | ''' 279 | if tarDir is None: 280 | print 'Please specify target directory' 281 | return -1 282 | if len(imgIds) == 0: 283 | imgs = self.imgs.values() 284 | else: 285 | imgs = self.loadImgs(imgIds) 286 | N = len(imgs) 287 | if not os.path.exists(tarDir): 288 | os.makedirs(tarDir) 289 | for i, img in enumerate(imgs): 290 | tic = time.time() 291 | fname = os.path.join(tarDir, img['file_name']) 292 | if not os.path.exists(fname): 293 | urllib.urlretrieve(img['coco_url'], fname) 294 | print 'downloaded %d/%d images (t=%.1fs)'%(i, N, time.time()- tic) 295 | 296 | def process_dataset(self): 297 | for ann in self.dataset['annotations']: 298 | q = ann['caption'].lower() 299 | if q[-1]!='.': 300 | q = q + '.' 
301 | ann['caption'] = q 302 | 303 | def filter_by_cap_len(self, max_cap_len): 304 | print("Filtering the captions by length...") 305 | keep_ann = {} 306 | keep_img = {} 307 | for ann in tqdm(self.dataset['annotations']): 308 | if len(word_tokenize(ann['caption']))<=max_cap_len: 309 | keep_ann[ann['id']] = keep_ann.get(ann['id'], 0) + 1 310 | keep_img[ann['image_id']] = keep_img.get(ann['image_id'], 0) + 1 311 | 312 | self.dataset['annotations'] = \ 313 | [ann for ann in self.dataset['annotations'] \ 314 | if keep_ann.get(ann['id'],0)>0] 315 | self.dataset['images'] = \ 316 | [img for img in self.dataset['images'] \ 317 | if keep_img.get(img['id'],0)>0] 318 | 319 | self.createIndex() 320 | 321 | def filter_by_words(self, vocab): 322 | print("Filtering the captions by words...") 323 | keep_ann = {} 324 | keep_img = {} 325 | for ann in tqdm(self.dataset['annotations']): 326 | keep_ann[ann['id']] = 1 327 | words_in_ann = word_tokenize(ann['caption']) 328 | for word in words_in_ann: 329 | if word not in vocab: 330 | keep_ann[ann['id']] = 0 331 | break 332 | keep_img[ann['image_id']] = keep_img.get(ann['image_id'], 0) + 1 333 | 334 | self.dataset['annotations'] = \ 335 | [ann for ann in self.dataset['annotations'] \ 336 | if keep_ann.get(ann['id'],0)>0] 337 | self.dataset['images'] = \ 338 | [img for img in self.dataset['images'] \ 339 | if keep_img.get(img['id'],0)>0] 340 | 341 | self.createIndex() 342 | 343 | def all_captions(self): 344 | return [ann['caption'] for ann_id, ann in self.anns.items()] 345 | -------------------------------------------------------------------------------- /utils/coco/license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2014, Piotr Dollar and Tsung-Yi Lin 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015 Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/bleu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : bleu.py 4 | # 5 | # Description : Wrapper for BLEU scorer. 6 | # 7 | # Creation Date : 06-01-2015 8 | # Last Modified : Thu 19 Mar 2015 09:13:28 PM PDT 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | from bleu_scorer import BleuScorer 12 | 13 | 14 | class Bleu: 15 | def __init__(self, n=4): 16 | # default compute Blue score up to 4 17 | self._n = n 18 | self._hypo_for_image = {} 19 | self.ref_for_image = {} 20 | 21 | def compute_score(self, gts, res): 22 | 23 | assert(gts.keys() == res.keys()) 24 | imgIds = gts.keys() 25 | 26 | bleu_scorer = BleuScorer(n=self._n) 27 | for id in imgIds: 28 | hypo = res[id] 29 | ref = gts[id] 30 | 31 | # Sanity check. 
32 | assert(type(hypo) is list) 33 | assert(len(hypo) == 1) 34 | assert(type(ref) is list) 35 | assert(len(ref) >= 1) 36 | 37 | bleu_scorer += (hypo[0], ref) 38 | 39 | #score, scores = bleu_scorer.compute_score(option='shortest') 40 | score, scores = bleu_scorer.compute_score(option='closest', verbose=1) 41 | #score, scores = bleu_scorer.compute_score(option='average', verbose=1) 42 | 43 | # return (bleu, bleu_info) 44 | return score, scores 45 | 46 | def method(self): 47 | return "Bleu" 48 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/bleu_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # bleu_scorer.py 4 | # David Chiang 5 | 6 | # Copyright (c) 2004-2006 University of Maryland. All rights 7 | # reserved. Do not redistribute without permission from the 8 | # author. Not for commercial use. 9 | 10 | # Modified by: 11 | # Hao Fang 12 | # Tsung-Yi Lin 13 | 14 | '''Provides: 15 | cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test(). 16 | cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked(). 17 | ''' 18 | 19 | import copy 20 | import sys, math, re 21 | from collections import defaultdict 22 | 23 | def precook(s, n=4, out=False): 24 | """Takes a string as input and returns an object that can be given to 25 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 26 | can take string arguments as well.""" 27 | words = s.split() 28 | counts = defaultdict(int) 29 | for k in xrange(1,n+1): 30 | for i in xrange(len(words)-k+1): 31 | ngram = tuple(words[i:i+k]) 32 | counts[ngram] += 1 33 | return (len(words), counts) 34 | 35 | def cook_refs(refs, eff=None, n=4): ## lhuang: oracle will call with "average" 36 | '''Takes a list of reference sentences for a single segment 37 | and returns an object that encapsulates everything that BLEU 38 | needs to know about them.''' 39 | 40 | reflen = [] 41 | maxcounts = {} 42 | for ref in refs: 43 | rl, counts = precook(ref, n) 44 | reflen.append(rl) 45 | for (ngram,count) in counts.iteritems(): 46 | maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 47 | 48 | # Calculate effective reference sentence length. 49 | if eff == "shortest": 50 | reflen = min(reflen) 51 | elif eff == "average": 52 | reflen = float(sum(reflen))/len(reflen) 53 | 54 | ## lhuang: N.B.: leave reflen computaiton to the very end!! 55 | 56 | ## lhuang: N.B.: in case of "closest", keep a list of reflens!! (bad design) 57 | 58 | return (reflen, maxcounts) 59 | 60 | def cook_test(test, (reflen, refmaxcounts), eff=None, n=4): 61 | '''Takes a test sentence and returns an object that 62 | encapsulates everything that BLEU needs to know about it.''' 63 | 64 | testlen, counts = precook(test, n, True) 65 | 66 | result = {} 67 | 68 | # Calculate effective reference sentence length. 
69 | 70 | if eff == "closest": 71 | result["reflen"] = min((abs(l-testlen), l) for l in reflen)[1] 72 | else: ## i.e., "average" or "shortest" or None 73 | result["reflen"] = reflen 74 | 75 | result["testlen"] = testlen 76 | 77 | result["guess"] = [max(0,testlen-k+1) for k in xrange(1,n+1)] 78 | 79 | result['correct'] = [0]*n 80 | for (ngram, count) in counts.iteritems(): 81 | result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count) 82 | 83 | return result 84 | 85 | class BleuScorer(object): 86 | """Bleu scorer. 87 | """ 88 | 89 | __slots__ = "n", "crefs", "ctest", "_score", "_ratio", "_testlen", "_reflen", "special_reflen" 90 | # special_reflen is used in oracle (proportional effective ref len for a node). 91 | 92 | def copy(self): 93 | ''' copy the refs.''' 94 | new = BleuScorer(n=self.n) 95 | new.ctest = copy.copy(self.ctest) 96 | new.crefs = copy.copy(self.crefs) 97 | new._score = None 98 | return new 99 | 100 | def __init__(self, test=None, refs=None, n=4, special_reflen=None): 101 | ''' singular instance ''' 102 | 103 | self.n = n 104 | self.crefs = [] 105 | self.ctest = [] 106 | self.cook_append(test, refs) 107 | self.special_reflen = special_reflen 108 | 109 | def cook_append(self, test, refs): 110 | '''called by constructor and __iadd__ to avoid creating new instances.''' 111 | 112 | if refs is not None: 113 | self.crefs.append(cook_refs(refs)) 114 | if test is not None: 115 | cooked_test = cook_test(test, self.crefs[-1]) 116 | self.ctest.append(cooked_test) ## N.B.: -1 117 | else: 118 | self.ctest.append(None) # lens of crefs and ctest have to match 119 | 120 | self._score = None ## need to recompute 121 | 122 | def ratio(self, option=None): 123 | self.compute_score(option=option) 124 | return self._ratio 125 | 126 | def score_ratio(self, option=None): 127 | '''return (bleu, len_ratio) pair''' 128 | return (self.fscore(option=option), self.ratio(option=option)) 129 | 130 | def score_ratio_str(self, option=None): 131 | return "%.4f (%.2f)" % self.score_ratio(option) 132 | 133 | def reflen(self, option=None): 134 | self.compute_score(option=option) 135 | return self._reflen 136 | 137 | def testlen(self, option=None): 138 | self.compute_score(option=option) 139 | return self._testlen 140 | 141 | def retest(self, new_test): 142 | if type(new_test) is str: 143 | new_test = [new_test] 144 | assert len(new_test) == len(self.crefs), new_test 145 | self.ctest = [] 146 | for t, rs in zip(new_test, self.crefs): 147 | self.ctest.append(cook_test(t, rs)) 148 | self._score = None 149 | 150 | return self 151 | 152 | def rescore(self, new_test): 153 | ''' replace test(s) with new test(s), and returns the new score.''' 154 | 155 | return self.retest(new_test).compute_score() 156 | 157 | def size(self): 158 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 159 | return len(self.crefs) 160 | 161 | def __iadd__(self, other): 162 | '''add an instance (e.g., from another sentence).''' 163 | 164 | if type(other) is tuple: 165 | ## avoid creating new BleuScorer instances 166 | self.cook_append(other[0], other[1]) 167 | else: 168 | assert self.compatible(other), "incompatible BLEUs." 
169 | self.ctest.extend(other.ctest) 170 | self.crefs.extend(other.crefs) 171 | self._score = None ## need to recompute 172 | 173 | return self 174 | 175 | def compatible(self, other): 176 | return isinstance(other, BleuScorer) and self.n == other.n 177 | 178 | def single_reflen(self, option="average"): 179 | return self._single_reflen(self.crefs[0][0], option) 180 | 181 | def _single_reflen(self, reflens, option=None, testlen=None): 182 | 183 | if option == "shortest": 184 | reflen = min(reflens) 185 | elif option == "average": 186 | reflen = float(sum(reflens))/len(reflens) 187 | elif option == "closest": 188 | reflen = min((abs(l-testlen), l) for l in reflens)[1] 189 | else: 190 | assert False, "unsupported reflen option %s" % option 191 | 192 | return reflen 193 | 194 | def recompute_score(self, option=None, verbose=0): 195 | self._score = None 196 | return self.compute_score(option, verbose) 197 | 198 | def compute_score(self, option=None, verbose=0): 199 | n = self.n 200 | small = 1e-9 201 | tiny = 1e-15 ## so that if guess is 0 still return 0 202 | bleu_list = [[] for _ in range(n)] 203 | 204 | if self._score is not None: 205 | return self._score 206 | 207 | if option is None: 208 | option = "average" if len(self.crefs) == 1 else "closest" 209 | 210 | self._testlen = 0 211 | self._reflen = 0 212 | totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n} 213 | 214 | # for each sentence 215 | for comps in self.ctest: 216 | testlen = comps['testlen'] 217 | self._testlen += testlen 218 | 219 | if self.special_reflen is None: ## need computation 220 | reflen = self._single_reflen(comps['reflen'], option, testlen) 221 | else: 222 | reflen = self.special_reflen 223 | 224 | self._reflen += reflen 225 | 226 | for key in ['guess','correct']: 227 | for k in xrange(n): 228 | totalcomps[key][k] += comps[key][k] 229 | 230 | # append per image bleu score 231 | bleu = 1. 232 | for k in xrange(n): 233 | bleu *= (float(comps['correct'][k]) + tiny) \ 234 | /(float(comps['guess'][k]) + small) 235 | bleu_list[k].append(bleu ** (1./(k+1))) 236 | ratio = (testlen + tiny) / (reflen + small) ## N.B.: avoid zero division 237 | if ratio < 1: 238 | for k in xrange(n): 239 | bleu_list[k][-1] *= math.exp(1 - 1/ratio) 240 | 241 | if verbose > 1: 242 | print comps, reflen 243 | 244 | totalcomps['reflen'] = self._reflen 245 | totalcomps['testlen'] = self._testlen 246 | 247 | bleus = [] 248 | bleu = 1. 
249 | for k in xrange(n): 250 | bleu *= float(totalcomps['correct'][k] + tiny) \ 251 | / (totalcomps['guess'][k] + small) 252 | bleus.append(bleu ** (1./(k+1))) 253 | ratio = (self._testlen + tiny) / (self._reflen + small) ## N.B.: avoid zero division 254 | if ratio < 1: 255 | for k in xrange(n): 256 | bleus[k] *= math.exp(1 - 1/ratio) 257 | 258 | if verbose > 0: 259 | print totalcomps 260 | print "ratio:", ratio 261 | 262 | self._score = bleus 263 | return self._score, bleu_list 264 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # Description: Describes the class to compute the CIDEr (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | 10 | from cider_scorer import CiderScorer 11 | import pdb 12 | 13 | class Cider: 14 | """ 15 | Main Class to compute the CIDEr metric 16 | 17 | """ 18 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 19 | # set cider to sum over 1 to 4-grams 20 | self._n = n 21 | # set the standard deviation parameter for gaussian penalty 22 | self._sigma = sigma 23 | 24 | def compute_score(self, gts, res): 25 | """ 26 | Main function to compute CIDEr score 27 | :param hypo_for_image (dict) : dictionary with key and value 28 | ref_for_image (dict) : dictionary with key and value 29 | :return: cider (float) : computed CIDEr score for the corpus 30 | """ 31 | 32 | assert(gts.keys() == res.keys()) 33 | imgIds = gts.keys() 34 | 35 | cider_scorer = CiderScorer(n=self._n, sigma=self._sigma) 36 | 37 | for id in imgIds: 38 | hypo = res[id] 39 | ref = gts[id] 40 | 41 | # Sanity check. 42 | assert(type(hypo) is list) 43 | assert(len(hypo) == 1) 44 | assert(type(ref) is list) 45 | assert(len(ref) > 0) 46 | 47 | cider_scorer += (hypo[0], ref) 48 | 49 | (score, scores) = cider_scorer.compute_score() 50 | 51 | return score, scores 52 | 53 | def method(self): 54 | return "CIDEr" -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | 5 | import copy 6 | from collections import defaultdict 7 | import numpy as np 8 | import pdb 9 | import math 10 | 11 | def precook(s, n=4, out=False): 12 | """ 13 | Takes a string as input and returns an object that can be given to 14 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 15 | can take string arguments as well. 
16 | :param s: string : sentence to be converted into ngrams 17 | :param n: int : number of ngrams for which representation is calculated 18 | :return: term frequency vector for occuring ngrams 19 | """ 20 | words = s.split() 21 | counts = defaultdict(int) 22 | for k in xrange(1,n+1): 23 | for i in xrange(len(words)-k+1): 24 | ngram = tuple(words[i:i+k]) 25 | counts[ngram] += 1 26 | return counts 27 | 28 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 29 | '''Takes a list of reference sentences for a single segment 30 | and returns an object that encapsulates everything that BLEU 31 | needs to know about them. 32 | :param refs: list of string : reference sentences for some image 33 | :param n: int : number of ngrams for which (ngram) representation is calculated 34 | :return: result (list of dict) 35 | ''' 36 | return [precook(ref, n) for ref in refs] 37 | 38 | def cook_test(test, n=4): 39 | '''Takes a test sentence and returns an object that 40 | encapsulates everything that BLEU needs to know about it. 41 | :param test: list of string : hypothesis sentence for some image 42 | :param n: int : number of ngrams for which (ngram) representation is calculated 43 | :return: result (dict) 44 | ''' 45 | return precook(test, n, True) 46 | 47 | class CiderScorer(object): 48 | """CIDEr scorer. 49 | """ 50 | 51 | def copy(self): 52 | ''' copy the refs.''' 53 | new = CiderScorer(n=self.n) 54 | new.ctest = copy.copy(self.ctest) 55 | new.crefs = copy.copy(self.crefs) 56 | return new 57 | 58 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 59 | ''' singular instance ''' 60 | self.n = n 61 | self.sigma = sigma 62 | self.crefs = [] 63 | self.ctest = [] 64 | self.document_frequency = defaultdict(float) 65 | self.cook_append(test, refs) 66 | self.ref_len = None 67 | 68 | def cook_append(self, test, refs): 69 | '''called by constructor and __iadd__ to avoid creating new instances.''' 70 | 71 | if refs is not None: 72 | self.crefs.append(cook_refs(refs)) 73 | if test is not None: 74 | self.ctest.append(cook_test(test)) ## N.B.: -1 75 | else: 76 | self.ctest.append(None) # lens of crefs and ctest have to match 77 | 78 | def size(self): 79 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 80 | return len(self.crefs) 81 | 82 | def __iadd__(self, other): 83 | '''add an instance (e.g., from another sentence).''' 84 | 85 | if type(other) is tuple: 86 | ## avoid creating new CiderScorer instances 87 | self.cook_append(other[0], other[1]) 88 | else: 89 | self.ctest.extend(other.ctest) 90 | self.crefs.extend(other.crefs) 91 | 92 | return self 93 | def compute_doc_freq(self): 94 | ''' 95 | Compute term frequency for reference data. 96 | This will be used to compute idf (inverse document frequency later) 97 | The term frequency is stored in the object 98 | :return: None 99 | ''' 100 | for refs in self.crefs: 101 | # refs, k ref captions of one image 102 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.iteritems()]): 103 | self.document_frequency[ngram] += 1 104 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 105 | 106 | def compute_cider(self): 107 | def counts2vec(cnts): 108 | """ 109 | Function maps counts of ngram to vector of tfidf weights. 110 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 111 | The n-th entry of array denotes length of n-grams. 
112 | :param cnts: 113 | :return: vec (array of dict), norm (array of float), length (int) 114 | """ 115 | vec = [defaultdict(float) for _ in range(self.n)] 116 | length = 0 117 | norm = [0.0 for _ in range(self.n)] 118 | for (ngram,term_freq) in cnts.iteritems(): 119 | # give word count 1 if it doesn't appear in reference corpus 120 | df = np.log(max(1.0, self.document_frequency[ngram])) 121 | # ngram index 122 | n = len(ngram)-1 123 | # tf (term_freq) * idf (precomputed idf) for n-grams 124 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 125 | # compute norm for the vector. the norm will be used for computing similarity 126 | norm[n] += pow(vec[n][ngram], 2) 127 | 128 | if n == 1: 129 | length += term_freq 130 | norm = [np.sqrt(n) for n in norm] 131 | return vec, norm, length 132 | 133 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 134 | ''' 135 | Compute the cosine similarity of two vectors. 136 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 137 | :param vec_ref: array of dictionary for vector corresponding to reference 138 | :param norm_hyp: array of float for vector corresponding to hypothesis 139 | :param norm_ref: array of float for vector corresponding to reference 140 | :param length_hyp: int containing length of hypothesis 141 | :param length_ref: int containing length of reference 142 | :return: array of score for each n-grams cosine similarity 143 | ''' 144 | delta = float(length_hyp - length_ref) 145 | # measure consine similarity 146 | val = np.array([0.0 for _ in range(self.n)]) 147 | for n in range(self.n): 148 | # ngram 149 | for (ngram,count) in vec_hyp[n].iteritems(): 150 | # vrama91 : added clipping 151 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 152 | 153 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 154 | val[n] /= (norm_hyp[n]*norm_ref[n]) 155 | 156 | assert(not math.isnan(val[n])) 157 | # vrama91: added a length based gaussian penalty 158 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 159 | return val 160 | 161 | # compute log reference length 162 | self.ref_len = np.log(float(len(self.crefs))) 163 | 164 | scores = [] 165 | for test, refs in zip(self.ctest, self.crefs): 166 | # compute vector for test captions 167 | vec, norm, length = counts2vec(test) 168 | # compute vector for ref captions 169 | score = np.array([0.0 for _ in range(self.n)]) 170 | for ref in refs: 171 | vec_ref, norm_ref, length_ref = counts2vec(ref) 172 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 173 | # change by vrama91 - mean of ngram scores, instead of sum 174 | score_avg = np.mean(score) 175 | # divide by number of references 176 | score_avg /= len(refs) 177 | # multiply score by 10 178 | score_avg *= 10.0 179 | # append score of an image to the score list 180 | scores.append(score_avg) 181 | return scores 182 | 183 | def compute_score(self, option=None, verbose=0): 184 | # compute idf 185 | self.compute_doc_freq() 186 | # assert to check document frequency 187 | assert(len(self.ctest) >= max(self.document_frequency.values())) 188 | # compute cider score 189 | score = self.compute_cider() 190 | # debug 191 | # print score 192 | return np.mean(np.array(score)), np.array(score) -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/eval.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | from tokenizer.ptbtokenizer import PTBTokenizer 3 | from bleu.bleu import Bleu 
4 | from meteor.meteor import Meteor 5 | from rouge.rouge import Rouge 6 | from cider.cider import Cider 7 | 8 | class COCOEvalCap: 9 | def __init__(self, coco, cocoRes): 10 | self.evalImgs = [] 11 | self.eval = {} 12 | self.imgToEval = {} 13 | self.coco = coco 14 | self.cocoRes = cocoRes 15 | self.params = {'image_id': coco.getImgIds()} 16 | 17 | def evaluate(self): 18 | imgIds = self.params['image_id'] 19 | # imgIds = self.coco.getImgIds() 20 | gts = {} 21 | res = {} 22 | for imgId in imgIds: 23 | gts[imgId] = self.coco.imgToAnns[imgId] 24 | res[imgId] = self.cocoRes.imgToAnns[imgId] 25 | 26 | # ================================================= 27 | # Set up scorers 28 | # ================================================= 29 | print 'tokenization...' 30 | tokenizer = PTBTokenizer() 31 | gts = tokenizer.tokenize(gts) 32 | res = tokenizer.tokenize(res) 33 | 34 | # ================================================= 35 | # Set up scorers 36 | # ================================================= 37 | print 'setting up scorers...' 38 | scorers = [ 39 | (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), 40 | (Meteor(),"METEOR"), 41 | (Rouge(), "ROUGE_L"), 42 | (Cider(), "CIDEr") 43 | ] 44 | 45 | # ================================================= 46 | # Compute scores 47 | # ================================================= 48 | for scorer, method in scorers: 49 | print 'computing %s score...'%(scorer.method()) 50 | score, scores = scorer.compute_score(gts, res) 51 | if type(method) == list: 52 | for sc, scs, m in zip(score, scores, method): 53 | self.setEval(sc, m) 54 | self.setImgToEvalImgs(scs, gts.keys(), m) 55 | print "%s: %0.3f"%(m, sc) 56 | else: 57 | self.setEval(score, method) 58 | self.setImgToEvalImgs(scores, gts.keys(), method) 59 | print "%s: %0.3f"%(method, score) 60 | self.setEvalImgs() 61 | 62 | def setEval(self, score, method): 63 | self.eval[method] = score 64 | 65 | def setImgToEvalImgs(self, scores, imgIds, method): 66 | for imgId, score in zip(imgIds, scores): 67 | if not imgId in self.imgToEval: 68 | self.imgToEval[imgId] = {} 69 | self.imgToEval[imgId]["image_id"] = imgId 70 | self.imgToEval[imgId][method] = score 71 | 72 | def setEvalImgs(self): 73 | self.evalImgs = [eval for imgId, eval in self.imgToEval.items()] -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/data/paraphrase-en.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/meteor/data/paraphrase-en.gz -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/meteor-1.5.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/meteor/meteor-1.5.jar -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/meteor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Python wrapper for METEOR implementation, by Xinlei Chen 4 | # 
Acknowledge Michael Denkowski for the generous discussion and help 5 | 6 | import os 7 | import sys 8 | import subprocess 9 | import threading 10 | 11 | # Assumes meteor-1.5.jar is in the same directory as meteor.py. Change as needed. 12 | METEOR_JAR = 'meteor-1.5.jar' 13 | # print METEOR_JAR 14 | 15 | class Meteor: 16 | 17 | def __init__(self): 18 | self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, \ 19 | '-', '-', '-stdio', '-l', 'en', '-norm'] 20 | self.meteor_p = subprocess.Popen(self.meteor_cmd, \ 21 | cwd=os.path.dirname(os.path.abspath(__file__)), \ 22 | stdin=subprocess.PIPE, \ 23 | stdout=subprocess.PIPE, \ 24 | stderr=subprocess.PIPE) 25 | # Used to guarantee thread safety 26 | self.lock = threading.Lock() 27 | 28 | def compute_score(self, gts, res): 29 | assert(gts.keys() == res.keys()) 30 | imgIds = gts.keys() 31 | scores = [] 32 | 33 | eval_line = 'EVAL' 34 | self.lock.acquire() 35 | for i in imgIds: 36 | assert(len(res[i]) == 1) 37 | stat = self._stat(res[i][0], gts[i]) 38 | eval_line += ' ||| {}'.format(stat) 39 | 40 | self.meteor_p.stdin.write('{}\n'.format(eval_line)) 41 | for i in range(0,len(imgIds)): 42 | scores.append(float(self.meteor_p.stdout.readline().strip())) 43 | score = float(self.meteor_p.stdout.readline().strip()) 44 | self.lock.release() 45 | 46 | return score, scores 47 | 48 | def method(self): 49 | return "METEOR" 50 | 51 | def _stat(self, hypothesis_str, reference_list): 52 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 53 | hypothesis_str = hypothesis_str.replace('|||','').replace(' ',' ') 54 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 55 | self.meteor_p.stdin.write('{}\n'.format(score_line)) 56 | return self.meteor_p.stdout.readline().strip() 57 | 58 | def _score(self, hypothesis_str, reference_list): 59 | self.lock.acquire() 60 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 61 | hypothesis_str = hypothesis_str.replace('|||','').replace(' ',' ') 62 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 63 | self.meteor_p.stdin.write('{}\n'.format(score_line)) 64 | stats = self.meteor_p.stdout.readline().strip() 65 | eval_line = 'EVAL ||| {}'.format(stats) 66 | # EVAL ||| stats 67 | self.meteor_p.stdin.write('{}\n'.format(eval_line)) 68 | score = float(self.meteor_p.stdout.readline().strip()) 69 | # bug fix: there are two values returned by the jar file, one average, and one all, so do it twice 70 | # thanks for Andrej for pointing this out 71 | score = float(self.meteor_p.stdout.readline().strip()) 72 | self.lock.release() 73 | return score 74 | 75 | def __exit__(self): 76 | self.lock.acquire() 77 | self.meteor_p.stdin.close() 78 | self.meteor_p.kill() 79 | self.meteor_p.wait() 80 | self.lock.release() 81 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/readme.md: -------------------------------------------------------------------------------- 1 | This is the MS COCO caption evaluation API downloaded from https://github.com/tylin/coco-caption. 
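A minimal usage sketch of the evaluation entry point defined in `eval.py` above. The import paths, the annotation/result file names, and the `loadRes` helper (part of the standard COCO API, whose modified copy ships in `utils/coco/coco.py`) are assumptions for illustration; only `COCOEvalCap`, `evaluate()`, and the `eval` dict come from the code in this package:

```python
# Hedged sketch: file paths and the loadRes helper are assumed, not shipped here.
from utils.coco.coco import COCO
from utils.coco.pycocoevalcap.eval import COCOEvalCap

coco = COCO('val/captions_val2014.json')      # ground-truth captions (assumed path)
coco_res = coco.loadRes('val/results.json')   # generated captions (assumed path)

evaluator = COCOEvalCap(coco, coco_res)
evaluator.evaluate()                          # tokenizes, then scores Bleu_1..4, METEOR, ROUGE_L, CIDEr
for metric, score in evaluator.eval.items():
    print('%s: %.3f' % (metric, score))
```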
2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/rouge/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'vrama91' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/rouge/rouge.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : rouge.py 4 | # 5 | # Description : Computes ROUGE-L metric as described by Lin and Hovey (2004) 6 | # 7 | # Creation Date : 2015-01-07 06:03 8 | # Author : Ramakrishna Vedantam 9 | 10 | import numpy as np 11 | import pdb 12 | 13 | def my_lcs(string, sub): 14 | """ 15 | Calculates longest common subsequence for a pair of tokenized strings 16 | :param string : list of str : tokens from a string split using whitespace 17 | :param sub : list of str : shorter string, also split using whitespace 18 | :returns: length (list of int): length of the longest common subsequence between the two strings 19 | 20 | Note: my_lcs only gives length of the longest common subsequence, not the actual LCS 21 | """ 22 | if(len(string)< len(sub)): 23 | sub, string = string, sub 24 | 25 | lengths = [[0 for i in range(0,len(sub)+1)] for j in range(0,len(string)+1)] 26 | 27 | for j in range(1,len(sub)+1): 28 | for i in range(1,len(string)+1): 29 | if(string[i-1] == sub[j-1]): 30 | lengths[i][j] = lengths[i-1][j-1] + 1 31 | else: 32 | lengths[i][j] = max(lengths[i-1][j] , lengths[i][j-1]) 33 | 34 | return lengths[len(string)][len(sub)] 35 | 36 | class Rouge(): 37 | ''' 38 | Class for computing ROUGE-L score for a set of candidate sentences for the MS COCO test set 39 | 40 | ''' 41 | def __init__(self): 42 | # vrama91: updated the value below based on discussion with Hovey 43 | self.beta = 1.2 44 | 45 | def calc_score(self, candidate, refs): 46 | """ 47 | Compute ROUGE-L score given one candidate and references for an image 48 | :param candidate: str : candidate sentence to be evaluated 49 | :param refs: list of str : COCO reference sentences for the particular image to be evaluated 50 | :returns score: int (ROUGE-L score for the candidate evaluated against references) 51 | """ 52 | assert(len(candidate)==1) 53 | assert(len(refs)>0) 54 | prec = [] 55 | rec = [] 56 | 57 | # split into tokens 58 | token_c = candidate[0].split(" ") 59 | 60 | for reference in refs: 61 | # split into tokens 62 | token_r = reference.split(" ") 63 | # compute the longest common subsequence 64 | lcs = my_lcs(token_r, token_c) 65 | prec.append(lcs/float(len(token_c))) 66 | rec.append(lcs/float(len(token_r))) 67 | 68 | prec_max = max(prec) 69 | rec_max = max(rec) 70 | 71 | if(prec_max!=0 and rec_max !=0): 72 | score = ((1 + self.beta**2)*prec_max*rec_max)/float(rec_max + self.beta**2*prec_max) 73 | else: 74 | score = 0.0 75 | return score 76 | 77 | def compute_score(self, gts, res): 78 | """ 79 | Computes Rouge-L score given a set of reference and candidate sentences for the dataset 80 | Invoked by evaluate_captions.py 81 | :param hypo_for_image: dict : candidate / test sentences with "image name" key and "tokenized sentences" as values 82 | :param ref_for_image: dict : reference MS-COCO sentences with "image name" key and "tokenized sentences" as values 83 | :returns: average_score: float (mean ROUGE-L score computed by averaging scores for all the images) 84 | """ 85 | assert(gts.keys() == res.keys()) 86 | imgIds = gts.keys() 87 | 88 | score = [] 89 | for 
id in imgIds: 90 | hypo = res[id] 91 | ref = gts[id] 92 | 93 | score.append(self.calc_score(hypo, ref)) 94 | 95 | # Sanity check. 96 | assert(type(hypo) is list) 97 | assert(len(hypo) == 1) 98 | assert(type(ref) is list) 99 | assert(len(ref) > 0) 100 | 101 | average_score = np.mean(np.array(score)) 102 | return average_score, np.array(score) 103 | 104 | def method(self): 105 | return "Rouge" 106 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'hfang' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : ptbtokenizer.py 4 | # 5 | # Description : Do the PTB Tokenization and remove punctuations. 6 | # 7 | # Creation Date : 29-12-2014 8 | # Last Modified : Thu Mar 19 09:53:35 2015 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | import os 12 | import sys 13 | import subprocess 14 | import tempfile 15 | import itertools 16 | 17 | # path to the stanford corenlp jar 18 | STANFORD_CORENLP_3_4_1_JAR = 'stanford-corenlp-3.4.1.jar' 19 | 20 | # punctuations to be removed from the sentences 21 | PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \ 22 | ".", "?", "!", ",", ":", "-", "--", "...", ";"] 23 | 24 | class PTBTokenizer: 25 | """Python wrapper of Stanford PTBTokenizer""" 26 | 27 | def tokenize(self, captions_for_image): 28 | cmd = ['java', '-cp', STANFORD_CORENLP_3_4_1_JAR, \ 29 | 'edu.stanford.nlp.process.PTBTokenizer', \ 30 | '-preserveLines', '-lowerCase'] 31 | 32 | # ====================================================== 33 | # prepare data for PTB Tokenizer 34 | # ====================================================== 35 | final_tokenized_captions_for_image = {} 36 | image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))] 37 | sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v]) 38 | 39 | # ====================================================== 40 | # save sentences to temporary file 41 | # ====================================================== 42 | path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__)) 43 | tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname) 44 | tmp_file.write(sentences) 45 | tmp_file.close() 46 | 47 | # ====================================================== 48 | # tokenize sentence 49 | # ====================================================== 50 | cmd.append(os.path.basename(tmp_file.name)) 51 | p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \ 52 | stdout=subprocess.PIPE) 53 | token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0] 54 | lines = token_lines.split('\n') 55 | # remove temp file 56 | os.remove(tmp_file.name) 57 | 58 | # ====================================================== 59 | # create dictionary for tokenized captions 60 | # ====================================================== 61 | for k, line in zip(image_id, lines): 62 | if not k in final_tokenized_captions_for_image: 63 | final_tokenized_captions_for_image[k] = [] 64 | tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \ 65 | if w not in PUNCTUATIONS]) 66 | final_tokenized_captions_for_image[k].append(tokenized_caption) 67 | 68 | return 
final_tokenized_captions_for_image 69 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar -------------------------------------------------------------------------------- /utils/coco/readme.md: -------------------------------------------------------------------------------- 1 | This is the MS COCO API downloaded from https://github.com/pdollar/coco. I have slightly modified it for convenience reasons. 2 | -------------------------------------------------------------------------------- /utils/ilsvrc_2012_mean.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/ilsvrc_2012_mean.npy -------------------------------------------------------------------------------- /utils/misc.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import cv2 4 | import heapq 5 | 6 | class ImageLoader(object): 7 | def __init__(self, mean_file): 8 | self.bgr = True 9 | self.scale_shape = np.array([224, 224], np.int32) 10 | self.crop_shape = np.array([224, 224], np.int32) 11 | self.mean = np.load(mean_file).mean(1).mean(1) 12 | 13 | def load_image(self, image_file): 14 | """ Load and preprocess an image. """ 15 | image = cv2.imread(image_file) 16 | 17 | if self.bgr: 18 | temp = image.swapaxes(0, 2) 19 | temp = temp[::-1] 20 | image = temp.swapaxes(0, 2) 21 | 22 | image = cv2.resize(image, (self.scale_shape[0], self.scale_shape[1])) 23 | offset = (self.scale_shape - self.crop_shape) / 2 24 | offset = offset.astype(np.int32) 25 | image = image[offset[0]:offset[0]+self.crop_shape[0], 26 | offset[1]:offset[1]+self.crop_shape[1]] 27 | image = image - self.mean 28 | return image 29 | 30 | def load_images(self, image_files): 31 | """ Load and preprocess a list of images. 
""" 32 | images = [] 33 | for image_file in image_files: 34 | images.append(self.load_image(image_file)) 35 | images = np.array(images, np.float32) 36 | return images 37 | 38 | class CaptionData(object): 39 | def __init__(self, sentence, memory, output, score): 40 | self.sentence = sentence 41 | self.memory = memory 42 | self.output = output 43 | self.score = score 44 | 45 | def __cmp__(self, other): 46 | assert isinstance(other, CaptionData) 47 | if self.score == other.score: 48 | return 0 49 | elif self.score < other.score: 50 | return -1 51 | else: 52 | return 1 53 | 54 | def __lt__(self, other): 55 | assert isinstance(other, CaptionData) 56 | return self.score < other.score 57 | 58 | def __eq__(self, other): 59 | assert isinstance(other, CaptionData) 60 | return self.score == other.score 61 | 62 | class TopN(object): 63 | def __init__(self, n): 64 | self._n = n 65 | self._data = [] 66 | 67 | def size(self): 68 | assert self._data is not None 69 | return len(self._data) 70 | 71 | def push(self, x): 72 | assert self._data is not None 73 | if len(self._data) < self._n: 74 | heapq.heappush(self._data, x) 75 | else: 76 | heapq.heappushpop(self._data, x) 77 | 78 | def extract(self, sort=False): 79 | assert self._data is not None 80 | data = self._data 81 | self._data = None 82 | if sort: 83 | data.sort(reverse=True) 84 | return data 85 | 86 | def reset(self): 87 | self._data = [] 88 | -------------------------------------------------------------------------------- /utils/nn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow.contrib.layers as layers 3 | 4 | class NN(object): 5 | def __init__(self, config): 6 | self.config = config 7 | self.is_train = True if config.phase == 'train' else False 8 | self.train_cnn = self.is_train and config.train_cnn 9 | self.prepare() 10 | 11 | def prepare(self): 12 | """ Setup the weight initalizers and regularizers. """ 13 | config = self.config 14 | 15 | self.conv_kernel_initializer = layers.xavier_initializer() 16 | 17 | if self.train_cnn and config.conv_kernel_regularizer_scale > 0: 18 | self.conv_kernel_regularizer = layers.l2_regularizer( 19 | scale = config.conv_kernel_regularizer_scale) 20 | else: 21 | self.conv_kernel_regularizer = None 22 | 23 | if self.train_cnn and config.conv_activity_regularizer_scale > 0: 24 | self.conv_activity_regularizer = layers.l1_regularizer( 25 | scale = config.conv_activity_regularizer_scale) 26 | else: 27 | self.conv_activity_regularizer = None 28 | 29 | self.fc_kernel_initializer = tf.random_uniform_initializer( 30 | minval = -config.fc_kernel_initializer_scale, 31 | maxval = config.fc_kernel_initializer_scale) 32 | 33 | if self.is_train and config.fc_kernel_regularizer_scale > 0: 34 | self.fc_kernel_regularizer = layers.l2_regularizer( 35 | scale = config.fc_kernel_regularizer_scale) 36 | else: 37 | self.fc_kernel_regularizer = None 38 | 39 | if self.is_train and config.fc_activity_regularizer_scale > 0: 40 | self.fc_activity_regularizer = layers.l1_regularizer( 41 | scale = config.fc_activity_regularizer_scale) 42 | else: 43 | self.fc_activity_regularizer = None 44 | 45 | def conv2d(self, 46 | inputs, 47 | filters, 48 | kernel_size = (3, 3), 49 | strides = (1, 1), 50 | activation = tf.nn.relu, 51 | use_bias = True, 52 | name = None): 53 | """ 2D Convolution layer. 
""" 54 | if activation is not None: 55 | activity_regularizer = self.conv_activity_regularizer 56 | else: 57 | activity_regularizer = None 58 | return tf.layers.conv2d( 59 | inputs = inputs, 60 | filters = filters, 61 | kernel_size = kernel_size, 62 | strides = strides, 63 | padding='same', 64 | activation = activation, 65 | use_bias = use_bias, 66 | trainable = self.train_cnn, 67 | kernel_initializer = self.conv_kernel_initializer, 68 | kernel_regularizer = self.conv_kernel_regularizer, 69 | activity_regularizer = activity_regularizer, 70 | name = name) 71 | 72 | def max_pool2d(self, 73 | inputs, 74 | pool_size = (2, 2), 75 | strides = (2, 2), 76 | name = None): 77 | """ 2D Max Pooling layer. """ 78 | return tf.layers.max_pooling2d( 79 | inputs = inputs, 80 | pool_size = pool_size, 81 | strides = strides, 82 | padding='same', 83 | name = name) 84 | 85 | def dense(self, 86 | inputs, 87 | units, 88 | activation = tf.tanh, 89 | use_bias = True, 90 | name = None): 91 | """ Fully-connected layer. """ 92 | if activation is not None: 93 | activity_regularizer = self.fc_activity_regularizer 94 | else: 95 | activity_regularizer = None 96 | return tf.layers.dense( 97 | inputs = inputs, 98 | units = units, 99 | activation = activation, 100 | use_bias = use_bias, 101 | trainable = self.is_train, 102 | kernel_initializer = self.fc_kernel_initializer, 103 | kernel_regularizer = self.fc_kernel_regularizer, 104 | activity_regularizer = activity_regularizer, 105 | name = name) 106 | 107 | def dropout(self, 108 | inputs, 109 | name = None): 110 | """ Dropout layer. """ 111 | return tf.layers.dropout( 112 | inputs = inputs, 113 | rate = self.config.fc_drop_rate, 114 | training = self.is_train) 115 | 116 | def batch_norm(self, 117 | inputs, 118 | name = None): 119 | """ Batch normalization layer. """ 120 | return tf.layers.batch_normalization( 121 | inputs = inputs, 122 | training = self.train_cnn, 123 | trainable = self.train_cnn, 124 | name = name 125 | ) 126 | -------------------------------------------------------------------------------- /utils/vocabulary.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | from tqdm import tqdm 5 | import string 6 | from nltk.tokenize import word_tokenize 7 | 8 | class Vocabulary(object): 9 | def __init__(self, size, save_file=None): 10 | self.words = [] 11 | self.word2idx = {} 12 | self.word_frequencies = [] 13 | self.size = size 14 | if save_file is not None: 15 | self.load(save_file) 16 | 17 | def build(self, sentences): 18 | """ Build the vocabulary and compute the frequency of each word. 
""" 19 | word_counts = {} 20 | for sentence in tqdm(sentences): 21 | for w in word_tokenize(sentence.lower()): 22 | word_counts[w] = word_counts.get(w, 0) + 1.0 23 | 24 | assert self.size-1 <= len(word_counts.keys()) 25 | self.words.append('') 26 | self.word2idx[''] = 0 27 | self.word_frequencies.append(1.0) 28 | 29 | word_counts = sorted(list(word_counts.items()), 30 | key=lambda x: x[1], 31 | reverse=True) 32 | 33 | for idx in range(self.size-1): 34 | word, frequency = word_counts[idx] 35 | self.words.append(word) 36 | self.word2idx[word] = idx + 1 37 | self.word_frequencies.append(frequency) 38 | 39 | self.word_frequencies = np.array(self.word_frequencies) 40 | self.word_frequencies /= np.sum(self.word_frequencies) 41 | self.word_frequencies = np.log(self.word_frequencies) 42 | self.word_frequencies -= np.max(self.word_frequencies) 43 | 44 | def process_sentence(self, sentence): 45 | """ Tokenize a sentence, and translate each token into its index 46 | in the vocabulary. """ 47 | words = word_tokenize(sentence.lower()) 48 | word_idxs = [self.word2idx[w] for w in words] 49 | return word_idxs 50 | 51 | def get_sentence(self, idxs): 52 | """ Translate a vector of indicies into a sentence. """ 53 | words = [self.words[i] for i in idxs] 54 | if words[-1] != '.': 55 | words.append('.') 56 | length = np.argmax(np.array(words)=='.') + 1 57 | words = words[:length] 58 | sentence = "".join([" "+w if not w.startswith("'") \ 59 | and w not in string.punctuation \ 60 | else w for w in words]).strip() 61 | return sentence 62 | 63 | def save(self, save_file): 64 | """ Save the vocabulary to a file. """ 65 | data = pd.DataFrame({'word': self.words, 66 | 'index': list(range(self.size)), 67 | 'frequency': self.word_frequencies}) 68 | data.to_csv(save_file) 69 | 70 | def load(self, save_file): 71 | """ Load the vocabulary from a file. """ 72 | assert os.path.exists(save_file) 73 | data = pd.read_csv(save_file) 74 | self.words = data['word'].values 75 | self.word2idx = {self.words[i]:i for i in range(self.size)} 76 | self.word_frequencies = data['frequency'].values 77 | -------------------------------------------------------------------------------- /val/images/readme: -------------------------------------------------------------------------------- 1 | Put the COCO val2014 images here. 2 | -------------------------------------------------------------------------------- /val/readme: -------------------------------------------------------------------------------- 1 | Put the file captions_val2014.json here. 2 | --------------------------------------------------------------------------------