├── LICENSE.md ├── README.md ├── base_model.py ├── config.py ├── dataset.py ├── eval.sh ├── examples ├── COCO_val2014_000000018295_result.jpg ├── COCO_val2014_000000072776_result.jpg ├── COCO_val2014_000000153130_result.jpg ├── COCO_val2014_000000214274_result.jpg ├── COCO_val2014_000000222261_result.jpg ├── COCO_val2014_000000261185_result.jpg ├── COCO_val2014_000000370315_result.jpg ├── COCO_val2014_000000535467_result.jpg └── examples.jpg ├── main.py ├── model.py ├── models ├── readme └── trim_model.py ├── summary └── readme ├── test ├── images │ ├── 1.jpg │ ├── 2.jpg │ └── 3.jpg └── results │ ├── 1_result.jpg │ ├── 2_result.jpg │ └── 3_result.jpg ├── train ├── images │ └── readme └── readme ├── utils ├── __init__.py ├── coco │ ├── __init__.py │ ├── coco.py │ ├── license.txt │ ├── pycocoevalcap │ │ ├── __init__.py │ │ ├── bleu │ │ │ ├── LICENSE │ │ │ ├── __init__.py │ │ │ ├── bleu.py │ │ │ └── bleu_scorer.py │ │ ├── cider │ │ │ ├── __init__.py │ │ │ ├── cider.py │ │ │ └── cider_scorer.py │ │ ├── eval.py │ │ ├── meteor │ │ │ ├── __init__.py │ │ │ ├── data │ │ │ │ └── paraphrase-en.gz │ │ │ ├── meteor-1.5.jar │ │ │ └── meteor.py │ │ ├── readme.md │ │ ├── rouge │ │ │ ├── __init__.py │ │ │ └── rouge.py │ │ └── tokenizer │ │ │ ├── __init__.py │ │ │ ├── ptbtokenizer.py │ │ │ └── stanford-corenlp-3.4.1.jar │ └── readme.md ├── ilsvrc_2012_mean.npy ├── misc.py ├── nn.py └── vocabulary.py └── val ├── images └── readme └── readme /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Guoming Wang & Wenhua Guan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Introduction 2 | This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A convolutional neural network extracts visual features from the image, and an LSTM recurrent neural network decodes these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the captions. The project is implemented with the TensorFlow library and allows end-to-end training of both the CNN and RNN parts.
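For orientation, the soft-attention step at the core of the decoder can be sketched as below. This is a minimal NumPy sketch, not the actual implementation (that lives in `model.py`); the shapes correspond to the VGG16 setting (196 spatial locations of 512-dim conv5_3 features), and the parameter names (`w_att`, `h`) are purely illustrative.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_ctx, dim_ctx, num_lstm_units = 196, 512, 1024   # VGG16 conv5_3 gives a 14x14x512 feature map
features = np.random.rand(num_ctx, dim_ctx)          # stand-in for the CNN feature map
h = np.zeros(num_lstm_units)                         # previous LSTM output
w_att = np.random.rand(dim_ctx + num_lstm_units)     # toy attention parameters

# Score every spatial location against the current LSTM state, normalize the
# scores into attention weights alpha, and form the attended context vector.
# The context is concatenated with the last word's embedding and fed into the
# LSTM, whose output is then decoded into the next word.
scores = np.array([w_att @ np.concatenate([f, h]) for f in features])
alpha = softmax(scores)                               # sums to 1 over the 196 locations
context = (alpha[:, None] * features).sum(axis=0)     # weighted average of the features
```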
3 | 4 | ### Prerequisites 5 | * **TensorFlow** ([instructions](https://www.tensorflow.org/install/)) 6 | * **NumPy** ([instructions](https://scipy.org/install.html)) 7 | * **OpenCV** ([instructions](https://pypi.python.org/pypi/opencv-python)) 8 | * **Natural Language Toolkit (NLTK)** ([instructions](http://www.nltk.org/install.html)) 9 | * **Pandas** ([instructions](https://scipy.org/install.html)) 10 | * **Matplotlib** ([instructions](https://scipy.org/install.html)) 11 | * **tqdm** ([instructions](https://pypi.python.org/pypi/tqdm)) 12 | 13 | ### Usage 14 | * **Preparation:** Download the COCO train2014 and val2014 data [here](http://cocodataset.org/#download). Put the COCO train2014 images in the folder `train/images`, and put the file `captions_train2014.json` in the folder `train`. Similarly, put the COCO val2014 images in the folder `val/images`, and put the file `captions_val2014.json` in the folder `val`. Furthermore, download the pretrained VGG16 net [here](https://app.box.com/s/idt5khauxsamcg3y69jz13w6sc6122ph) or ResNet50 net [here](https://app.box.com/s/17vthb1zl0zeh340m4gaw0luuf2vscne) if you want to use it to initialize the CNN part. 15 | 16 | * **Training:** 17 | To train a model using the COCO train2014 data, first set up the parameters in the file `config.py` and then run a command like this: 18 | ```shell 19 | python main.py --phase=train \ 20 | --load_cnn \ 21 | --cnn_model_file='./vgg16_no_fc.npy' \ 22 | [--train_cnn] 23 | ``` 24 | Turn on `--train_cnn` if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder `models`. If you want to resume training from a checkpoint, run a command like this: 25 | ```shell 26 | python main.py --phase=train \ 27 | --load \ 28 | --model_file='./models/xxxxxx.npy' \ 29 | [--train_cnn] 30 | ``` 31 | To monitor the progress of training, run the following command: 32 | ```shell 33 | tensorboard --logdir='./summary/' 34 | ``` 35 | 36 | * **Evaluation:** 37 | To evaluate a trained model using the COCO val2014 data, run a command like this: 38 | ```shell 39 | python main.py --phase=eval \ 40 | --model_file='./models/xxxxxx.npy' \ 41 | --beam_size=3 42 | ``` 43 | The results will be printed to stdout. Furthermore, the generated captions will be saved in the file `val/results.json`. 44 | 45 | * **Inference:** 46 | You can use the trained model to generate captions for any JPEG images! Put such images in the folder `test/images`, and run a command like this: 47 | ```shell 48 | python main.py --phase=test \ 49 | --model_file='./models/xxxxxx.npy' \ 50 | --beam_size=3 51 | ``` 52 | The generated captions will be saved in the folder `test/results`. 53 | 54 | ### Results 55 | A pretrained model with the default configuration can be downloaded [here](https://app.box.com/s/xuigzzaqfbpnf76t295h109ey9po5t8p). This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with `beam_size=3`): 56 | * **BLEU-1 = 70.3%** 57 | * **BLEU-2 = 53.6%** 58 | * **BLEU-3 = 39.8%** 59 | * **BLEU-4 = 29.5%** 60 | 61 | Here are some captions generated by this model: 62 | ![examples](examples/examples.jpg) 63 | 64 | ### References 65 | * [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
66 | * [The original implementation in Theano](https://github.com/kelvinxu/arctic-captions) 67 | * [An earlier implementation in Tensorflow](https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow) 68 | * [Microsoft COCO dataset](http://mscoco.org/) 69 | -------------------------------------------------------------------------------- /base_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import tensorflow as tf 5 | import matplotlib.pyplot as plt 6 | import cPickle as pickle 7 | import copy 8 | import json 9 | from tqdm import tqdm 10 | 11 | from utils.nn import NN 12 | from utils.coco.coco import COCO 13 | from utils.coco.pycocoevalcap.eval import COCOEvalCap 14 | from utils.misc import ImageLoader, CaptionData, TopN 15 | 16 | class BaseModel(object): 17 | def __init__(self, config): 18 | self.config = config 19 | self.is_train = True if config.phase == 'train' else False 20 | self.train_cnn = self.is_train and config.train_cnn 21 | self.image_loader = ImageLoader('./utils/ilsvrc_2012_mean.npy') 22 | self.image_shape = [224, 224, 3] 23 | self.nn = NN(config) 24 | self.global_step = tf.Variable(0, 25 | name = 'global_step', 26 | trainable = False) 27 | self.build() 28 | 29 | def build(self): 30 | raise NotImplementedError() 31 | 32 | def train(self, sess, train_data): 33 | """ Train the model using the COCO train2014 data. """ 34 | print("Training the model...") 35 | config = self.config 36 | 37 | if not os.path.exists(config.summary_dir): 38 | os.mkdir(config.summary_dir) 39 | train_writer = tf.summary.FileWriter(config.summary_dir, 40 | sess.graph) 41 | 42 | for _ in tqdm(list(range(config.num_epochs)), desc='epoch'): 43 | for _ in tqdm(list(range(train_data.num_batches)), desc='batch'): 44 | batch = train_data.next_batch() 45 | image_files, sentences, masks = batch 46 | images = self.image_loader.load_images(image_files) 47 | feed_dict = {self.images: images, 48 | self.sentences: sentences, 49 | self.masks: masks} 50 | _, summary, global_step = sess.run([self.opt_op, 51 | self.summary, 52 | self.global_step], 53 | feed_dict=feed_dict) 54 | if (global_step + 1) % config.save_period == 0: 55 | self.save() 56 | train_writer.add_summary(summary, global_step) 57 | train_data.reset() 58 | 59 | self.save() 60 | train_writer.close() 61 | print("Training complete.") 62 | 63 | def eval(self, sess, eval_gt_coco, eval_data, vocabulary): 64 | """ Evaluate the model using the COCO val2014 data. 
""" 65 | print("Evaluating the model ...") 66 | config = self.config 67 | 68 | results = [] 69 | if not os.path.exists(config.eval_result_dir): 70 | os.mkdir(config.eval_result_dir) 71 | 72 | # Generate the captions for the images 73 | idx = 0 74 | for k in tqdm(list(range(eval_data.num_batches)), desc='batch'): 75 | batch = eval_data.next_batch() 76 | caption_data = self.beam_search(sess, batch, vocabulary) 77 | 78 | fake_cnt = 0 if k "${filename}.txt" 8 | done 9 | exit 0 10 | -------------------------------------------------------------------------------- /examples/COCO_val2014_000000018295_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000018295_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000072776_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000072776_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000153130_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000153130_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000214274_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000214274_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000222261_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000222261_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000261185_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000261185_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000370315_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000370315_result.jpg -------------------------------------------------------------------------------- /examples/COCO_val2014_000000535467_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/COCO_val2014_000000535467_result.jpg -------------------------------------------------------------------------------- /examples/examples.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/examples/examples.jpg -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import tensorflow as tf 3 | 4 | from config import Config 5 | from model import CaptionGenerator 6 | from dataset import prepare_train_data, prepare_eval_data, prepare_test_data 7 | 8 | FLAGS = tf.app.flags.FLAGS 9 | 10 | tf.flags.DEFINE_string('phase', 'train', 11 | 'The phase can be train, eval or test') 12 | 13 | tf.flags.DEFINE_boolean('load', False, 14 | 'Turn on to load a pretrained model from either \ 15 | the latest checkpoint or a specified file') 16 | 17 | tf.flags.DEFINE_string('model_file', None, 18 | 'If sepcified, load a pretrained model from this file') 19 | 20 | tf.flags.DEFINE_boolean('load_cnn', False, 21 | 'Turn on to load a pretrained CNN model') 22 | 23 | tf.flags.DEFINE_string('cnn_model_file', './vgg16_no_fc.npy', 24 | 'The file containing a pretrained CNN model') 25 | 26 | tf.flags.DEFINE_boolean('train_cnn', False, 27 | 'Turn on to train both CNN and RNN. \ 28 | Otherwise, only RNN is trained') 29 | 30 | tf.flags.DEFINE_integer('beam_size', 3, 31 | 'The size of beam search for caption generation') 32 | 33 | def main(argv): 34 | config = Config() 35 | config.phase = FLAGS.phase 36 | config.train_cnn = FLAGS.train_cnn 37 | config.beam_size = FLAGS.beam_size 38 | 39 | with tf.Session() as sess: 40 | if FLAGS.phase == 'train': 41 | # training phase 42 | data = prepare_train_data(config) 43 | model = CaptionGenerator(config) 44 | sess.run(tf.global_variables_initializer()) 45 | if FLAGS.load: 46 | model.load(sess, FLAGS.model_file) 47 | if FLAGS.load_cnn: 48 | model.load_cnn(sess, FLAGS.cnn_model_file) 49 | tf.get_default_graph().finalize() 50 | model.train(sess, data) 51 | 52 | elif FLAGS.phase == 'eval': 53 | # evaluation phase 54 | coco, data, vocabulary = prepare_eval_data(config) 55 | model = CaptionGenerator(config) 56 | model.load(sess, FLAGS.model_file) 57 | tf.get_default_graph().finalize() 58 | model.eval(sess, coco, data, vocabulary) 59 | 60 | else: 61 | # testing phase 62 | data, vocabulary = prepare_test_data(config) 63 | model = CaptionGenerator(config) 64 | model.load(sess, FLAGS.model_file) 65 | tf.get_default_graph().finalize() 66 | model.test(sess, data, vocabulary) 67 | 68 | if __name__ == '__main__': 69 | tf.app.run() 70 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | from base_model import BaseModel 5 | 6 | class CaptionGenerator(BaseModel): 7 | def build(self): 8 | """ Build the model. """ 9 | self.build_cnn() 10 | self.build_rnn() 11 | if self.is_train: 12 | self.build_optimizer() 13 | self.build_summary() 14 | 15 | def build_cnn(self): 16 | """ Build the CNN. """ 17 | print("Building the CNN...") 18 | if self.config.cnn == 'vgg16': 19 | self.build_vgg16() 20 | else: 21 | self.build_resnet50() 22 | print("CNN built.") 23 | 24 | def build_vgg16(self): 25 | """ Build the VGG16 net. 
""" 26 | config = self.config 27 | 28 | images = tf.placeholder( 29 | dtype = tf.float32, 30 | shape = [config.batch_size] + self.image_shape) 31 | 32 | conv1_1_feats = self.nn.conv2d(images, 64, name = 'conv1_1') 33 | conv1_2_feats = self.nn.conv2d(conv1_1_feats, 64, name = 'conv1_2') 34 | pool1_feats = self.nn.max_pool2d(conv1_2_feats, name = 'pool1') 35 | 36 | conv2_1_feats = self.nn.conv2d(pool1_feats, 128, name = 'conv2_1') 37 | conv2_2_feats = self.nn.conv2d(conv2_1_feats, 128, name = 'conv2_2') 38 | pool2_feats = self.nn.max_pool2d(conv2_2_feats, name = 'pool2') 39 | 40 | conv3_1_feats = self.nn.conv2d(pool2_feats, 256, name = 'conv3_1') 41 | conv3_2_feats = self.nn.conv2d(conv3_1_feats, 256, name = 'conv3_2') 42 | conv3_3_feats = self.nn.conv2d(conv3_2_feats, 256, name = 'conv3_3') 43 | pool3_feats = self.nn.max_pool2d(conv3_3_feats, name = 'pool3') 44 | 45 | conv4_1_feats = self.nn.conv2d(pool3_feats, 512, name = 'conv4_1') 46 | conv4_2_feats = self.nn.conv2d(conv4_1_feats, 512, name = 'conv4_2') 47 | conv4_3_feats = self.nn.conv2d(conv4_2_feats, 512, name = 'conv4_3') 48 | pool4_feats = self.nn.max_pool2d(conv4_3_feats, name = 'pool4') 49 | 50 | conv5_1_feats = self.nn.conv2d(pool4_feats, 512, name = 'conv5_1') 51 | conv5_2_feats = self.nn.conv2d(conv5_1_feats, 512, name = 'conv5_2') 52 | conv5_3_feats = self.nn.conv2d(conv5_2_feats, 512, name = 'conv5_3') 53 | 54 | reshaped_conv5_3_feats = tf.reshape(conv5_3_feats, 55 | [config.batch_size, 196, 512]) 56 | 57 | self.conv_feats = reshaped_conv5_3_feats 58 | self.num_ctx = 196 59 | self.dim_ctx = 512 60 | self.images = images 61 | 62 | def build_resnet50(self): 63 | """ Build the ResNet50. """ 64 | config = self.config 65 | 66 | images = tf.placeholder( 67 | dtype = tf.float32, 68 | shape = [config.batch_size] + self.image_shape) 69 | 70 | conv1_feats = self.nn.conv2d(images, 71 | filters = 64, 72 | kernel_size = (7, 7), 73 | strides = (2, 2), 74 | activation = None, 75 | name = 'conv1') 76 | conv1_feats = self.nn.batch_norm(conv1_feats, 'bn_conv1') 77 | conv1_feats = tf.nn.relu(conv1_feats) 78 | pool1_feats = self.nn.max_pool2d(conv1_feats, 79 | pool_size = (3, 3), 80 | strides = (2, 2), 81 | name = 'pool1') 82 | 83 | res2a_feats = self.resnet_block(pool1_feats, 'res2a', 'bn2a', 64, 1) 84 | res2b_feats = self.resnet_block2(res2a_feats, 'res2b', 'bn2b', 64) 85 | res2c_feats = self.resnet_block2(res2b_feats, 'res2c', 'bn2c', 64) 86 | 87 | res3a_feats = self.resnet_block(res2c_feats, 'res3a', 'bn3a', 128) 88 | res3b_feats = self.resnet_block2(res3a_feats, 'res3b', 'bn3b', 128) 89 | res3c_feats = self.resnet_block2(res3b_feats, 'res3c', 'bn3c', 128) 90 | res3d_feats = self.resnet_block2(res3c_feats, 'res3d', 'bn3d', 128) 91 | 92 | res4a_feats = self.resnet_block(res3d_feats, 'res4a', 'bn4a', 256) 93 | res4b_feats = self.resnet_block2(res4a_feats, 'res4b', 'bn4b', 256) 94 | res4c_feats = self.resnet_block2(res4b_feats, 'res4c', 'bn4c', 256) 95 | res4d_feats = self.resnet_block2(res4c_feats, 'res4d', 'bn4d', 256) 96 | res4e_feats = self.resnet_block2(res4d_feats, 'res4e', 'bn4e', 256) 97 | res4f_feats = self.resnet_block2(res4e_feats, 'res4f', 'bn4f', 256) 98 | 99 | res5a_feats = self.resnet_block(res4f_feats, 'res5a', 'bn5a', 512) 100 | res5b_feats = self.resnet_block2(res5a_feats, 'res5b', 'bn5b', 512) 101 | res5c_feats = self.resnet_block2(res5b_feats, 'res5c', 'bn5c', 512) 102 | 103 | reshaped_res5c_feats = tf.reshape(res5c_feats, 104 | [config.batch_size, 49, 2048]) 105 | 106 | self.conv_feats = reshaped_res5c_feats 107 | 
self.num_ctx = 49 108 | self.dim_ctx = 2048 109 | self.images = images 110 | 111 | def resnet_block(self, inputs, name1, name2, c, s=2): 112 | """ A basic block of ResNet. """ 113 | branch1_feats = self.nn.conv2d(inputs, 114 | filters = 4*c, 115 | kernel_size = (1, 1), 116 | strides = (s, s), 117 | activation = None, 118 | use_bias = False, 119 | name = name1+'_branch1') 120 | branch1_feats = self.nn.batch_norm(branch1_feats, name2+'_branch1') 121 | 122 | branch2a_feats = self.nn.conv2d(inputs, 123 | filters = c, 124 | kernel_size = (1, 1), 125 | strides = (s, s), 126 | activation = None, 127 | use_bias = False, 128 | name = name1+'_branch2a') 129 | branch2a_feats = self.nn.batch_norm(branch2a_feats, name2+'_branch2a') 130 | branch2a_feats = tf.nn.relu(branch2a_feats) 131 | 132 | branch2b_feats = self.nn.conv2d(branch2a_feats, 133 | filters = c, 134 | kernel_size = (3, 3), 135 | strides = (1, 1), 136 | activation = None, 137 | use_bias = False, 138 | name = name1+'_branch2b') 139 | branch2b_feats = self.nn.batch_norm(branch2b_feats, name2+'_branch2b') 140 | branch2b_feats = tf.nn.relu(branch2b_feats) 141 | 142 | branch2c_feats = self.nn.conv2d(branch2b_feats, 143 | filters = 4*c, 144 | kernel_size = (1, 1), 145 | strides = (1, 1), 146 | activation = None, 147 | use_bias = False, 148 | name = name1+'_branch2c') 149 | branch2c_feats = self.nn.batch_norm(branch2c_feats, name2+'_branch2c') 150 | 151 | outputs = branch1_feats + branch2c_feats 152 | outputs = tf.nn.relu(outputs) 153 | return outputs 154 | 155 | def resnet_block2(self, inputs, name1, name2, c): 156 | """ Another basic block of ResNet. """ 157 | branch2a_feats = self.nn.conv2d(inputs, 158 | filters = c, 159 | kernel_size = (1, 1), 160 | strides = (1, 1), 161 | activation = None, 162 | use_bias = False, 163 | name = name1+'_branch2a') 164 | branch2a_feats = self.nn.batch_norm(branch2a_feats, name2+'_branch2a') 165 | branch2a_feats = tf.nn.relu(branch2a_feats) 166 | 167 | branch2b_feats = self.nn.conv2d(branch2a_feats, 168 | filters = c, 169 | kernel_size = (3, 3), 170 | strides = (1, 1), 171 | activation = None, 172 | use_bias = False, 173 | name = name1+'_branch2b') 174 | branch2b_feats = self.nn.batch_norm(branch2b_feats, name2+'_branch2b') 175 | branch2b_feats = tf.nn.relu(branch2b_feats) 176 | 177 | branch2c_feats = self.nn.conv2d(branch2b_feats, 178 | filters = 4*c, 179 | kernel_size = (1, 1), 180 | strides = (1, 1), 181 | activation = None, 182 | use_bias = False, 183 | name = name1+'_branch2c') 184 | branch2c_feats = self.nn.batch_norm(branch2c_feats, name2+'_branch2c') 185 | 186 | outputs = inputs + branch2c_feats 187 | outputs = tf.nn.relu(outputs) 188 | return outputs 189 | 190 | def build_rnn(self): 191 | """ Build the RNN. 
""" 192 | print("Building the RNN...") 193 | config = self.config 194 | 195 | # Setup the placeholders 196 | if self.is_train: 197 | contexts = self.conv_feats 198 | sentences = tf.placeholder( 199 | dtype = tf.int32, 200 | shape = [config.batch_size, config.max_caption_length]) 201 | masks = tf.placeholder( 202 | dtype = tf.float32, 203 | shape = [config.batch_size, config.max_caption_length]) 204 | else: 205 | contexts = tf.placeholder( 206 | dtype = tf.float32, 207 | shape = [config.batch_size, self.num_ctx, self.dim_ctx]) 208 | last_memory = tf.placeholder( 209 | dtype = tf.float32, 210 | shape = [config.batch_size, config.num_lstm_units]) 211 | last_output = tf.placeholder( 212 | dtype = tf.float32, 213 | shape = [config.batch_size, config.num_lstm_units]) 214 | last_word = tf.placeholder( 215 | dtype = tf.int32, 216 | shape = [config.batch_size]) 217 | 218 | # Setup the word embedding 219 | with tf.variable_scope("word_embedding"): 220 | embedding_matrix = tf.get_variable( 221 | name = 'weights', 222 | shape = [config.vocabulary_size, config.dim_embedding], 223 | initializer = self.nn.fc_kernel_initializer, 224 | regularizer = self.nn.fc_kernel_regularizer, 225 | trainable = self.is_train) 226 | 227 | # Setup the LSTM 228 | lstm = tf.nn.rnn_cell.LSTMCell( 229 | config.num_lstm_units, 230 | initializer = self.nn.fc_kernel_initializer) 231 | if self.is_train: 232 | lstm = tf.nn.rnn_cell.DropoutWrapper( 233 | lstm, 234 | input_keep_prob = 1.0-config.lstm_drop_rate, 235 | output_keep_prob = 1.0-config.lstm_drop_rate, 236 | state_keep_prob = 1.0-config.lstm_drop_rate) 237 | 238 | # Initialize the LSTM using the mean context 239 | with tf.variable_scope("initialize"): 240 | context_mean = tf.reduce_mean(self.conv_feats, axis = 1) 241 | initial_memory, initial_output = self.initialize(context_mean) 242 | initial_state = initial_memory, initial_output 243 | 244 | # Prepare to run 245 | predictions = [] 246 | if self.is_train: 247 | alphas = [] 248 | cross_entropies = [] 249 | predictions_correct = [] 250 | num_steps = config.max_caption_length 251 | last_output = initial_output 252 | last_memory = initial_memory 253 | last_word = tf.zeros([config.batch_size], tf.int32) 254 | else: 255 | num_steps = 1 256 | last_state = last_memory, last_output 257 | 258 | # Generate the words one by one 259 | for idx in range(num_steps): 260 | # Attention mechanism 261 | with tf.variable_scope("attend"): 262 | alpha = self.attend(contexts, last_output) 263 | context = tf.reduce_sum(contexts*tf.expand_dims(alpha, 2), 264 | axis = 1) 265 | if self.is_train: 266 | tiled_masks = tf.tile(tf.expand_dims(masks[:, idx], 1), 267 | [1, self.num_ctx]) 268 | masked_alpha = alpha * tiled_masks 269 | alphas.append(tf.reshape(masked_alpha, [-1])) 270 | 271 | # Embed the last word 272 | with tf.variable_scope("word_embedding"): 273 | word_embed = tf.nn.embedding_lookup(embedding_matrix, 274 | last_word) 275 | # Apply the LSTM 276 | with tf.variable_scope("lstm"): 277 | current_input = tf.concat([context, word_embed], 1) 278 | output, state = lstm(current_input, last_state) 279 | memory, _ = state 280 | 281 | # Decode the expanded output of LSTM into a word 282 | with tf.variable_scope("decode"): 283 | expanded_output = tf.concat([output, 284 | context, 285 | word_embed], 286 | axis = 1) 287 | logits = self.decode(expanded_output) 288 | probs = tf.nn.softmax(logits) 289 | prediction = tf.argmax(logits, 1) 290 | predictions.append(prediction) 291 | 292 | # Compute the loss for this step, if necessary 293 | if self.is_train: 
294 | cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits( 295 | labels = sentences[:, idx], 296 | logits = logits) 297 | masked_cross_entropy = cross_entropy * masks[:, idx] 298 | cross_entropies.append(masked_cross_entropy) 299 | 300 | ground_truth = tf.cast(sentences[:, idx], tf.int64) 301 | prediction_correct = tf.where( 302 | tf.equal(prediction, ground_truth), 303 | tf.cast(masks[:, idx], tf.float32), 304 | tf.cast(tf.zeros_like(prediction), tf.float32)) 305 | predictions_correct.append(prediction_correct) 306 | 307 | last_output = output 308 | last_memory = memory 309 | last_state = state 310 | last_word = sentences[:, idx] 311 | 312 | tf.get_variable_scope().reuse_variables() 313 | 314 | # Compute the final loss, if necessary 315 | if self.is_train: 316 | cross_entropies = tf.stack(cross_entropies, axis = 1) 317 | cross_entropy_loss = tf.reduce_sum(cross_entropies) \ 318 | / tf.reduce_sum(masks) 319 | 320 | alphas = tf.stack(alphas, axis = 1) 321 | alphas = tf.reshape(alphas, [config.batch_size, self.num_ctx, -1]) 322 | attentions = tf.reduce_sum(alphas, axis = 2) 323 | diffs = tf.ones_like(attentions) - attentions 324 | attention_loss = config.attention_loss_factor \ 325 | * tf.nn.l2_loss(diffs) \ 326 | / (config.batch_size * self.num_ctx) 327 | 328 | reg_loss = tf.losses.get_regularization_loss() 329 | 330 | total_loss = cross_entropy_loss + attention_loss + reg_loss 331 | 332 | predictions_correct = tf.stack(predictions_correct, axis = 1) 333 | accuracy = tf.reduce_sum(predictions_correct) \ 334 | / tf.reduce_sum(masks) 335 | 336 | self.contexts = contexts 337 | if self.is_train: 338 | self.sentences = sentences 339 | self.masks = masks 340 | self.total_loss = total_loss 341 | self.cross_entropy_loss = cross_entropy_loss 342 | self.attention_loss = attention_loss 343 | self.reg_loss = reg_loss 344 | self.accuracy = accuracy 345 | self.attentions = attentions 346 | else: 347 | self.initial_memory = initial_memory 348 | self.initial_output = initial_output 349 | self.last_memory = last_memory 350 | self.last_output = last_output 351 | self.last_word = last_word 352 | self.memory = memory 353 | self.output = output 354 | self.probs = probs 355 | 356 | print("RNN built.") 357 | 358 | def initialize(self, context_mean): 359 | """ Initialize the LSTM using the mean context. """ 360 | config = self.config 361 | context_mean = self.nn.dropout(context_mean) 362 | if config.num_initalize_layers == 1: 363 | # use 1 fc layer to initialize 364 | memory = self.nn.dense(context_mean, 365 | units = config.num_lstm_units, 366 | activation = None, 367 | name = 'fc_a') 368 | output = self.nn.dense(context_mean, 369 | units = config.num_lstm_units, 370 | activation = None, 371 | name = 'fc_b') 372 | else: 373 | # use 2 fc layers to initialize 374 | temp1 = self.nn.dense(context_mean, 375 | units = config.dim_initalize_layer, 376 | activation = tf.tanh, 377 | name = 'fc_a1') 378 | temp1 = self.nn.dropout(temp1) 379 | memory = self.nn.dense(temp1, 380 | units = config.num_lstm_units, 381 | activation = None, 382 | name = 'fc_a2') 383 | 384 | temp2 = self.nn.dense(context_mean, 385 | units = config.dim_initalize_layer, 386 | activation = tf.tanh, 387 | name = 'fc_b1') 388 | temp2 = self.nn.dropout(temp2) 389 | output = self.nn.dense(temp2, 390 | units = config.num_lstm_units, 391 | activation = None, 392 | name = 'fc_b2') 393 | return memory, output 394 | 395 | def attend(self, contexts, output): 396 | """ Attention Mechanism. 
""" 397 | config = self.config 398 | reshaped_contexts = tf.reshape(contexts, [-1, self.dim_ctx]) 399 | reshaped_contexts = self.nn.dropout(reshaped_contexts) 400 | output = self.nn.dropout(output) 401 | if config.num_attend_layers == 1: 402 | # use 1 fc layer to attend 403 | logits1 = self.nn.dense(reshaped_contexts, 404 | units = 1, 405 | activation = None, 406 | use_bias = False, 407 | name = 'fc_a') 408 | logits1 = tf.reshape(logits1, [-1, self.num_ctx]) 409 | logits2 = self.nn.dense(output, 410 | units = self.num_ctx, 411 | activation = None, 412 | use_bias = False, 413 | name = 'fc_b') 414 | logits = logits1 + logits2 415 | else: 416 | # use 2 fc layers to attend 417 | temp1 = self.nn.dense(reshaped_contexts, 418 | units = config.dim_attend_layer, 419 | activation = tf.tanh, 420 | name = 'fc_1a') 421 | temp2 = self.nn.dense(output, 422 | units = config.dim_attend_layer, 423 | activation = tf.tanh, 424 | name = 'fc_1b') 425 | temp2 = tf.tile(tf.expand_dims(temp2, 1), [1, self.num_ctx, 1]) 426 | temp2 = tf.reshape(temp2, [-1, config.dim_attend_layer]) 427 | temp = temp1 + temp2 428 | temp = self.nn.dropout(temp) 429 | logits = self.nn.dense(temp, 430 | units = 1, 431 | activation = None, 432 | use_bias = False, 433 | name = 'fc_2') 434 | logits = tf.reshape(logits, [-1, self.num_ctx]) 435 | alpha = tf.nn.softmax(logits) 436 | return alpha 437 | 438 | def decode(self, expanded_output): 439 | """ Decode the expanded output of the LSTM into a word. """ 440 | config = self.config 441 | expanded_output = self.nn.dropout(expanded_output) 442 | if config.num_decode_layers == 1: 443 | # use 1 fc layer to decode 444 | logits = self.nn.dense(expanded_output, 445 | units = config.vocabulary_size, 446 | activation = None, 447 | name = 'fc') 448 | else: 449 | # use 2 fc layers to decode 450 | temp = self.nn.dense(expanded_output, 451 | units = config.dim_decode_layer, 452 | activation = tf.tanh, 453 | name = 'fc_1') 454 | temp = self.nn.dropout(temp) 455 | logits = self.nn.dense(temp, 456 | units = config.vocabulary_size, 457 | activation = None, 458 | name = 'fc_2') 459 | return logits 460 | 461 | def build_optimizer(self): 462 | """ Setup the optimizer and training operation. 
""" 463 | config = self.config 464 | 465 | learning_rate = tf.constant(config.initial_learning_rate) 466 | if config.learning_rate_decay_factor < 1.0: 467 | def _learning_rate_decay_fn(learning_rate, global_step): 468 | return tf.train.exponential_decay( 469 | learning_rate, 470 | global_step, 471 | decay_steps = config.num_steps_per_decay, 472 | decay_rate = config.learning_rate_decay_factor, 473 | staircase = True) 474 | learning_rate_decay_fn = _learning_rate_decay_fn 475 | else: 476 | learning_rate_decay_fn = None 477 | 478 | with tf.variable_scope('optimizer', reuse = tf.AUTO_REUSE): 479 | if config.optimizer == 'Adam': 480 | optimizer = tf.train.AdamOptimizer( 481 | learning_rate = config.initial_learning_rate, 482 | beta1 = config.beta1, 483 | beta2 = config.beta2, 484 | epsilon = config.epsilon 485 | ) 486 | elif config.optimizer == 'RMSProp': 487 | optimizer = tf.train.RMSPropOptimizer( 488 | learning_rate = config.initial_learning_rate, 489 | decay = config.decay, 490 | momentum = config.momentum, 491 | centered = config.centered, 492 | epsilon = config.epsilon 493 | ) 494 | elif config.optimizer == 'Momentum': 495 | optimizer = tf.train.MomentumOptimizer( 496 | learning_rate = config.initial_learning_rate, 497 | momentum = config.momentum, 498 | use_nesterov = config.use_nesterov 499 | ) 500 | else: 501 | optimizer = tf.train.GradientDescentOptimizer( 502 | learning_rate = config.initial_learning_rate 503 | ) 504 | 505 | opt_op = tf.contrib.layers.optimize_loss( 506 | loss = self.total_loss, 507 | global_step = self.global_step, 508 | learning_rate = learning_rate, 509 | optimizer = optimizer, 510 | clip_gradients = config.clip_gradients, 511 | learning_rate_decay_fn = learning_rate_decay_fn) 512 | 513 | self.opt_op = opt_op 514 | 515 | def build_summary(self): 516 | """ Build the summary (for TensorBoard visualization). """ 517 | with tf.name_scope("variables"): 518 | for var in tf.trainable_variables(): 519 | with tf.name_scope(var.name[:var.name.find(":")]): 520 | self.variable_summary(var) 521 | 522 | with tf.name_scope("metrics"): 523 | tf.summary.scalar("cross_entropy_loss", self.cross_entropy_loss) 524 | tf.summary.scalar("attention_loss", self.attention_loss) 525 | tf.summary.scalar("reg_loss", self.reg_loss) 526 | tf.summary.scalar("total_loss", self.total_loss) 527 | tf.summary.scalar("accuracy", self.accuracy) 528 | 529 | with tf.name_scope("attentions"): 530 | self.variable_summary(self.attentions) 531 | 532 | self.summary = tf.summary.merge_all() 533 | 534 | def variable_summary(self, var): 535 | """ Build the summary for a variable. """ 536 | mean = tf.reduce_mean(var) 537 | tf.summary.scalar('mean', mean) 538 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 539 | tf.summary.scalar('stddev', stddev) 540 | tf.summary.scalar('max', tf.reduce_max(var)) 541 | tf.summary.scalar('min', tf.reduce_min(var)) 542 | tf.summary.histogram('histogram', var) 543 | -------------------------------------------------------------------------------- /models/readme: -------------------------------------------------------------------------------- 1 | The trained models will be saved here. 2 | -------------------------------------------------------------------------------- /models/trim_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Run this script to remove the data that are only useful for training 4 | # from your model files in order to make them more compact. 
5 | 6 | import os 7 | import numpy as np 8 | 9 | if __name__=='__main__': 10 | files = os.listdir('.') 11 | model_files = [f for f in files if f.endswith('.npy')] 12 | 13 | for model_file in model_files: 14 | model = np.load(model_file).item() 15 | trimmed_model = {var_name: model[var_name] for var_name in model.keys() 16 | if 'optimizer' not in var_name} 17 | os.rename(model_file, model_file[:-4]+'_old.npy') 18 | np.save(model_file, trimmed_model) 19 | -------------------------------------------------------------------------------- /summary/readme: -------------------------------------------------------------------------------- 1 | The summary (for TensorBoard visualization) will be saved here. 2 | -------------------------------------------------------------------------------- /test/images/1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/1.jpg -------------------------------------------------------------------------------- /test/images/2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/2.jpg -------------------------------------------------------------------------------- /test/images/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/images/3.jpg -------------------------------------------------------------------------------- /test/results/1_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/1_result.jpg -------------------------------------------------------------------------------- /test/results/2_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/2_result.jpg -------------------------------------------------------------------------------- /test/results/3_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/test/results/3_result.jpg -------------------------------------------------------------------------------- /train/images/readme: -------------------------------------------------------------------------------- 1 | Put the COCO train2014 images here. 2 | -------------------------------------------------------------------------------- /train/readme: -------------------------------------------------------------------------------- 1 | Put the file captions_train2014.json here. 
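A quick sanity check of this layout before training (a minimal sketch; the paths are those given in the README, and the key names are the standard COCO captions annotation format):
```python
import os, json

assert os.path.isdir('./train/images'), 'put the COCO train2014 images in train/images'
with open('./train/captions_train2014.json') as f:
    anns = json.load(f)
print('%d captions for %d images' % (len(anns['annotations']), len(anns['images'])))
```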
2 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/__init__.py -------------------------------------------------------------------------------- /utils/coco/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/coco.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | __version__ = '2.0' 3 | # Interface for accessing the Microsoft COCO dataset. 4 | 5 | # Microsoft COCO is a large image dataset designed for object detection, 6 | # segmentation, and caption generation. pycocotools is a Python API that 7 | # assists in loading, parsing and visualizing the annotations in COCO. 8 | # Please visit http://mscoco.org/ for more information on COCO, including 9 | # for the data, paper, and tutorials. The exact format of the annotations 10 | # is also described on the COCO website. For example usage of the pycocotools 11 | # please see pycocotools_demo.ipynb. In addition to this API, please download both 12 | # the COCO images and annotations in order to run the demo. 13 | 14 | # An alternative to using the API is to load the annotations directly 15 | # into Python dictionary 16 | # Using the API provides additional utility functions. Note that this API 17 | # supports both *instance* and *caption* annotations. In the case of 18 | # captions not all functions are defined (e.g. categories are undefined). 19 | 20 | # The following API functions are defined: 21 | # COCO - COCO api class that loads COCO annotation file and prepare data structures. 22 | # decodeMask - Decode binary mask M encoded via run-length encoding. 23 | # encodeMask - Encode binary mask M using run-length encoding. 24 | # getAnnIds - Get ann ids that satisfy given filter conditions. 25 | # getCatIds - Get cat ids that satisfy given filter conditions. 26 | # getImgIds - Get img ids that satisfy given filter conditions. 27 | # loadAnns - Load anns with the specified ids. 28 | # loadCats - Load cats with the specified ids. 29 | # loadImgs - Load imgs with the specified ids. 30 | # segToMask - Convert polygon segmentation to binary mask. 31 | # showAnns - Display the specified annotations. 32 | # loadRes - Load algorithm results and create API for accessing them. 33 | # download - Download COCO images from mscoco.org server. 34 | # Throughout the API "ann"=annotation, "cat"=category, and "img"=image. 35 | # Help on each functions can be accessed by: "help COCO>function". 36 | 37 | # See also COCO>decodeMask, 38 | # COCO>encodeMask, COCO>getAnnIds, COCO>getCatIds, 39 | # COCO>getImgIds, COCO>loadAnns, COCO>loadCats, 40 | # COCO>loadImgs, COCO>segToMask, COCO>showAnns 41 | 42 | # Microsoft COCO Toolbox. version 2.0 43 | # Data, paper, and tutorials available at: http://mscoco.org/ 44 | # Code written by Piotr Dollar and Tsung-Yi Lin, 2014. 
45 | # Licensed under the Simplified BSD License [see bsd.txt] 46 | 47 | import json 48 | import datetime 49 | import time 50 | import matplotlib.pyplot as plt 51 | from matplotlib.collections import PatchCollection 52 | from matplotlib.patches import Polygon 53 | import numpy as np 54 | from skimage.draw import polygon 55 | import urllib 56 | import copy 57 | import itertools 58 | import os 59 | import string 60 | from tqdm import tqdm 61 | from nltk.tokenize import word_tokenize 62 | 63 | class COCO: 64 | def __init__(self, annotation_file=None): 65 | """ 66 | Constructor of Microsoft COCO helper class for reading and visualizing annotations. 67 | :param annotation_file (str): location of annotation file 68 | :param image_folder (str): location to the folder that hosts images. 69 | :return: 70 | """ 71 | # load dataset 72 | self.dataset = {} 73 | self.anns = [] 74 | self.imgToAnns = {} 75 | self.catToImgs = {} 76 | self.imgs = {} 77 | self.cats = {} 78 | self.img_name_to_id = {} 79 | 80 | if not annotation_file == None: 81 | print 'loading annotations into memory...' 82 | tic = time.time() 83 | dataset = json.load(open(annotation_file, 'r')) 84 | print 'Done (t=%0.2fs)'%(time.time()- tic) 85 | self.dataset = dataset 86 | self.process_dataset() 87 | self.createIndex() 88 | 89 | def createIndex(self): 90 | # create index 91 | print 'creating index...' 92 | anns = {} 93 | imgToAnns = {} 94 | catToImgs = {} 95 | cats = {} 96 | imgs = {} 97 | img_name_to_id = {} 98 | 99 | if 'annotations' in self.dataset: 100 | imgToAnns = {ann['image_id']: [] for ann in self.dataset['annotations']} 101 | anns = {ann['id']: [] for ann in self.dataset['annotations']} 102 | for ann in self.dataset['annotations']: 103 | imgToAnns[ann['image_id']] += [ann] 104 | anns[ann['id']] = ann 105 | 106 | if 'images' in self.dataset: 107 | imgs = {im['id']: {} for im in self.dataset['images']} 108 | for img in self.dataset['images']: 109 | imgs[img['id']] = img 110 | img_name_to_id[img['file_name']] = img['id'] 111 | 112 | if 'categories' in self.dataset: 113 | cats = {cat['id']: [] for cat in self.dataset['categories']} 114 | for cat in self.dataset['categories']: 115 | cats[cat['id']] = cat 116 | catToImgs = {cat['id']: [] for cat in self.dataset['categories']} 117 | for ann in self.dataset['annotations']: 118 | catToImgs[ann['category_id']] += [ann['image_id']] 119 | 120 | print 'index created!' 121 | 122 | # create class members 123 | self.anns = anns 124 | self.imgToAnns = imgToAnns 125 | self.catToImgs = catToImgs 126 | self.imgs = imgs 127 | self.cats = cats 128 | self.img_name_to_id = img_name_to_id 129 | 130 | def info(self): 131 | """ 132 | Print information about the annotation file. 133 | :return: 134 | """ 135 | for key, value in self.dataset['info'].items(): 136 | print '%s: %s'%(key, value) 137 | 138 | def getAnnIds(self, imgIds=[], catIds=[], areaRng=[], iscrowd=None): 139 | """ 140 | Get ann ids that satisfy given filter conditions. default skips that filter 141 | :param imgIds (int array) : get anns for given imgs 142 | catIds (int array) : get anns for given cats 143 | areaRng (float array) : get anns for given area range (e.g. 
[0 inf]) 144 | iscrowd (boolean) : get anns for given crowd label (False or True) 145 | :return: ids (int array) : integer array of ann ids 146 | """ 147 | imgIds = imgIds if type(imgIds) == list else [imgIds] 148 | catIds = catIds if type(catIds) == list else [catIds] 149 | 150 | if len(imgIds) == len(catIds) == len(areaRng) == 0: 151 | anns = self.dataset['annotations'] 152 | else: 153 | if not len(imgIds) == 0: 154 | # this can be changed by defaultdict 155 | lists = [self.imgToAnns[imgId] for imgId in imgIds if imgId in self.imgToAnns] 156 | anns = list(itertools.chain.from_iterable(lists)) 157 | else: 158 | anns = self.dataset['annotations'] 159 | anns = anns if len(catIds) == 0 else [ann for ann in anns if ann['category_id'] in catIds] 160 | anns = anns if len(areaRng) == 0 else [ann for ann in anns if ann['area'] > areaRng[0] and ann['area'] < areaRng[1]] 161 | if not iscrowd == None: 162 | ids = [ann['id'] for ann in anns if ann['iscrowd'] == iscrowd] 163 | else: 164 | ids = [ann['id'] for ann in anns] 165 | return ids 166 | 167 | def getCatIds(self, catNms=[], supNms=[], catIds=[]): 168 | """ 169 | filtering parameters. default skips that filter. 170 | :param catNms (str array) : get cats for given cat names 171 | :param supNms (str array) : get cats for given supercategory names 172 | :param catIds (int array) : get cats for given cat ids 173 | :return: ids (int array) : integer array of cat ids 174 | """ 175 | catNms = catNms if type(catNms) == list else [catNms] 176 | supNms = supNms if type(supNms) == list else [supNms] 177 | catIds = catIds if type(catIds) == list else [catIds] 178 | 179 | if len(catNms) == len(supNms) == len(catIds) == 0: 180 | cats = self.dataset['categories'] 181 | else: 182 | cats = self.dataset['categories'] 183 | cats = cats if len(catNms) == 0 else [cat for cat in cats if cat['name'] in catNms] 184 | cats = cats if len(supNms) == 0 else [cat for cat in cats if cat['supercategory'] in supNms] 185 | cats = cats if len(catIds) == 0 else [cat for cat in cats if cat['id'] in catIds] 186 | ids = [cat['id'] for cat in cats] 187 | return ids 188 | 189 | def getImgIds(self, imgIds=[], catIds=[]): 190 | ''' 191 | Get img ids that satisfy given filter conditions. 192 | :param imgIds (int array) : get imgs for given ids 193 | :param catIds (int array) : get imgs with all given cats 194 | :return: ids (int array) : integer array of img ids 195 | ''' 196 | imgIds = imgIds if type(imgIds) == list else [imgIds] 197 | catIds = catIds if type(catIds) == list else [catIds] 198 | 199 | if len(imgIds) == len(catIds) == 0: 200 | ids = self.imgs.keys() 201 | else: 202 | ids = set(imgIds) 203 | for i, catId in enumerate(catIds): 204 | if i == 0 and len(ids) == 0: 205 | ids = set(self.catToImgs[catId]) 206 | else: 207 | ids &= set(self.catToImgs[catId]) 208 | return list(ids) 209 | 210 | def loadAnns(self, ids=[]): 211 | """ 212 | Load anns with the specified ids. 213 | :param ids (int array) : integer ids specifying anns 214 | :return: anns (object array) : loaded ann objects 215 | """ 216 | if type(ids) == list: 217 | return [self.anns[id] for id in ids] 218 | elif type(ids) == int: 219 | return [self.anns[ids]] 220 | 221 | def loadCats(self, ids=[]): 222 | """ 223 | Load cats with the specified ids. 
224 | :param ids (int array) : integer ids specifying cats 225 | :return: cats (object array) : loaded cat objects 226 | """ 227 | if type(ids) == list: 228 | return [self.cats[id] for id in ids] 229 | elif type(ids) == int: 230 | return [self.cats[ids]] 231 | 232 | def loadImgs(self, ids=[]): 233 | """ 234 | Load anns with the specified ids. 235 | :param ids (int array) : integer ids specifying img 236 | :return: imgs (object array) : loaded img objects 237 | """ 238 | if type(ids) == list: 239 | return [self.imgs[id] for id in ids] 240 | elif type(ids) == int: 241 | return [self.imgs[ids]] 242 | 243 | def loadRes(self, resFile): 244 | """ 245 | Load result file and return a result api object. 246 | :param resFile (str) : file name of result file 247 | :return: res (obj) : result api object 248 | """ 249 | res = COCO() 250 | res.dataset['images'] = [img for img in self.dataset['images']] 251 | # res.dataset['info'] = copy.deepcopy(self.dataset['info']) 252 | # res.dataset['licenses'] = copy.deepcopy(self.dataset['licenses']) 253 | 254 | print 'Loading and preparing results... ' 255 | tic = time.time() 256 | anns = json.load(open(resFile)) 257 | assert type(anns) == list, 'results in not an array of objects' 258 | annsImgIds = [ann['image_id'] for ann in anns] 259 | assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \ 260 | 'Results do not correspond to current coco set' 261 | assert 'caption' in anns[0] 262 | imgIds = set([img['id'] for img in res.dataset['images']]) & set([ann['image_id'] for ann in anns]) 263 | res.dataset['images'] = [img for img in res.dataset['images'] if img['id'] in imgIds] 264 | for id, ann in enumerate(anns): 265 | ann['id'] = id+1 266 | print 'DONE (t=%0.2fs)'%(time.time()- tic) 267 | 268 | res.dataset['annotations'] = anns 269 | res.createIndex() 270 | return res 271 | 272 | def download( self, tarDir = None, imgIds = [] ): 273 | ''' 274 | Download COCO images from mscoco.org server. 275 | :param tarDir (str): COCO results directory name 276 | imgIds (list): images to be downloaded 277 | :return: 278 | ''' 279 | if tarDir is None: 280 | print 'Please specify target directory' 281 | return -1 282 | if len(imgIds) == 0: 283 | imgs = self.imgs.values() 284 | else: 285 | imgs = self.loadImgs(imgIds) 286 | N = len(imgs) 287 | if not os.path.exists(tarDir): 288 | os.makedirs(tarDir) 289 | for i, img in enumerate(imgs): 290 | tic = time.time() 291 | fname = os.path.join(tarDir, img['file_name']) 292 | if not os.path.exists(fname): 293 | urllib.urlretrieve(img['coco_url'], fname) 294 | print 'downloaded %d/%d images (t=%.1fs)'%(i, N, time.time()- tic) 295 | 296 | def process_dataset(self): 297 | for ann in self.dataset['annotations']: 298 | q = ann['caption'].lower() 299 | if q[-1]!='.': 300 | q = q + '.' 
301 | ann['caption'] = q 302 | 303 | def filter_by_cap_len(self, max_cap_len): 304 | print("Filtering the captions by length...") 305 | keep_ann = {} 306 | keep_img = {} 307 | for ann in tqdm(self.dataset['annotations']): 308 | if len(word_tokenize(ann['caption']))<=max_cap_len: 309 | keep_ann[ann['id']] = keep_ann.get(ann['id'], 0) + 1 310 | keep_img[ann['image_id']] = keep_img.get(ann['image_id'], 0) + 1 311 | 312 | self.dataset['annotations'] = \ 313 | [ann for ann in self.dataset['annotations'] \ 314 | if keep_ann.get(ann['id'],0)>0] 315 | self.dataset['images'] = \ 316 | [img for img in self.dataset['images'] \ 317 | if keep_img.get(img['id'],0)>0] 318 | 319 | self.createIndex() 320 | 321 | def filter_by_words(self, vocab): 322 | print("Filtering the captions by words...") 323 | keep_ann = {} 324 | keep_img = {} 325 | for ann in tqdm(self.dataset['annotations']): 326 | keep_ann[ann['id']] = 1 327 | words_in_ann = word_tokenize(ann['caption']) 328 | for word in words_in_ann: 329 | if word not in vocab: 330 | keep_ann[ann['id']] = 0 331 | break 332 | keep_img[ann['image_id']] = keep_img.get(ann['image_id'], 0) + 1 333 | 334 | self.dataset['annotations'] = \ 335 | [ann for ann in self.dataset['annotations'] \ 336 | if keep_ann.get(ann['id'],0)>0] 337 | self.dataset['images'] = \ 338 | [img for img in self.dataset['images'] \ 339 | if keep_img.get(img['id'],0)>0] 340 | 341 | self.createIndex() 342 | 343 | def all_captions(self): 344 | return [ann['caption'] for ann_id, ann in self.anns.items()] 345 | -------------------------------------------------------------------------------- /utils/coco/license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2014, Piotr Dollar and Tsung-Yi Lin 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015 Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/bleu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : bleu.py 4 | # 5 | # Description : Wrapper for BLEU scorer. 6 | # 7 | # Creation Date : 06-01-2015 8 | # Last Modified : Thu 19 Mar 2015 09:13:28 PM PDT 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | from bleu_scorer import BleuScorer 12 | 13 | 14 | class Bleu: 15 | def __init__(self, n=4): 16 | # default compute Blue score up to 4 17 | self._n = n 18 | self._hypo_for_image = {} 19 | self.ref_for_image = {} 20 | 21 | def compute_score(self, gts, res): 22 | 23 | assert(gts.keys() == res.keys()) 24 | imgIds = gts.keys() 25 | 26 | bleu_scorer = BleuScorer(n=self._n) 27 | for id in imgIds: 28 | hypo = res[id] 29 | ref = gts[id] 30 | 31 | # Sanity check. 
32 | assert(type(hypo) is list) 33 | assert(len(hypo) == 1) 34 | assert(type(ref) is list) 35 | assert(len(ref) >= 1) 36 | 37 | bleu_scorer += (hypo[0], ref) 38 | 39 | #score, scores = bleu_scorer.compute_score(option='shortest') 40 | score, scores = bleu_scorer.compute_score(option='closest', verbose=1) 41 | #score, scores = bleu_scorer.compute_score(option='average', verbose=1) 42 | 43 | # return (bleu, bleu_info) 44 | return score, scores 45 | 46 | def method(self): 47 | return "Bleu" 48 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/bleu/bleu_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # bleu_scorer.py 4 | # David Chiang 5 | 6 | # Copyright (c) 2004-2006 University of Maryland. All rights 7 | # reserved. Do not redistribute without permission from the 8 | # author. Not for commercial use. 9 | 10 | # Modified by: 11 | # Hao Fang 12 | # Tsung-Yi Lin 13 | 14 | '''Provides: 15 | cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test(). 16 | cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked(). 17 | ''' 18 | 19 | import copy 20 | import sys, math, re 21 | from collections import defaultdict 22 | 23 | def precook(s, n=4, out=False): 24 | """Takes a string as input and returns an object that can be given to 25 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 26 | can take string arguments as well.""" 27 | words = s.split() 28 | counts = defaultdict(int) 29 | for k in xrange(1,n+1): 30 | for i in xrange(len(words)-k+1): 31 | ngram = tuple(words[i:i+k]) 32 | counts[ngram] += 1 33 | return (len(words), counts) 34 | 35 | def cook_refs(refs, eff=None, n=4): ## lhuang: oracle will call with "average" 36 | '''Takes a list of reference sentences for a single segment 37 | and returns an object that encapsulates everything that BLEU 38 | needs to know about them.''' 39 | 40 | reflen = [] 41 | maxcounts = {} 42 | for ref in refs: 43 | rl, counts = precook(ref, n) 44 | reflen.append(rl) 45 | for (ngram,count) in counts.iteritems(): 46 | maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 47 | 48 | # Calculate effective reference sentence length. 49 | if eff == "shortest": 50 | reflen = min(reflen) 51 | elif eff == "average": 52 | reflen = float(sum(reflen))/len(reflen) 53 | 54 | ## lhuang: N.B.: leave reflen computaiton to the very end!! 55 | 56 | ## lhuang: N.B.: in case of "closest", keep a list of reflens!! (bad design) 57 | 58 | return (reflen, maxcounts) 59 | 60 | def cook_test(test, (reflen, refmaxcounts), eff=None, n=4): 61 | '''Takes a test sentence and returns an object that 62 | encapsulates everything that BLEU needs to know about it.''' 63 | 64 | testlen, counts = precook(test, n, True) 65 | 66 | result = {} 67 | 68 | # Calculate effective reference sentence length. 
69 | 70 | if eff == "closest": 71 | result["reflen"] = min((abs(l-testlen), l) for l in reflen)[1] 72 | else: ## i.e., "average" or "shortest" or None 73 | result["reflen"] = reflen 74 | 75 | result["testlen"] = testlen 76 | 77 | result["guess"] = [max(0,testlen-k+1) for k in xrange(1,n+1)] 78 | 79 | result['correct'] = [0]*n 80 | for (ngram, count) in counts.iteritems(): 81 | result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count) 82 | 83 | return result 84 | 85 | class BleuScorer(object): 86 | """Bleu scorer. 87 | """ 88 | 89 | __slots__ = "n", "crefs", "ctest", "_score", "_ratio", "_testlen", "_reflen", "special_reflen" 90 | # special_reflen is used in oracle (proportional effective ref len for a node). 91 | 92 | def copy(self): 93 | ''' copy the refs.''' 94 | new = BleuScorer(n=self.n) 95 | new.ctest = copy.copy(self.ctest) 96 | new.crefs = copy.copy(self.crefs) 97 | new._score = None 98 | return new 99 | 100 | def __init__(self, test=None, refs=None, n=4, special_reflen=None): 101 | ''' singular instance ''' 102 | 103 | self.n = n 104 | self.crefs = [] 105 | self.ctest = [] 106 | self.cook_append(test, refs) 107 | self.special_reflen = special_reflen 108 | 109 | def cook_append(self, test, refs): 110 | '''called by constructor and __iadd__ to avoid creating new instances.''' 111 | 112 | if refs is not None: 113 | self.crefs.append(cook_refs(refs)) 114 | if test is not None: 115 | cooked_test = cook_test(test, self.crefs[-1]) 116 | self.ctest.append(cooked_test) ## N.B.: -1 117 | else: 118 | self.ctest.append(None) # lens of crefs and ctest have to match 119 | 120 | self._score = None ## need to recompute 121 | 122 | def ratio(self, option=None): 123 | self.compute_score(option=option) 124 | return self._ratio 125 | 126 | def score_ratio(self, option=None): 127 | '''return (bleu, len_ratio) pair''' 128 | return (self.fscore(option=option), self.ratio(option=option)) 129 | 130 | def score_ratio_str(self, option=None): 131 | return "%.4f (%.2f)" % self.score_ratio(option) 132 | 133 | def reflen(self, option=None): 134 | self.compute_score(option=option) 135 | return self._reflen 136 | 137 | def testlen(self, option=None): 138 | self.compute_score(option=option) 139 | return self._testlen 140 | 141 | def retest(self, new_test): 142 | if type(new_test) is str: 143 | new_test = [new_test] 144 | assert len(new_test) == len(self.crefs), new_test 145 | self.ctest = [] 146 | for t, rs in zip(new_test, self.crefs): 147 | self.ctest.append(cook_test(t, rs)) 148 | self._score = None 149 | 150 | return self 151 | 152 | def rescore(self, new_test): 153 | ''' replace test(s) with new test(s), and returns the new score.''' 154 | 155 | return self.retest(new_test).compute_score() 156 | 157 | def size(self): 158 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 159 | return len(self.crefs) 160 | 161 | def __iadd__(self, other): 162 | '''add an instance (e.g., from another sentence).''' 163 | 164 | if type(other) is tuple: 165 | ## avoid creating new BleuScorer instances 166 | self.cook_append(other[0], other[1]) 167 | else: 168 | assert self.compatible(other), "incompatible BLEUs." 
169 | self.ctest.extend(other.ctest) 170 | self.crefs.extend(other.crefs) 171 | self._score = None ## need to recompute 172 | 173 | return self 174 | 175 | def compatible(self, other): 176 | return isinstance(other, BleuScorer) and self.n == other.n 177 | 178 | def single_reflen(self, option="average"): 179 | return self._single_reflen(self.crefs[0][0], option) 180 | 181 | def _single_reflen(self, reflens, option=None, testlen=None): 182 | 183 | if option == "shortest": 184 | reflen = min(reflens) 185 | elif option == "average": 186 | reflen = float(sum(reflens))/len(reflens) 187 | elif option == "closest": 188 | reflen = min((abs(l-testlen), l) for l in reflens)[1] 189 | else: 190 | assert False, "unsupported reflen option %s" % option 191 | 192 | return reflen 193 | 194 | def recompute_score(self, option=None, verbose=0): 195 | self._score = None 196 | return self.compute_score(option, verbose) 197 | 198 | def compute_score(self, option=None, verbose=0): 199 | n = self.n 200 | small = 1e-9 201 | tiny = 1e-15 ## so that if guess is 0 still return 0 202 | bleu_list = [[] for _ in range(n)] 203 | 204 | if self._score is not None: 205 | return self._score 206 | 207 | if option is None: 208 | option = "average" if len(self.crefs) == 1 else "closest" 209 | 210 | self._testlen = 0 211 | self._reflen = 0 212 | totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n} 213 | 214 | # for each sentence 215 | for comps in self.ctest: 216 | testlen = comps['testlen'] 217 | self._testlen += testlen 218 | 219 | if self.special_reflen is None: ## need computation 220 | reflen = self._single_reflen(comps['reflen'], option, testlen) 221 | else: 222 | reflen = self.special_reflen 223 | 224 | self._reflen += reflen 225 | 226 | for key in ['guess','correct']: 227 | for k in xrange(n): 228 | totalcomps[key][k] += comps[key][k] 229 | 230 | # append per image bleu score 231 | bleu = 1. 232 | for k in xrange(n): 233 | bleu *= (float(comps['correct'][k]) + tiny) \ 234 | /(float(comps['guess'][k]) + small) 235 | bleu_list[k].append(bleu ** (1./(k+1))) 236 | ratio = (testlen + tiny) / (reflen + small) ## N.B.: avoid zero division 237 | if ratio < 1: 238 | for k in xrange(n): 239 | bleu_list[k][-1] *= math.exp(1 - 1/ratio) 240 | 241 | if verbose > 1: 242 | print comps, reflen 243 | 244 | totalcomps['reflen'] = self._reflen 245 | totalcomps['testlen'] = self._testlen 246 | 247 | bleus = [] 248 | bleu = 1. 
249 | for k in xrange(n): 250 | bleu *= float(totalcomps['correct'][k] + tiny) \ 251 | / (totalcomps['guess'][k] + small) 252 | bleus.append(bleu ** (1./(k+1))) 253 | ratio = (self._testlen + tiny) / (self._reflen + small) ## N.B.: avoid zero division 254 | if ratio < 1: 255 | for k in xrange(n): 256 | bleus[k] *= math.exp(1 - 1/ratio) 257 | 258 | if verbose > 0: 259 | print totalcomps 260 | print "ratio:", ratio 261 | 262 | self._score = bleus 263 | return self._score, bleu_list 264 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/cider.py: -------------------------------------------------------------------------------- 1 | # Filename: cider.py 2 | # 3 | # Description: Describes the class to compute the CIDEr (Consensus-Based Image Description Evaluation) Metric 4 | # by Vedantam, Zitnick, and Parikh (http://arxiv.org/abs/1411.5726) 5 | # 6 | # Creation Date: Sun Feb 8 14:16:54 2015 7 | # 8 | # Authors: Ramakrishna Vedantam and Tsung-Yi Lin 9 | 10 | from cider_scorer import CiderScorer 11 | import pdb 12 | 13 | class Cider: 14 | """ 15 | Main Class to compute the CIDEr metric 16 | 17 | """ 18 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 19 | # set cider to sum over 1 to 4-grams 20 | self._n = n 21 | # set the standard deviation parameter for gaussian penalty 22 | self._sigma = sigma 23 | 24 | def compute_score(self, gts, res): 25 | """ 26 | Main function to compute CIDEr score 27 | :param hypo_for_image (dict) : dictionary with key and value 28 | ref_for_image (dict) : dictionary with key and value 29 | :return: cider (float) : computed CIDEr score for the corpus 30 | """ 31 | 32 | assert(gts.keys() == res.keys()) 33 | imgIds = gts.keys() 34 | 35 | cider_scorer = CiderScorer(n=self._n, sigma=self._sigma) 36 | 37 | for id in imgIds: 38 | hypo = res[id] 39 | ref = gts[id] 40 | 41 | # Sanity check. 42 | assert(type(hypo) is list) 43 | assert(len(hypo) == 1) 44 | assert(type(ref) is list) 45 | assert(len(ref) > 0) 46 | 47 | cider_scorer += (hypo[0], ref) 48 | 49 | (score, scores) = cider_scorer.compute_score() 50 | 51 | return score, scores 52 | 53 | def method(self): 54 | return "CIDEr" -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/cider/cider_scorer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # Tsung-Yi Lin 3 | # Ramakrishna Vedantam 4 | 5 | import copy 6 | from collections import defaultdict 7 | import numpy as np 8 | import pdb 9 | import math 10 | 11 | def precook(s, n=4, out=False): 12 | """ 13 | Takes a string as input and returns an object that can be given to 14 | either cook_refs or cook_test. This is optional: cook_refs and cook_test 15 | can take string arguments as well. 
16 | :param s: string : sentence to be converted into ngrams 17 | :param n: int : number of ngrams for which representation is calculated 18 | :return: term frequency vector for occuring ngrams 19 | """ 20 | words = s.split() 21 | counts = defaultdict(int) 22 | for k in xrange(1,n+1): 23 | for i in xrange(len(words)-k+1): 24 | ngram = tuple(words[i:i+k]) 25 | counts[ngram] += 1 26 | return counts 27 | 28 | def cook_refs(refs, n=4): ## lhuang: oracle will call with "average" 29 | '''Takes a list of reference sentences for a single segment 30 | and returns an object that encapsulates everything that BLEU 31 | needs to know about them. 32 | :param refs: list of string : reference sentences for some image 33 | :param n: int : number of ngrams for which (ngram) representation is calculated 34 | :return: result (list of dict) 35 | ''' 36 | return [precook(ref, n) for ref in refs] 37 | 38 | def cook_test(test, n=4): 39 | '''Takes a test sentence and returns an object that 40 | encapsulates everything that BLEU needs to know about it. 41 | :param test: list of string : hypothesis sentence for some image 42 | :param n: int : number of ngrams for which (ngram) representation is calculated 43 | :return: result (dict) 44 | ''' 45 | return precook(test, n, True) 46 | 47 | class CiderScorer(object): 48 | """CIDEr scorer. 49 | """ 50 | 51 | def copy(self): 52 | ''' copy the refs.''' 53 | new = CiderScorer(n=self.n) 54 | new.ctest = copy.copy(self.ctest) 55 | new.crefs = copy.copy(self.crefs) 56 | return new 57 | 58 | def __init__(self, test=None, refs=None, n=4, sigma=6.0): 59 | ''' singular instance ''' 60 | self.n = n 61 | self.sigma = sigma 62 | self.crefs = [] 63 | self.ctest = [] 64 | self.document_frequency = defaultdict(float) 65 | self.cook_append(test, refs) 66 | self.ref_len = None 67 | 68 | def cook_append(self, test, refs): 69 | '''called by constructor and __iadd__ to avoid creating new instances.''' 70 | 71 | if refs is not None: 72 | self.crefs.append(cook_refs(refs)) 73 | if test is not None: 74 | self.ctest.append(cook_test(test)) ## N.B.: -1 75 | else: 76 | self.ctest.append(None) # lens of crefs and ctest have to match 77 | 78 | def size(self): 79 | assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (len(self.crefs), len(self.ctest)) 80 | return len(self.crefs) 81 | 82 | def __iadd__(self, other): 83 | '''add an instance (e.g., from another sentence).''' 84 | 85 | if type(other) is tuple: 86 | ## avoid creating new CiderScorer instances 87 | self.cook_append(other[0], other[1]) 88 | else: 89 | self.ctest.extend(other.ctest) 90 | self.crefs.extend(other.crefs) 91 | 92 | return self 93 | def compute_doc_freq(self): 94 | ''' 95 | Compute term frequency for reference data. 96 | This will be used to compute idf (inverse document frequency later) 97 | The term frequency is stored in the object 98 | :return: None 99 | ''' 100 | for refs in self.crefs: 101 | # refs, k ref captions of one image 102 | for ngram in set([ngram for ref in refs for (ngram,count) in ref.iteritems()]): 103 | self.document_frequency[ngram] += 1 104 | # maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 105 | 106 | def compute_cider(self): 107 | def counts2vec(cnts): 108 | """ 109 | Function maps counts of ngram to vector of tfidf weights. 110 | The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights. 111 | The n-th entry of array denotes length of n-grams. 
112 | :param cnts: 113 | :return: vec (array of dict), norm (array of float), length (int) 114 | """ 115 | vec = [defaultdict(float) for _ in range(self.n)] 116 | length = 0 117 | norm = [0.0 for _ in range(self.n)] 118 | for (ngram,term_freq) in cnts.iteritems(): 119 | # give word count 1 if it doesn't appear in reference corpus 120 | df = np.log(max(1.0, self.document_frequency[ngram])) 121 | # ngram index 122 | n = len(ngram)-1 123 | # tf (term_freq) * idf (precomputed idf) for n-grams 124 | vec[n][ngram] = float(term_freq)*(self.ref_len - df) 125 | # compute norm for the vector. the norm will be used for computing similarity 126 | norm[n] += pow(vec[n][ngram], 2) 127 | 128 | if n == 1: 129 | length += term_freq 130 | norm = [np.sqrt(n) for n in norm] 131 | return vec, norm, length 132 | 133 | def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref): 134 | ''' 135 | Compute the cosine similarity of two vectors. 136 | :param vec_hyp: array of dictionary for vector corresponding to hypothesis 137 | :param vec_ref: array of dictionary for vector corresponding to reference 138 | :param norm_hyp: array of float for vector corresponding to hypothesis 139 | :param norm_ref: array of float for vector corresponding to reference 140 | :param length_hyp: int containing length of hypothesis 141 | :param length_ref: int containing length of reference 142 | :return: array of score for each n-grams cosine similarity 143 | ''' 144 | delta = float(length_hyp - length_ref) 145 | # measure consine similarity 146 | val = np.array([0.0 for _ in range(self.n)]) 147 | for n in range(self.n): 148 | # ngram 149 | for (ngram,count) in vec_hyp[n].iteritems(): 150 | # vrama91 : added clipping 151 | val[n] += min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram] 152 | 153 | if (norm_hyp[n] != 0) and (norm_ref[n] != 0): 154 | val[n] /= (norm_hyp[n]*norm_ref[n]) 155 | 156 | assert(not math.isnan(val[n])) 157 | # vrama91: added a length based gaussian penalty 158 | val[n] *= np.e**(-(delta**2)/(2*self.sigma**2)) 159 | return val 160 | 161 | # compute log reference length 162 | self.ref_len = np.log(float(len(self.crefs))) 163 | 164 | scores = [] 165 | for test, refs in zip(self.ctest, self.crefs): 166 | # compute vector for test captions 167 | vec, norm, length = counts2vec(test) 168 | # compute vector for ref captions 169 | score = np.array([0.0 for _ in range(self.n)]) 170 | for ref in refs: 171 | vec_ref, norm_ref, length_ref = counts2vec(ref) 172 | score += sim(vec, vec_ref, norm, norm_ref, length, length_ref) 173 | # change by vrama91 - mean of ngram scores, instead of sum 174 | score_avg = np.mean(score) 175 | # divide by number of references 176 | score_avg /= len(refs) 177 | # multiply score by 10 178 | score_avg *= 10.0 179 | # append score of an image to the score list 180 | scores.append(score_avg) 181 | return scores 182 | 183 | def compute_score(self, option=None, verbose=0): 184 | # compute idf 185 | self.compute_doc_freq() 186 | # assert to check document frequency 187 | assert(len(self.ctest) >= max(self.document_frequency.values())) 188 | # compute cider score 189 | score = self.compute_cider() 190 | # debug 191 | # print score 192 | return np.mean(np.array(score)), np.array(score) -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/eval.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | from tokenizer.ptbtokenizer import PTBTokenizer 3 | from bleu.bleu import Bleu 
4 | from meteor.meteor import Meteor 5 | from rouge.rouge import Rouge 6 | from cider.cider import Cider 7 | 8 | class COCOEvalCap: 9 | def __init__(self, coco, cocoRes): 10 | self.evalImgs = [] 11 | self.eval = {} 12 | self.imgToEval = {} 13 | self.coco = coco 14 | self.cocoRes = cocoRes 15 | self.params = {'image_id': coco.getImgIds()} 16 | 17 | def evaluate(self): 18 | imgIds = self.params['image_id'] 19 | # imgIds = self.coco.getImgIds() 20 | gts = {} 21 | res = {} 22 | for imgId in imgIds: 23 | gts[imgId] = self.coco.imgToAnns[imgId] 24 | res[imgId] = self.cocoRes.imgToAnns[imgId] 25 | 26 | # ================================================= 27 | # Set up scorers 28 | # ================================================= 29 | print 'tokenization...' 30 | tokenizer = PTBTokenizer() 31 | gts = tokenizer.tokenize(gts) 32 | res = tokenizer.tokenize(res) 33 | 34 | # ================================================= 35 | # Set up scorers 36 | # ================================================= 37 | print 'setting up scorers...' 38 | scorers = [ 39 | (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), 40 | (Meteor(),"METEOR"), 41 | (Rouge(), "ROUGE_L"), 42 | (Cider(), "CIDEr") 43 | ] 44 | 45 | # ================================================= 46 | # Compute scores 47 | # ================================================= 48 | for scorer, method in scorers: 49 | print 'computing %s score...'%(scorer.method()) 50 | score, scores = scorer.compute_score(gts, res) 51 | if type(method) == list: 52 | for sc, scs, m in zip(score, scores, method): 53 | self.setEval(sc, m) 54 | self.setImgToEvalImgs(scs, gts.keys(), m) 55 | print "%s: %0.3f"%(m, sc) 56 | else: 57 | self.setEval(score, method) 58 | self.setImgToEvalImgs(scores, gts.keys(), method) 59 | print "%s: %0.3f"%(method, score) 60 | self.setEvalImgs() 61 | 62 | def setEval(self, score, method): 63 | self.eval[method] = score 64 | 65 | def setImgToEvalImgs(self, scores, imgIds, method): 66 | for imgId, score in zip(imgIds, scores): 67 | if not imgId in self.imgToEval: 68 | self.imgToEval[imgId] = {} 69 | self.imgToEval[imgId]["image_id"] = imgId 70 | self.imgToEval[imgId][method] = score 71 | 72 | def setEvalImgs(self): 73 | self.evalImgs = [eval for imgId, eval in self.imgToEval.items()] -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'tylin' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/data/paraphrase-en.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/meteor/data/paraphrase-en.gz -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/meteor-1.5.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/meteor/meteor-1.5.jar -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/meteor/meteor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Python wrapper for METEOR implementation, by Xinlei Chen 4 | # 
Acknowledge Michael Denkowski for the generous discussion and help 5 | 6 | import os 7 | import sys 8 | import subprocess 9 | import threading 10 | 11 | # Assumes meteor-1.5.jar is in the same directory as meteor.py. Change as needed. 12 | METEOR_JAR = 'meteor-1.5.jar' 13 | # print METEOR_JAR 14 | 15 | class Meteor: 16 | 17 | def __init__(self): 18 | self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, \ 19 | '-', '-', '-stdio', '-l', 'en', '-norm'] 20 | self.meteor_p = subprocess.Popen(self.meteor_cmd, \ 21 | cwd=os.path.dirname(os.path.abspath(__file__)), \ 22 | stdin=subprocess.PIPE, \ 23 | stdout=subprocess.PIPE, \ 24 | stderr=subprocess.PIPE) 25 | # Used to guarantee thread safety 26 | self.lock = threading.Lock() 27 | 28 | def compute_score(self, gts, res): 29 | assert(gts.keys() == res.keys()) 30 | imgIds = gts.keys() 31 | scores = [] 32 | 33 | eval_line = 'EVAL' 34 | self.lock.acquire() 35 | for i in imgIds: 36 | assert(len(res[i]) == 1) 37 | stat = self._stat(res[i][0], gts[i]) 38 | eval_line += ' ||| {}'.format(stat) 39 | 40 | self.meteor_p.stdin.write('{}\n'.format(eval_line)) 41 | for i in range(0,len(imgIds)): 42 | scores.append(float(self.meteor_p.stdout.readline().strip())) 43 | score = float(self.meteor_p.stdout.readline().strip()) 44 | self.lock.release() 45 | 46 | return score, scores 47 | 48 | def method(self): 49 | return "METEOR" 50 | 51 | def _stat(self, hypothesis_str, reference_list): 52 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 53 | hypothesis_str = hypothesis_str.replace('|||','').replace(' ',' ') 54 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 55 | self.meteor_p.stdin.write('{}\n'.format(score_line)) 56 | return self.meteor_p.stdout.readline().strip() 57 | 58 | def _score(self, hypothesis_str, reference_list): 59 | self.lock.acquire() 60 | # SCORE ||| reference 1 words ||| reference n words ||| hypothesis words 61 | hypothesis_str = hypothesis_str.replace('|||','').replace(' ',' ') 62 | score_line = ' ||| '.join(('SCORE', ' ||| '.join(reference_list), hypothesis_str)) 63 | self.meteor_p.stdin.write('{}\n'.format(score_line)) 64 | stats = self.meteor_p.stdout.readline().strip() 65 | eval_line = 'EVAL ||| {}'.format(stats) 66 | # EVAL ||| stats 67 | self.meteor_p.stdin.write('{}\n'.format(eval_line)) 68 | score = float(self.meteor_p.stdout.readline().strip()) 69 | # bug fix: there are two values returned by the jar file, one average, and one all, so do it twice 70 | # thanks for Andrej for pointing this out 71 | score = float(self.meteor_p.stdout.readline().strip()) 72 | self.lock.release() 73 | return score 74 | 75 | def __exit__(self): 76 | self.lock.acquire() 77 | self.meteor_p.stdin.close() 78 | self.meteor_p.kill() 79 | self.meteor_p.wait() 80 | self.lock.release() 81 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/readme.md: -------------------------------------------------------------------------------- 1 | This is the MS COCO caption evaluation API downloaded from https://github.com/tylin/coco-caption. 
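A minimal usage sketch of the evaluation entry point defined in `eval.py` above. The import paths, the annotation/result file names, and the `loadRes` helper (part of the standard COCO API, whose modified copy ships in `utils/coco/coco.py`) are assumptions for illustration; only `COCOEvalCap`, `evaluate()`, and the `eval` dict come from the code in this package:

```python
# Hedged sketch: file paths and the loadRes helper are assumed, not shipped here.
from utils.coco.coco import COCO
from utils.coco.pycocoevalcap.eval import COCOEvalCap

coco = COCO('val/captions_val2014.json')      # ground-truth captions (assumed path)
coco_res = coco.loadRes('val/results.json')   # generated captions (assumed path)

evaluator = COCOEvalCap(coco, coco_res)
evaluator.evaluate()                          # tokenizes, then scores Bleu_1..4, METEOR, ROUGE_L, CIDEr
for metric, score in evaluator.eval.items():
    print('%s: %.3f' % (metric, score))
```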
2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/rouge/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'vrama91' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/rouge/rouge.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : rouge.py 4 | # 5 | # Description : Computes ROUGE-L metric as described by Lin and Hovey (2004) 6 | # 7 | # Creation Date : 2015-01-07 06:03 8 | # Author : Ramakrishna Vedantam 9 | 10 | import numpy as np 11 | import pdb 12 | 13 | def my_lcs(string, sub): 14 | """ 15 | Calculates longest common subsequence for a pair of tokenized strings 16 | :param string : list of str : tokens from a string split using whitespace 17 | :param sub : list of str : shorter string, also split using whitespace 18 | :returns: length (list of int): length of the longest common subsequence between the two strings 19 | 20 | Note: my_lcs only gives length of the longest common subsequence, not the actual LCS 21 | """ 22 | if(len(string)< len(sub)): 23 | sub, string = string, sub 24 | 25 | lengths = [[0 for i in range(0,len(sub)+1)] for j in range(0,len(string)+1)] 26 | 27 | for j in range(1,len(sub)+1): 28 | for i in range(1,len(string)+1): 29 | if(string[i-1] == sub[j-1]): 30 | lengths[i][j] = lengths[i-1][j-1] + 1 31 | else: 32 | lengths[i][j] = max(lengths[i-1][j] , lengths[i][j-1]) 33 | 34 | return lengths[len(string)][len(sub)] 35 | 36 | class Rouge(): 37 | ''' 38 | Class for computing ROUGE-L score for a set of candidate sentences for the MS COCO test set 39 | 40 | ''' 41 | def __init__(self): 42 | # vrama91: updated the value below based on discussion with Hovey 43 | self.beta = 1.2 44 | 45 | def calc_score(self, candidate, refs): 46 | """ 47 | Compute ROUGE-L score given one candidate and references for an image 48 | :param candidate: str : candidate sentence to be evaluated 49 | :param refs: list of str : COCO reference sentences for the particular image to be evaluated 50 | :returns score: int (ROUGE-L score for the candidate evaluated against references) 51 | """ 52 | assert(len(candidate)==1) 53 | assert(len(refs)>0) 54 | prec = [] 55 | rec = [] 56 | 57 | # split into tokens 58 | token_c = candidate[0].split(" ") 59 | 60 | for reference in refs: 61 | # split into tokens 62 | token_r = reference.split(" ") 63 | # compute the longest common subsequence 64 | lcs = my_lcs(token_r, token_c) 65 | prec.append(lcs/float(len(token_c))) 66 | rec.append(lcs/float(len(token_r))) 67 | 68 | prec_max = max(prec) 69 | rec_max = max(rec) 70 | 71 | if(prec_max!=0 and rec_max !=0): 72 | score = ((1 + self.beta**2)*prec_max*rec_max)/float(rec_max + self.beta**2*prec_max) 73 | else: 74 | score = 0.0 75 | return score 76 | 77 | def compute_score(self, gts, res): 78 | """ 79 | Computes Rouge-L score given a set of reference and candidate sentences for the dataset 80 | Invoked by evaluate_captions.py 81 | :param hypo_for_image: dict : candidate / test sentences with "image name" key and "tokenized sentences" as values 82 | :param ref_for_image: dict : reference MS-COCO sentences with "image name" key and "tokenized sentences" as values 83 | :returns: average_score: float (mean ROUGE-L score computed by averaging scores for all the images) 84 | """ 85 | assert(gts.keys() == res.keys()) 86 | imgIds = gts.keys() 87 | 88 | score = [] 89 | for 
id in imgIds: 90 | hypo = res[id] 91 | ref = gts[id] 92 | 93 | score.append(self.calc_score(hypo, ref)) 94 | 95 | # Sanity check. 96 | assert(type(hypo) is list) 97 | assert(len(hypo) == 1) 98 | assert(type(ref) is list) 99 | assert(len(ref) > 0) 100 | 101 | average_score = np.mean(np.array(score)) 102 | return average_score, np.array(score) 103 | 104 | def method(self): 105 | return "Rouge" 106 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'hfang' 2 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/ptbtokenizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # File Name : ptbtokenizer.py 4 | # 5 | # Description : Do the PTB Tokenization and remove punctuations. 6 | # 7 | # Creation Date : 29-12-2014 8 | # Last Modified : Thu Mar 19 09:53:35 2015 9 | # Authors : Hao Fang and Tsung-Yi Lin 10 | 11 | import os 12 | import sys 13 | import subprocess 14 | import tempfile 15 | import itertools 16 | 17 | # path to the stanford corenlp jar 18 | STANFORD_CORENLP_3_4_1_JAR = 'stanford-corenlp-3.4.1.jar' 19 | 20 | # punctuations to be removed from the sentences 21 | PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \ 22 | ".", "?", "!", ",", ":", "-", "--", "...", ";"] 23 | 24 | class PTBTokenizer: 25 | """Python wrapper of Stanford PTBTokenizer""" 26 | 27 | def tokenize(self, captions_for_image): 28 | cmd = ['java', '-cp', STANFORD_CORENLP_3_4_1_JAR, \ 29 | 'edu.stanford.nlp.process.PTBTokenizer', \ 30 | '-preserveLines', '-lowerCase'] 31 | 32 | # ====================================================== 33 | # prepare data for PTB Tokenizer 34 | # ====================================================== 35 | final_tokenized_captions_for_image = {} 36 | image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))] 37 | sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v]) 38 | 39 | # ====================================================== 40 | # save sentences to temporary file 41 | # ====================================================== 42 | path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__)) 43 | tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname) 44 | tmp_file.write(sentences) 45 | tmp_file.close() 46 | 47 | # ====================================================== 48 | # tokenize sentence 49 | # ====================================================== 50 | cmd.append(os.path.basename(tmp_file.name)) 51 | p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \ 52 | stdout=subprocess.PIPE) 53 | token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0] 54 | lines = token_lines.split('\n') 55 | # remove temp file 56 | os.remove(tmp_file.name) 57 | 58 | # ====================================================== 59 | # create dictionary for tokenized captions 60 | # ====================================================== 61 | for k, line in zip(image_id, lines): 62 | if not k in final_tokenized_captions_for_image: 63 | final_tokenized_captions_for_image[k] = [] 64 | tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \ 65 | if w not in PUNCTUATIONS]) 66 | final_tokenized_captions_for_image[k].append(tokenized_caption) 67 | 68 | return 
final_tokenized_captions_for_image 69 | -------------------------------------------------------------------------------- /utils/coco/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/coco/pycocoevalcap/tokenizer/stanford-corenlp-3.4.1.jar -------------------------------------------------------------------------------- /utils/coco/readme.md: -------------------------------------------------------------------------------- 1 | This is the MS COCO API downloaded from https://github.com/pdollar/coco. I have slightly modified it for convenience reasons. 2 | -------------------------------------------------------------------------------- /utils/ilsvrc_2012_mean.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DeepRNN/image_captioning/ee6936c3a1a8872ae7b055cfc8762fa323b01412/utils/ilsvrc_2012_mean.npy -------------------------------------------------------------------------------- /utils/misc.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import cv2 4 | import heapq 5 | 6 | class ImageLoader(object): 7 | def __init__(self, mean_file): 8 | self.bgr = True 9 | self.scale_shape = np.array([224, 224], np.int32) 10 | self.crop_shape = np.array([224, 224], np.int32) 11 | self.mean = np.load(mean_file).mean(1).mean(1) 12 | 13 | def load_image(self, image_file): 14 | """ Load and preprocess an image. """ 15 | image = cv2.imread(image_file) 16 | 17 | if self.bgr: 18 | temp = image.swapaxes(0, 2) 19 | temp = temp[::-1] 20 | image = temp.swapaxes(0, 2) 21 | 22 | image = cv2.resize(image, (self.scale_shape[0], self.scale_shape[1])) 23 | offset = (self.scale_shape - self.crop_shape) / 2 24 | offset = offset.astype(np.int32) 25 | image = image[offset[0]:offset[0]+self.crop_shape[0], 26 | offset[1]:offset[1]+self.crop_shape[1]] 27 | image = image - self.mean 28 | return image 29 | 30 | def load_images(self, image_files): 31 | """ Load and preprocess a list of images. 
""" 32 | images = [] 33 | for image_file in image_files: 34 | images.append(self.load_image(image_file)) 35 | images = np.array(images, np.float32) 36 | return images 37 | 38 | class CaptionData(object): 39 | def __init__(self, sentence, memory, output, score): 40 | self.sentence = sentence 41 | self.memory = memory 42 | self.output = output 43 | self.score = score 44 | 45 | def __cmp__(self, other): 46 | assert isinstance(other, CaptionData) 47 | if self.score == other.score: 48 | return 0 49 | elif self.score < other.score: 50 | return -1 51 | else: 52 | return 1 53 | 54 | def __lt__(self, other): 55 | assert isinstance(other, CaptionData) 56 | return self.score < other.score 57 | 58 | def __eq__(self, other): 59 | assert isinstance(other, CaptionData) 60 | return self.score == other.score 61 | 62 | class TopN(object): 63 | def __init__(self, n): 64 | self._n = n 65 | self._data = [] 66 | 67 | def size(self): 68 | assert self._data is not None 69 | return len(self._data) 70 | 71 | def push(self, x): 72 | assert self._data is not None 73 | if len(self._data) < self._n: 74 | heapq.heappush(self._data, x) 75 | else: 76 | heapq.heappushpop(self._data, x) 77 | 78 | def extract(self, sort=False): 79 | assert self._data is not None 80 | data = self._data 81 | self._data = None 82 | if sort: 83 | data.sort(reverse=True) 84 | return data 85 | 86 | def reset(self): 87 | self._data = [] 88 | -------------------------------------------------------------------------------- /utils/nn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow.contrib.layers as layers 3 | 4 | class NN(object): 5 | def __init__(self, config): 6 | self.config = config 7 | self.is_train = True if config.phase == 'train' else False 8 | self.train_cnn = self.is_train and config.train_cnn 9 | self.prepare() 10 | 11 | def prepare(self): 12 | """ Setup the weight initalizers and regularizers. """ 13 | config = self.config 14 | 15 | self.conv_kernel_initializer = layers.xavier_initializer() 16 | 17 | if self.train_cnn and config.conv_kernel_regularizer_scale > 0: 18 | self.conv_kernel_regularizer = layers.l2_regularizer( 19 | scale = config.conv_kernel_regularizer_scale) 20 | else: 21 | self.conv_kernel_regularizer = None 22 | 23 | if self.train_cnn and config.conv_activity_regularizer_scale > 0: 24 | self.conv_activity_regularizer = layers.l1_regularizer( 25 | scale = config.conv_activity_regularizer_scale) 26 | else: 27 | self.conv_activity_regularizer = None 28 | 29 | self.fc_kernel_initializer = tf.random_uniform_initializer( 30 | minval = -config.fc_kernel_initializer_scale, 31 | maxval = config.fc_kernel_initializer_scale) 32 | 33 | if self.is_train and config.fc_kernel_regularizer_scale > 0: 34 | self.fc_kernel_regularizer = layers.l2_regularizer( 35 | scale = config.fc_kernel_regularizer_scale) 36 | else: 37 | self.fc_kernel_regularizer = None 38 | 39 | if self.is_train and config.fc_activity_regularizer_scale > 0: 40 | self.fc_activity_regularizer = layers.l1_regularizer( 41 | scale = config.fc_activity_regularizer_scale) 42 | else: 43 | self.fc_activity_regularizer = None 44 | 45 | def conv2d(self, 46 | inputs, 47 | filters, 48 | kernel_size = (3, 3), 49 | strides = (1, 1), 50 | activation = tf.nn.relu, 51 | use_bias = True, 52 | name = None): 53 | """ 2D Convolution layer. 
""" 54 | if activation is not None: 55 | activity_regularizer = self.conv_activity_regularizer 56 | else: 57 | activity_regularizer = None 58 | return tf.layers.conv2d( 59 | inputs = inputs, 60 | filters = filters, 61 | kernel_size = kernel_size, 62 | strides = strides, 63 | padding='same', 64 | activation = activation, 65 | use_bias = use_bias, 66 | trainable = self.train_cnn, 67 | kernel_initializer = self.conv_kernel_initializer, 68 | kernel_regularizer = self.conv_kernel_regularizer, 69 | activity_regularizer = activity_regularizer, 70 | name = name) 71 | 72 | def max_pool2d(self, 73 | inputs, 74 | pool_size = (2, 2), 75 | strides = (2, 2), 76 | name = None): 77 | """ 2D Max Pooling layer. """ 78 | return tf.layers.max_pooling2d( 79 | inputs = inputs, 80 | pool_size = pool_size, 81 | strides = strides, 82 | padding='same', 83 | name = name) 84 | 85 | def dense(self, 86 | inputs, 87 | units, 88 | activation = tf.tanh, 89 | use_bias = True, 90 | name = None): 91 | """ Fully-connected layer. """ 92 | if activation is not None: 93 | activity_regularizer = self.fc_activity_regularizer 94 | else: 95 | activity_regularizer = None 96 | return tf.layers.dense( 97 | inputs = inputs, 98 | units = units, 99 | activation = activation, 100 | use_bias = use_bias, 101 | trainable = self.is_train, 102 | kernel_initializer = self.fc_kernel_initializer, 103 | kernel_regularizer = self.fc_kernel_regularizer, 104 | activity_regularizer = activity_regularizer, 105 | name = name) 106 | 107 | def dropout(self, 108 | inputs, 109 | name = None): 110 | """ Dropout layer. """ 111 | return tf.layers.dropout( 112 | inputs = inputs, 113 | rate = self.config.fc_drop_rate, 114 | training = self.is_train) 115 | 116 | def batch_norm(self, 117 | inputs, 118 | name = None): 119 | """ Batch normalization layer. """ 120 | return tf.layers.batch_normalization( 121 | inputs = inputs, 122 | training = self.train_cnn, 123 | trainable = self.train_cnn, 124 | name = name 125 | ) 126 | -------------------------------------------------------------------------------- /utils/vocabulary.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | from tqdm import tqdm 5 | import string 6 | from nltk.tokenize import word_tokenize 7 | 8 | class Vocabulary(object): 9 | def __init__(self, size, save_file=None): 10 | self.words = [] 11 | self.word2idx = {} 12 | self.word_frequencies = [] 13 | self.size = size 14 | if save_file is not None: 15 | self.load(save_file) 16 | 17 | def build(self, sentences): 18 | """ Build the vocabulary and compute the frequency of each word. 
""" 19 | word_counts = {} 20 | for sentence in tqdm(sentences): 21 | for w in word_tokenize(sentence.lower()): 22 | word_counts[w] = word_counts.get(w, 0) + 1.0 23 | 24 | assert self.size-1 <= len(word_counts.keys()) 25 | self.words.append('') 26 | self.word2idx[''] = 0 27 | self.word_frequencies.append(1.0) 28 | 29 | word_counts = sorted(list(word_counts.items()), 30 | key=lambda x: x[1], 31 | reverse=True) 32 | 33 | for idx in range(self.size-1): 34 | word, frequency = word_counts[idx] 35 | self.words.append(word) 36 | self.word2idx[word] = idx + 1 37 | self.word_frequencies.append(frequency) 38 | 39 | self.word_frequencies = np.array(self.word_frequencies) 40 | self.word_frequencies /= np.sum(self.word_frequencies) 41 | self.word_frequencies = np.log(self.word_frequencies) 42 | self.word_frequencies -= np.max(self.word_frequencies) 43 | 44 | def process_sentence(self, sentence): 45 | """ Tokenize a sentence, and translate each token into its index 46 | in the vocabulary. """ 47 | words = word_tokenize(sentence.lower()) 48 | word_idxs = [self.word2idx[w] for w in words] 49 | return word_idxs 50 | 51 | def get_sentence(self, idxs): 52 | """ Translate a vector of indicies into a sentence. """ 53 | words = [self.words[i] for i in idxs] 54 | if words[-1] != '.': 55 | words.append('.') 56 | length = np.argmax(np.array(words)=='.') + 1 57 | words = words[:length] 58 | sentence = "".join([" "+w if not w.startswith("'") \ 59 | and w not in string.punctuation \ 60 | else w for w in words]).strip() 61 | return sentence 62 | 63 | def save(self, save_file): 64 | """ Save the vocabulary to a file. """ 65 | data = pd.DataFrame({'word': self.words, 66 | 'index': list(range(self.size)), 67 | 'frequency': self.word_frequencies}) 68 | data.to_csv(save_file) 69 | 70 | def load(self, save_file): 71 | """ Load the vocabulary from a file. """ 72 | assert os.path.exists(save_file) 73 | data = pd.read_csv(save_file) 74 | self.words = data['word'].values 75 | self.word2idx = {self.words[i]:i for i in range(self.size)} 76 | self.word_frequencies = data['frequency'].values 77 | -------------------------------------------------------------------------------- /val/images/readme: -------------------------------------------------------------------------------- 1 | Put the COCO val2014 images here. 2 | -------------------------------------------------------------------------------- /val/readme: -------------------------------------------------------------------------------- 1 | Put the file captions_val2014.json here. 2 | --------------------------------------------------------------------------------