├── README.md
└── image2txt model
    ├── README.md
    ├── ShowAndTellModel.py
    ├── caption_generator.py
    ├── coco_utils.py
    ├── configuration.py
    ├── extract_features.py
    ├── image_utils.py
    ├── inference_on_folder_beam.py
    ├── inference_on_folder_sample.py
    ├── prepare_captions.py
    ├── prepare_glove_matrix.py
    ├── rename_images_in_sequence.py
    ├── test.py
    └── train.py
/README.md:
--------------------------------------------------------------------------------
 1 | # Image Captioning Model in TensorFlow
 2 | 
 3 | This repo accompanies the blog post at:
 4 | https://vanishingcodes.wordpress.com/2017/03/20/using-tensorflow-to-build-image-to-text-deep-learning-application/
 5 | 
 6 | This repo contains an image captioning model implemented in TensorFlow. The model trains on the MSCOCO data set, which can be downloaded from:
 7 | http://mscoco.org/dataset/#download
 8 | 
 9 | The model is a simplified version of Google's ShowAndTell model: https://github.com/tensorflow/models/tree/master/im2txt#prepare-the-training-data
10 | 
11 | The model first extracts all image features and saves them locally as numpy arrays, then builds and trains the LSTM on those cached features. This saves a lot of training time, at the cost of some accuracy, because the image encoder is not fine-tuned while the LSTM trains, unlike Google's approach.
12 | 
13 | In addition, there are some other differences:
14 | 1. Not using ensembling.
15 | 2. Not using partially guided training.
16 | 3. Not using BLEU score to monitor validation.
17 | 
18 | ## How to run the scripts
19 | 
20 | 1. Download captions_train2014.json and captions_val2014.json from the link above, as well as the train2014 (80K) and val2014 (40K) images. Save the json files into a folder named ../COCO_captioning/, and the train and val images into ../train2014/ and ../val2014/ respectively.
21 | 
22 | 2. Run the command below to prepare the data sets.
23 | ```shell
24 | python prepare_captions.py --file_dir /home/ubuntu/COCO/dataset/COCO_captioning/ --total_vocab 2000 --padding_len 25
25 | ```
26 | 3. Run the commands below to extract features using the pretrained Inception V3 model, and save them to train2014_v3_pool_3.npy and val2014_v3_pool_3.npy.
27 | ```shell
28 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/train2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/train2014_v3_pool_3
29 | 
30 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/val2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/val2014_v3_pool_3
31 | ```
32 | 4. Build and train the model.
33 | ```shell
34 | python train.py --savedSession_dir [dir where your sessions will be saved] --data_dir [dir where all training data are saved, which was generated in step 2] --glove_vocab [path to GloVe word vectors, none if not needed] --sample_dir [dir to save all intermediate validation sample images during training] --print_every [num of steps to print training loss; 0 for not printing] --sample_every [num of steps to generate captions on validation images; 0 for not sampling] --saveModel_every [num of steps to save the model checkpoint; 0 for not saving]
35 | ```
36 | 5.
Run inference on a folder of test images, using beam search.
37 | ```shell
38 | python inference_on_folder_beam.py --pretrain_dir [path to the pretrained Inception V3 model; if not found, it will be downloaded from the web] --test_dir [path to the dir of test images you want to run inference on] --results_dir [path to the dir where the test results will be saved] --saved_sess [saved checkpoint] --dict_file [path to the dictionary file generated in step 2, e.g. coco2014_vocab.json]
39 | ```
40 | 
--------------------------------------------------------------------------------
/image2txt model/README.md:
--------------------------------------------------------------------------------
 1 | ## Steps to run
 2 | 
 3 | 1: Run prepare_captions.py to generate coco2014_captions.h5 (which contains train_captions, train_image_idx, val_captions and val_image_idx), the train and val image URL files, and two dict files - train_image_id_to_idx.csv and val_image_id_to_idx.csv.
 4 | 
 5 | ```shell
 6 | python prepare_captions.py --file_dir /home/ubuntu/COCO/dataset/COCO_captioning/ --total_vocab 2000 --padding_len 25
 7 | ```
 8 | 
 9 | 2: Run rename_images_in_sequence.py to rename all images in both train2014 and val2014, using the two dict files generated in the previous step. You only need to do this step once.
10 | 
11 | ```shell
12 | python rename_images_in_sequence.py --dict_dir /home/ubuntu/COCO/dataset/COCO_captioning/train_image_id_to_idx.csv --image_dir /home/ubuntu/COCO/dataset/train2014
13 | python rename_images_in_sequence.py --dict_dir /home/ubuntu/COCO/dataset/COCO_captioning/val_image_id_to_idx.csv --image_dir /home/ubuntu/COCO/dataset/val2014
14 | ```
15 | 
16 | 3: Extract features. Run extract_features.py to extract Inception V3 features for each of the renamed train2014 and val2014 images in sequence. You only need to do this step once.
17 | 
18 | ```shell
19 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/train2014 --save_dir /home/ubuntu/COCO/dataset/train2014_v3_pool_3 --verbose 500
20 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/val2014 --save_dir /home/ubuntu/COCO/dataset/val2014_v3_pool_3 --verbose 500
21 | ```
--------------------------------------------------------------------------------
/image2txt model/ShowAndTellModel.py:
--------------------------------------------------------------------------------
 1 | 
 2 | """Builds the model.
 3 | 
 4 | Inputs:
 5 | image_feature
 6 | input_seqs
 7 | keep_prob
 8 | target_seqs
 9 | input_mask
10 | Outputs:
11 | total_loss
12 | preds
13 | """
14 | 
15 | import tensorflow as tf
16 | 
17 | def build_model(config, mode, inference_batch = None, glove_vocab = None):
18 | 
19 | """Basic setup.
20 | 
21 | Args:
22 | config: Object containing configuration parameters.
23 | mode: "train" or "inference".
24 | inference_batch: if mode is 'inference', the batch size of the input data must be provided. Otherwise, leave it as None.
25 | glove_vocab: optional matrix of shape [config.vocab_size, config.embedding_size] holding GloVe vectors used to initialize the vocab embeddings. If None, the embeddings are learned from scratch.
26 | """
27 | assert mode in ["train", "inference"]
28 | if mode == 'inference' and inference_batch is None:
29 | raise ValueError("In inference mode, inference_batch must be provided!")
30 | config = config
31 | 
32 | # To match the "Show and Tell" paper we initialize all variables with a
33 | # random uniform initializer.
34 | initializer = tf.random_uniform_initializer( 35 | minval=-config.initializer_scale, 36 | maxval=config.initializer_scale) 37 | 38 | # An int32 Tensor with shape [batch_size, padded_length]. 39 | input_seqs = tf.placeholder(tf.int32, [None, None], name='input_seqs') 40 | 41 | # An int32 Tensor with shape [batch_size, padded_length]. 42 | target_seqs = tf.placeholder(tf.int32, [None, None], name='target_seqs') 43 | 44 | # A float32 Tensor with shape [1] 45 | keep_prob = tf.placeholder(tf.float32, name='keep_prob') 46 | 47 | # An int32 0/1 Tensor with shape [batch_size, padded_length]. 48 | input_mask = tf.placeholder(tf.int32, [None, None], name='input_mask') 49 | 50 | # A float32 Tensor with shape [batch_size, image_feature_size]. 51 | image_feature = tf.placeholder(tf.float32, [None, config.image_feature_size], name='image_feature') 52 | 53 | # A float32 Tensor with shape [batch_size, padded_length, embedding_size]. 54 | seq_embedding = None 55 | 56 | # A float32 scalar Tensor; the total loss for the trainer to optimize. 57 | total_loss = None 58 | 59 | # A float32 Tensor with shape [batch_size * padded_length]. 60 | target_cross_entropy_losses = None 61 | 62 | # A float32 Tensor with shape [batch_size * padded_length]. 63 | target_cross_entropy_loss_weights = None 64 | 65 | # Collection of variables from the inception submodel. 66 | inception_variables = [] 67 | 68 | # Global step Tensor. 69 | global_step = None 70 | 71 | """Sets up the global step Tensor.""" 72 | global_step = tf.Variable( 73 | initial_value=0, 74 | name="global_step", 75 | trainable=False, 76 | collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) 77 | 78 | ### Builds the input sequence embeddings ### 79 | # Inputs: 80 | # self.input_seqs 81 | # Outputs: 82 | # self.seq_embeddings 83 | ############################################ 84 | 85 | with tf.variable_scope("seq_embedding"), tf.device("/cpu:0"): 86 | if glove_vocab is None: 87 | embedding_map = tf.get_variable( 88 | name="map", 89 | shape=[config.vocab_size, config.embedding_size], 90 | initializer=initializer) 91 | else: 92 | init = tf.constant(glove_vocab.astype('float32')) 93 | embedding_map = tf.get_variable( 94 | name="map", 95 | initializer=init) 96 | seq_embedding = tf.nn.embedding_lookup(embedding_map, input_seqs) 97 | 98 | ############ Builds the model ############## 99 | # Inputs: 100 | # self.image_feature 101 | # self.seq_embeddings 102 | # self.target_seqs (training and eval only) 103 | # self.input_mask (training and eval only) 104 | # Outputs: 105 | # self.total_loss (training and eval only) 106 | # self.target_cross_entropy_losses (training and eval only) 107 | # self.target_cross_entropy_loss_weights (training and eval only) 108 | ############################################ 109 | 110 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 111 | num_units=config.num_lstm_units, state_is_tuple=True) 112 | 113 | lstm_cell = tf.nn.rnn_cell.DropoutWrapper( 114 | lstm_cell, 115 | input_keep_prob=keep_prob, 116 | output_keep_prob=keep_prob) 117 | 118 | with tf.variable_scope("lstm", initializer=initializer) as lstm_scope: 119 | 120 | # Feed the image embeddings to set the initial LSTM state. 
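# Note on how the image conditions the LSTM below: the 2048-d Inception v3
# feature is projected to embedding_size by a fully connected layer, fed
# through the LSTM once starting from the zero state, and the resulting state
# is used as the initial_state for the word sequence. The image is therefore
# only "seen" at step 0, as in the Show and Tell setup.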
121 | if mode == 'train': 122 | zero_state = lstm_cell.zero_state( 123 | batch_size=config.batch_size, dtype=tf.float32) 124 | elif mode == 'inference': 125 | zero_state = lstm_cell.zero_state( 126 | batch_size=inference_batch, dtype=tf.float32) 127 | 128 | with tf.variable_scope('image_embeddings'): 129 | image_embeddings = tf.contrib.layers.fully_connected( 130 | inputs=image_feature, 131 | num_outputs=config.embedding_size, 132 | activation_fn=None, 133 | weights_initializer=initializer, 134 | biases_initializer=None) 135 | 136 | _, initial_state = lstm_cell(image_embeddings, zero_state) 137 | 138 | # Allow the LSTM variables to be reused. 139 | lstm_scope.reuse_variables() 140 | 141 | # Run the batch of sequence embeddings through the LSTM. 142 | sequence_length = tf.reduce_sum(input_mask, 1) 143 | lstm_outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell, 144 | inputs=seq_embedding, 145 | sequence_length=sequence_length, 146 | initial_state=initial_state, 147 | dtype=tf.float32, 148 | scope=lstm_scope) 149 | 150 | # Stack batches vertically. 151 | lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size]) # output_size == 256 152 | 153 | with tf.variable_scope('logits'): 154 | W = tf.get_variable('W', [lstm_cell.output_size, config.vocab_size], initializer=initializer) 155 | b = tf.get_variable('b', [config.vocab_size], initializer=tf.constant_initializer(0.0)) 156 | 157 | logits = tf.matmul(lstm_outputs, W) + b # logits: [batch_size * padded_length, config.vocab_size] 158 | 159 | ###### for inference & validation only ####### 160 | softmax = tf.nn.softmax(logits) 161 | preds = tf.argmax(softmax, 1) 162 | ############################################## 163 | 164 | # for training only below 165 | targets = tf.reshape(target_seqs, [-1]) 166 | weights = tf.to_float(tf.reshape(input_mask, [-1])) 167 | 168 | # Compute losses. 169 | losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, 170 | logits=logits) 171 | batch_loss = tf.div(tf.reduce_sum(tf.multiply(losses, weights)), 172 | tf.reduce_sum(weights), 173 | name="batch_loss") 174 | tf.contrib.losses.add_loss(batch_loss) 175 | total_loss = tf.contrib.losses.get_total_loss() 176 | 177 | # target_cross_entropy_losses = losses # Used in evaluation. 178 | # target_cross_entropy_loss_weights = weights # Used in evaluation. 179 | 180 | return dict( 181 | total_loss = total_loss, 182 | global_step = global_step, 183 | image_feature = image_feature, 184 | input_mask = input_mask, 185 | target_seqs = target_seqs, 186 | input_seqs = input_seqs, 187 | final_state = final_state, 188 | initial_state = initial_state, 189 | softmax = softmax, 190 | preds = preds, 191 | keep_prob = keep_prob, 192 | saver = tf.train.Saver() 193 | ) 194 | 195 | -------------------------------------------------------------------------------- /image2txt model/caption_generator.py: -------------------------------------------------------------------------------- 1 | """Class for generating captions from an image-to-text model. 2 | This is based on Google's https://github.com/tensorflow/models/blob/master/im2txt/im2txt/inference_utils/caption_generator.py 3 | """ 4 | 5 | from __future__ import absolute_import 6 | from __future__ import division 7 | from __future__ import print_function 8 | 9 | import heapq 10 | import math 11 | 12 | import numpy as np 13 | 14 | class Caption(object): 15 | """Represents a complete or partial caption.""" 16 | 17 | def __init__(self, sentence, state, logprob, score, metadata=None): 18 | """Initializes the Caption. 
19 | Args: 20 | sentence: List of word ids in the caption. 21 | state: Model state after generating the previous word. 22 | logprob: Log-probability of the caption. 23 | score: Score of the caption. 24 | metadata: Optional metadata associated with the partial sentence. If not 25 | None, a list of strings with the same length as 'sentence'. 26 | """ 27 | self.sentence = sentence 28 | self.state = state 29 | self.logprob = logprob 30 | self.score = score 31 | self.metadata = metadata 32 | 33 | def __cmp__(self, other): 34 | """Compares Captions by score.""" 35 | assert isinstance(other, Caption) 36 | if self.score == other.score: 37 | return 0 38 | elif self.score < other.score: 39 | return -1 40 | else: 41 | return 1 42 | 43 | # For Python 3 compatibility (__cmp__ is deprecated). 44 | def __lt__(self, other): 45 | assert isinstance(other, Caption) 46 | return self.score < other.score 47 | 48 | # Also for Python 3 compatibility. 49 | def __eq__(self, other): 50 | assert isinstance(other, Caption) 51 | return self.score == other.score 52 | 53 | 54 | class TopN(object): 55 | """Maintains the top n elements of an incrementally provided set.""" 56 | 57 | def __init__(self, n): 58 | self._n = n 59 | self._data = [] 60 | 61 | def size(self): 62 | assert self._data is not None 63 | return len(self._data) 64 | 65 | def push(self, x): 66 | """Pushes a new element.""" 67 | assert self._data is not None 68 | if len(self._data) < self._n: 69 | heapq.heappush(self._data, x) 70 | else: 71 | heapq.heappushpop(self._data, x) 72 | 73 | def extract(self, sort=False): 74 | """Extracts all elements from the TopN. This is a destructive operation. 75 | The only method that can be called immediately after extract() is reset(). 76 | Args: 77 | sort: Whether to return the elements in descending sorted order. 78 | Returns: 79 | A list of data; the top n elements provided to the set. 80 | """ 81 | assert self._data is not None 82 | data = self._data 83 | self._data = None 84 | if sort: 85 | data.sort(reverse=True) 86 | return data 87 | 88 | def reset(self): 89 | """Returns the TopN to an empty state.""" 90 | self._data = [] 91 | 92 | 93 | class CaptionGenerator(object): 94 | """Class to generate captions from an image-to-text model.""" 95 | 96 | def __init__(self, 97 | model, 98 | vocab, 99 | beam_size=3, 100 | max_caption_length=24, 101 | length_normalization_factor=0.0): 102 | """Initializes the generator. 103 | Args: 104 | model: Object encapsulating a trained image-to-text model. Must have 105 | methods feed_image() and inference_step(). For example, an instance of 106 | InferenceWrapperBase. 107 | vocab: A Vocabulary object. 108 | beam_size: Beam size to use when generating captions. 109 | max_caption_length: The maximum caption length before stopping the search. 110 | length_normalization_factor: If != 0, a number x such that captions are 111 | scored by logprob/length^x, rather than logprob. This changes the 112 | relative scores of captions depending on their lengths. For example, if 113 | x > 0 then longer captions will be favored. 
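For example (illustrative numbers only): with beam_size=3 the generator keeps
the 3 highest-scoring partial captions at every step, and with
length_normalization_factor=0.7 a completed caption of 12 tokens would be
re-scored as logprob / 12**0.7.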
114 | """ 115 | self.vocab = vocab 116 | self.model = model 117 | 118 | self.beam_size = beam_size 119 | self.max_caption_length = max_caption_length 120 | self.length_normalization_factor = length_normalization_factor 121 | 122 | def _feed_image(self, sess, feature): 123 | # get initial state using image feature 124 | feed_dict = {self.model['image_feature']: feature, 125 | self.model['keep_prob']: 1.0} 126 | state = sess.run(self.model['initial_state'], feed_dict=feed_dict) 127 | return state 128 | 129 | def _inference_step(self, sess, input_feed_list, state_feed_list, max_caption_length): 130 | 131 | mask = np.zeros((1, max_caption_length)) 132 | mask[:, 0] = 1 133 | softmax_outputs = [] 134 | new_state_outputs = [] 135 | 136 | for input, state in zip(input_feed_list, state_feed_list): 137 | feed_dict={self.model['input_seqs']: input, 138 | self.model['initial_state']: state, 139 | self.model['input_mask']: mask, 140 | self.model['keep_prob']: 1.0} 141 | softmax, new_state = sess.run([self.model['softmax'], self.model['final_state']], feed_dict=feed_dict) 142 | softmax_outputs.append(softmax) 143 | new_state_outputs.append(new_state) 144 | 145 | return softmax_outputs, new_state_outputs, None 146 | 147 | def beam_search(self, sess, feature): 148 | """Runs beam search caption generation on a single image. 149 | Args: 150 | sess: TensorFlow Session object. 151 | feature: extracted V3 feature of one image. 152 | Returns: 153 | A list of Caption sorted by descending score. 154 | """ 155 | # Feed in the image to get the initial state. 156 | initial_state = self._feed_image(sess, feature) 157 | 158 | initial_beam = Caption( 159 | sentence=[self.vocab['']], 160 | state=initial_state, 161 | logprob=0.0, 162 | score=0.0, 163 | metadata=[""]) 164 | partial_captions = TopN(self.beam_size) 165 | partial_captions.push(initial_beam) 166 | complete_captions = TopN(self.beam_size) 167 | 168 | # Run beam search. 169 | for _ in range(self.max_caption_length - 1): 170 | partial_captions_list = partial_captions.extract() 171 | partial_captions.reset() 172 | input_feed = [np.array([c.sentence[-1]]).reshape(1, 1) for c in partial_captions_list] 173 | state_feed = [c.state for c in partial_captions_list] 174 | 175 | softmax, new_states, metadata = self._inference_step(sess, 176 | input_feed, 177 | state_feed, 178 | self.max_caption_length) 179 | 180 | for i, partial_caption in enumerate(partial_captions_list): 181 | word_probabilities = softmax[i][0] 182 | state = new_states[i] 183 | # For this partial caption, get the beam_size most probable next words. 184 | words_and_probs = list(enumerate(word_probabilities)) 185 | words_and_probs.sort(key=lambda x: -x[1]) 186 | words_and_probs = words_and_probs[0:self.beam_size] 187 | # Each next word gives a new partial caption. 188 | for w, p in words_and_probs: 189 | if p < 1e-12: 190 | continue # Avoid log(0). 
191 | sentence = partial_caption.sentence + [w] 192 | logprob = partial_caption.logprob + math.log(p) 193 | score = logprob 194 | if metadata: 195 | metadata_list = partial_caption.metadata + [metadata[i]] 196 | else: 197 | metadata_list = None 198 | if w == self.vocab['']: 199 | if self.length_normalization_factor > 0: 200 | score /= len(sentence)**self.length_normalization_factor 201 | beam = Caption(sentence, state, logprob, score, metadata_list) 202 | complete_captions.push(beam) 203 | else: 204 | beam = Caption(sentence, state, logprob, score, metadata_list) 205 | partial_captions.push(beam) 206 | if partial_captions.size() == 0: 207 | # We have run out of partial candidates; happens when beam_size = 1. 208 | break 209 | 210 | # If we have no complete captions then fall back to the partial captions. 211 | # But never output a mixture of complete and partial captions because a 212 | # partial caption could have a higher score than all the complete captions. 213 | if not complete_captions.size(): 214 | complete_captions = partial_captions 215 | 216 | return complete_captions.extract(sort=True) -------------------------------------------------------------------------------- /image2txt model/coco_utils.py: -------------------------------------------------------------------------------- 1 | 2 | """Util functions for handling caption data""" 3 | 4 | import os, json 5 | import numpy as np 6 | import h5py 7 | 8 | 9 | def load_coco_data(base_dir='/home/ubuntu/COCO/dataset/COCO_captioning/', 10 | max_train=None): 11 | data = {} 12 | 13 | # loading train&val captions, and train&val image index 14 | caption_file = os.path.join(base_dir, 'coco2014_captions.h5') 15 | with h5py.File(caption_file, 'r') as f: # keys are: train_captions, val_captions, train_image_idxs, val_image_idxs 16 | for k, v in f.items(): 17 | data[k] = np.asarray(v) 18 | 19 | train_feat_file = os.path.join(base_dir, 'train2014_v3_pool_3.npy') 20 | data['train_features'] = np.load(train_feat_file) 21 | 22 | val_feat_file = os.path.join(base_dir, 'val2014_v3_pool_3.npy') 23 | data['val_features'] = np.load(val_feat_file) 24 | 25 | dict_file = os.path.join(base_dir, 'coco2014_vocab.json') 26 | with open(dict_file, 'r') as f: 27 | dict_data = json.load(f) 28 | for k, v in dict_data.items(): 29 | data[k] = v 30 | # convert string to int for the keys 31 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 32 | 33 | train_url_file = os.path.join(base_dir, 'train2014_urls.txt') 34 | with open(train_url_file, 'r') as f: 35 | train_urls = np.asarray([line.strip() for line in f]) 36 | data['train_urls'] = train_urls 37 | 38 | val_url_file = os.path.join(base_dir, 'val2014_urls.txt') 39 | with open(val_url_file, 'r') as f: 40 | val_urls = np.asarray([line.strip() for line in f]) 41 | data['val_urls'] = val_urls 42 | 43 | # Maybe subsample the training data 44 | if max_train is not None: 45 | num_train = data['train_captions'].shape[0] 46 | mask = np.random.randint(num_train, size=max_train) 47 | data['train_captions'] = data['train_captions'][mask] 48 | data['train_image_idx'] = data['train_image_idx'][mask] 49 | 50 | return data 51 | 52 | 53 | def decode_captions(captions, idx_to_word): 54 | singleton = False 55 | if captions.ndim == 1: 56 | singleton = True 57 | captions = captions[None] 58 | decoded = [] 59 | N, T = captions.shape 60 | for i in range(N): 61 | words = [] 62 | for t in range(T): 63 | word = idx_to_word[captions[i, t]] 64 | if word != '': 65 | words.append(word) 66 | if word == '': 67 | break 68 | 
decoded.append(' '.join(words)) 69 | if singleton: 70 | decoded = decoded[0] 71 | return decoded 72 | 73 | 74 | def sample_coco_minibatch(data, batch_size=100, split='train'): 75 | split_size = data['%s_captions' % split].shape[0] 76 | mask = np.random.choice(split_size, batch_size) 77 | captions = data['%s_captions' % split][mask] 78 | image_idxs = data['%s_image_idx' % split][mask] 79 | image_features = data['%s_features' % split][image_idxs] 80 | urls = data['%s_urls' % split][image_idxs] 81 | return captions, image_features, urls 82 | 83 | -------------------------------------------------------------------------------- /image2txt model/configuration.py: -------------------------------------------------------------------------------- 1 | 2 | """Image-to-text model and training configurations.""" 3 | 4 | class ModelConfig(object): 5 | """Wrapper class for model hyperparameters.""" 6 | 7 | def __init__(self): 8 | """Sets the default model hyperparameters.""" 9 | 10 | # Number of unique words in the vocab (plus 4, for , , , ) 11 | # This one depends on your chosen vocab size in the preprocessing steps. Normally 12 | # 5,000 might be a good choice since top 5,000 have covered most of the common words 13 | # appear in the data set. The rest not included in the vocab will be used as 14 | self.vocab_size = 5004 15 | 16 | # Batch size. 17 | self.batch_size = 32 18 | 19 | # Scale used to initialize model variables. 20 | self.initializer_scale = 0.08 21 | 22 | # LSTM input and output dimensionality, respectively. 23 | self.image_feature_size = 2048 # equal to output layer size from inception v3 24 | self.num_lstm_units = 512 25 | self.embedding_size = 512 26 | 27 | # If < 1.0, the dropout keep probability applied to LSTM variables. 28 | self.lstm_dropout_keep_prob = 0.7 29 | 30 | # length of each caption after padding 31 | self.padded_length = 25 32 | 33 | # special wording 34 | self._null = 0 35 | self._start = 1 36 | self._end = 2 37 | 38 | class TrainingConfig(object): 39 | """Wrapper class for training hyperparameters.""" 40 | 41 | def __init__(self): 42 | """Sets the default training hyperparameters.""" 43 | # Number of examples per epoch of training data. 44 | #self.num_examples_per_epoch = 586363 45 | self.num_examples_per_epoch = 400000 46 | 47 | # Optimizer for training the model. 48 | self.optimizer = "SGD" # "SGD" 49 | 50 | # Learning rate for the initial phase of training. 51 | self.initial_learning_rate = 2.0 52 | self.learning_rate_decay_factor = 0.5 53 | self.num_epochs_per_decay = 8.0 54 | 55 | # If not None, clip gradients to this value. 
56 | self.clip_gradients = 5.0 57 | 58 | self.total_num_epochs = 5 59 | -------------------------------------------------------------------------------- /image2txt model/extract_features.py: -------------------------------------------------------------------------------- 1 | 2 | """Extraction image features using pretrained Inception V3, and save as numpy arrays in local""" 3 | 4 | import argparse 5 | import os.path, os 6 | import re 7 | import sys 8 | import tarfile 9 | 10 | import numpy as np 11 | from six.moves import urllib 12 | import tensorflow as tf 13 | 14 | FLAGS = None 15 | pretrain_model_name = 'classify_image_graph_def.pb' 16 | layer_to_extract = 'pool_3:0' 17 | save_dir = '/home/ubuntu/COCO/dataset/train2014_v3_pool_3' 18 | 19 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 20 | #MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz' 21 | 22 | def maybe_download_and_extract(): 23 | """Download and extract model tar file.""" 24 | dest_directory = FLAGS.model_dir 25 | if not os.path.exists(dest_directory): 26 | os.makedirs(dest_directory) 27 | filename = MODEL_URL.split('/')[-1] 28 | filepath = os.path.join(dest_directory, filename) 29 | if not os.path.exists(filepath): 30 | def _progress(count, block_size, total_size): 31 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 32 | filename, float(count * block_size) / float(total_size) * 100.0)) 33 | sys.stdout.flush() 34 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 35 | print() 36 | statinfo = os.stat(filepath) 37 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 38 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 39 | 40 | def create_graph(): 41 | """Creates a graph from saved GraphDef file and returns a saver.""" 42 | # Creates graph from saved graph_def.pb. 43 | with tf.gfile.FastGFile(os.path.join( 44 | FLAGS.model_dir, pretrain_model_name), 'rb') as f: 45 | graph_def = tf.GraphDef() 46 | graph_def.ParseFromString(f.read()) 47 | _ = tf.import_graph_def(graph_def, name='') 48 | 49 | def main(_): 50 | """Extract features for all images in image_dir. 51 | Args: 52 | FLAGS.image_dir: The directory where all images are stored. 53 | FLAGS.model_dir: The directory where model file is located. 54 | FLAGS.save_dir: File name of the final array 55 | FLAGS.verbose: Verbose frequency (0 for non-verbose) 56 | Returns: 57 | None 58 | """ 59 | if not os.path.exists(FLAGS.image_dir): 60 | print("image_dir does not exit!") 61 | return None 62 | 63 | # download graph if not exists 64 | maybe_download_and_extract() 65 | 66 | # Creates graph from saved GraphDef. 67 | create_graph() 68 | 69 | with tf.Session() as sess: 70 | # Some useful tensors: 71 | # 'softmax:0': A tensor containing the normalized prediction across 72 | # 1000 labels. 73 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 74 | # float description of the image. 75 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 76 | # encoding of the image. 77 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
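# Note: although the comment above mentions the softmax, this script actually
# fetches the 'pool_3:0' activations (a 2048-float descriptor per image); each
# JPEG is fed through the 'DecodeJpeg/contents:0' placeholder and the stacked
# [num_images, 2048] array is saved with np.save at the end.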
78 | final_array = [] 79 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 80 | counter = 0 81 | print("There are total " + str(len(os.listdir(FLAGS.image_dir))) + " images to process.") 82 | for img_idx in range(len(os.listdir(FLAGS.image_dir))): 83 | if FLAGS.verbose > 0: 84 | counter += 1 85 | if counter % FLAGS.verbose == 0: 86 | print("Processing images : {0}.jpg".format(img_idx)) 87 | 88 | temp_path = os.path.join(FLAGS.image_dir, '{0}.jpg'.format(img_idx)) 89 | 90 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 91 | 92 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 93 | predictions = np.squeeze(predictions) 94 | 95 | final_array.append(predictions) 96 | 97 | final_array = np.array(final_array) 98 | 99 | np.save(FLAGS.save_dir, final_array) 100 | 101 | print("\n\ndone. Extracted features saved in: ", FLAGS.save_dir) 102 | 103 | if __name__ == '__main__': 104 | parser = argparse.ArgumentParser() 105 | # classify_image_graph_def.pb: 106 | # Binary representation of the GraphDef protocol buffer. 107 | parser.add_argument( 108 | '--model_dir', 109 | type=str, 110 | default='/tmp/imagenet/', 111 | help="""\ 112 | Path to classify_image_graph_def.pb\ 113 | """ 114 | ) 115 | parser.add_argument( 116 | '--image_dir', 117 | type=str, 118 | default='/home/ubuntu/COCO/dataset/train2014/', 119 | help='Absolute path to directory containing images that are to be extracted.' 120 | ) 121 | parser.add_argument( 122 | '--save_dir', 123 | type=str, 124 | default=save_dir, 125 | help='Absolute path where the final array will be saved.' 126 | ) 127 | parser.add_argument( 128 | '--verbose', 129 | type=int, 130 | default=1000, 131 | help='Verbose of processing steps.' 132 | ) 133 | FLAGS, unparsed = parser.parse_known_args() 134 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | -------------------------------------------------------------------------------- /image2txt model/image_utils.py: -------------------------------------------------------------------------------- 1 | 2 | """utils functions for image preprocessing""" 3 | 4 | import urllib.request, urllib.error, urllib.parse, os, tempfile 5 | 6 | import numpy as np 7 | from scipy.misc import imread 8 | 9 | from matplotlib import image 10 | import matplotlib.pyplot as plt 11 | import numpy as np 12 | 13 | #from fast_layers import conv_forward_fast 14 | 15 | 16 | """ 17 | Utility functions used for viewing and processing images. 18 | """ 19 | 20 | 21 | def blur_image(X): 22 | """ 23 | A very gentle image blurring operation, to be used as a regularizer for image 24 | generation. 
25 | 26 | Inputs: 27 | - X: Image data of shape (N, 3, H, W) 28 | 29 | Returns: 30 | - X_blur: Blurred version of X, of shape (N, 3, H, W) 31 | """ 32 | w_blur = np.zeros((3, 3, 3, 3)) 33 | b_blur = np.zeros(3) 34 | blur_param = {'stride': 1, 'pad': 1} 35 | for i in range(3): 36 | w_blur[i, i] = np.asarray([[1, 2, 1], [2, 188, 2], [1, 2, 1]], dtype=np.float32) 37 | w_blur /= 200.0 38 | return conv_forward_fast(X, w_blur, b_blur, blur_param)[0] 39 | 40 | 41 | def preprocess_image(img, mean_img, mean='image'): 42 | """ 43 | Convert to float, transepose, and subtract mean pixel 44 | 45 | Input: 46 | - img: (H, W, 3) 47 | 48 | Returns: 49 | - (1, 3, H, 3) 50 | """ 51 | if mean == 'image': 52 | mean = mean_img 53 | elif mean == 'pixel': 54 | mean = mean_img.mean(axis=(1, 2), keepdims=True) 55 | elif mean == 'none': 56 | mean = 0 57 | else: 58 | raise ValueError('mean must be image or pixel or none') 59 | return img.astype(np.float32).transpose(2, 0, 1)[None] - mean 60 | 61 | 62 | def deprocess_image(img, mean_img, mean='image', renorm=False): 63 | """ 64 | Add mean pixel, transpose, and convert to uint8 65 | 66 | Input: 67 | - (1, 3, H, W) or (3, H, W) 68 | 69 | Returns: 70 | - (H, W, 3) 71 | """ 72 | if mean == 'image': 73 | mean = mean_img 74 | elif mean == 'pixel': 75 | mean = mean_img.mean(axis=(1, 2), keepdims=True) 76 | elif mean == 'none': 77 | mean = 0 78 | else: 79 | raise ValueError('mean must be image or pixel or none') 80 | if img.ndim == 3: 81 | img = img[None] 82 | img = (img + mean)[0].transpose(1, 2, 0) 83 | if renorm: 84 | low, high = img.min(), img.max() 85 | img = 255.0 * (img - low) / (high - low) 86 | return img.astype(np.uint8) 87 | 88 | 89 | def image_from_url(url): 90 | """ 91 | Read an image from a URL. Returns a numpy array with the pixel data. 92 | We write the image to a temporary file then read it back. Kinda gross. 93 | """ 94 | try: 95 | f = urllib.request.urlopen(url) 96 | _, fname = tempfile.mkstemp() 97 | with open(fname, 'wb') as ff: 98 | ff.write(f.read()) 99 | img = imread(fname) 100 | #os.remove(fname) 101 | return img 102 | except urllib.error.URLError as e: 103 | print('URL Error: ', e.reason, url) 104 | except urllib.error.HTTPError as e: 105 | print('HTTP Error: ', e.code, url) 106 | 107 | def write_text_on_image(image, image_name, caption): 108 | """ 109 | Write caption onto an image 110 | """ 111 | assert isinstance(image, np.ndarray), "input image must be numpy.ndarray!" 
112 | 113 | plt.imshow(image) 114 | plt.axis("off") 115 | plt.title(caption) 116 | plt.savefig(image_name) 117 | plt.close() 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | -------------------------------------------------------------------------------- /image2txt model/inference_on_folder_beam.py: -------------------------------------------------------------------------------- 1 | 2 | """Predict captions on test images using trained model, with beam search method""" 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import tensorflow as tf 9 | 10 | from datetime import datetime 11 | import configuration 12 | from ShowAndTellModel import build_model 13 | from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions 14 | from image_utils import image_from_url, write_text_on_image 15 | import numpy as np 16 | import scipy.misc 17 | from scipy.misc import imread 18 | import pandas as pd 19 | import os 20 | from six.moves import urllib 21 | import sys 22 | import tarfile 23 | import json 24 | import argparse 25 | from caption_generator import * 26 | 27 | model_config = configuration.ModelConfig() 28 | training_config = configuration.TrainingConfig() 29 | 30 | FLAGS = None 31 | verbose = True 32 | mode = 'inference' 33 | 34 | pretrain_model_name = 'classify_image_graph_def.pb' 35 | layer_to_extract = 'pool_3:0' 36 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 37 | 38 | def maybe_download_and_extract(): 39 | """Download and extract model tar file.""" 40 | dest_directory = FLAGS.pretrain_dir 41 | if not os.path.exists(dest_directory): 42 | os.makedirs(dest_directory) 43 | filename = MODEL_URL.split('/')[-1] 44 | filepath = os.path.join(dest_directory, filename) 45 | if not os.path.exists(filepath): 46 | def _progress(count, block_size, total_size): 47 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 48 | filename, float(count * block_size) / float(total_size) * 100.0)) 49 | sys.stdout.flush() 50 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 51 | print() 52 | statinfo = os.stat(filepath) 53 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 54 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 55 | 56 | def create_graph(): 57 | """Creates a graph from saved GraphDef file and returns a saver.""" 58 | # Creates graph from saved graph_def.pb. 59 | with tf.gfile.FastGFile(os.path.join( 60 | FLAGS.pretrain_dir, pretrain_model_name), 'rb') as f: 61 | graph_def = tf.GraphDef() 62 | graph_def.ParseFromString(f.read()) 63 | _ = tf.import_graph_def(graph_def, name='') 64 | 65 | def extract_features(image_dir): 66 | 67 | if not os.path.exists(image_dir): 68 | print("image_dir does not exit!") 69 | return None 70 | 71 | maybe_download_and_extract() 72 | 73 | create_graph() 74 | 75 | with tf.Session() as sess: 76 | # Some useful tensors: 77 | # 'softmax:0': A tensor containing the normalized prediction across 78 | # 1000 labels. 79 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 80 | # float description of the image. 81 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 82 | # encoding of the image. 83 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
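# Features are extracted in os.listdir order and the corresponding file names
# are kept in the all_image_names DataFrame, so row i of final_array matches
# all_image_names['file_name'].values[i] when the captions are written onto
# the images later in main().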
84 | final_array = [] 85 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 86 | counter = 0 87 | print("There are total " + str(len(os.listdir(image_dir))) + " images to process.") 88 | all_image_names = os.listdir(image_dir) 89 | all_image_names = pd.DataFrame({'file_name':all_image_names}) 90 | 91 | for img in all_image_names['file_name'].values: 92 | 93 | temp_path = os.path.join(image_dir, img) 94 | 95 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 96 | 97 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 98 | predictions = np.squeeze(predictions) 99 | 100 | final_array.append(predictions) 101 | 102 | final_array = np.array(final_array) 103 | return final_array, all_image_names 104 | 105 | 106 | def run_inference(sess, features, generator, keep_prob): 107 | 108 | batch_size = features.shape[0] 109 | 110 | final_preds = [] 111 | 112 | for i in range(batch_size): 113 | feature = features[i].reshape(1, -1) 114 | pred = generator.beam_search(sess, feature) 115 | pred = pred[0].sentence 116 | final_preds.append(np.array(pred)) 117 | 118 | return final_preds 119 | 120 | def main(_): 121 | 122 | # load dictionary 123 | data = {} 124 | with open(FLAGS.dict_file, 'r') as f: 125 | dict_data = json.load(f) 126 | for k, v in dict_data.items(): 127 | data[k] = v 128 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 129 | 130 | # extract all features 131 | features, all_image_names = extract_features(FLAGS.test_dir) 132 | 133 | # Build the TensorFlow graph and train it 134 | g = tf.Graph() 135 | with g.as_default(): 136 | num_of_images = len(os.listdir(FLAGS.test_dir)) 137 | print("Inferencing on {} images".format(num_of_images)) 138 | 139 | # Build the model. 140 | model = build_model(model_config, mode, inference_batch = 1) 141 | 142 | # Initialize beam search Caption Generator 143 | generator = CaptionGenerator(model, data['word_to_idx'], max_caption_length = model_config.padded_length-1) 144 | 145 | # run training 146 | init = tf.global_variables_initializer() 147 | with tf.Session() as sess: 148 | 149 | sess.run(init) 150 | 151 | model['saver'].restore(sess, FLAGS.saved_sess) 152 | 153 | print("Model restored! 
Last step run: ", sess.run(model['global_step'])) 154 | 155 | # predictions 156 | final_preds = run_inference(sess, features, generator, 1.0) 157 | captions_pred = [unpack.reshape(-1, 1) for unpack in final_preds] 158 | #captions_pred = np.concatenate(captions_pred, 1) 159 | captions_deco= [] 160 | for cap in captions_pred: 161 | dec = decode_captions(cap.reshape(-1, 1), data['idx_to_word']) 162 | dec = ' '.join(dec) 163 | captions_deco.append(dec) 164 | 165 | # saved the images with captions written on them 166 | if not os.path.exists(FLAGS.results_dir): 167 | os.makedirs(FLAGS.results_dir) 168 | for j in range(len(captions_deco)): 169 | this_image_name = all_image_names['file_name'].values[j] 170 | img_name = os.path.join(FLAGS.results_dir, this_image_name) 171 | img = imread(os.path.join(FLAGS.test_dir, this_image_name)) 172 | write_text_on_image(img, img_name, captions_deco[j]) 173 | print("\ndone.") 174 | 175 | if __name__ == '__main__': 176 | parser = argparse.ArgumentParser() 177 | parser.add_argument( 178 | '--pretrain_dir', 179 | type=str, 180 | default= '/tmp/imagenet/', 181 | help="""\ 182 | Path to pretrained model (if not found, will download from web)\ 183 | """ 184 | ) 185 | parser.add_argument( 186 | '--test_dir', 187 | type=str, 188 | default= '/home/ubuntu/COCO/testImages/', 189 | help="""\ 190 | Path to dir of test images to be predicted\ 191 | """ 192 | ) 193 | parser.add_argument( 194 | '--results_dir', 195 | type=str, 196 | default= '/home/ubuntu/COCO/savedTestImages/', 197 | help="""\ 198 | Path to dir of predicted test images\ 199 | """ 200 | ) 201 | parser.add_argument( 202 | '--saved_sess', 203 | type=str, 204 | default= "/home/ubuntu/COCO/savedSession/model0.ckpt", 205 | help="""\ 206 | Path to saved session\ 207 | """ 208 | ) 209 | parser.add_argument( 210 | '--dict_file', 211 | type=str, 212 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json', 213 | help="""\ 214 | Path to dictionary file\ 215 | """ 216 | ) 217 | FLAGS, unparsed = parser.parse_known_args() 218 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | -------------------------------------------------------------------------------- /image2txt model/inference_on_folder_sample.py: -------------------------------------------------------------------------------- 1 | 2 | """Predict captions on test images using trained model, with greedy sample method""" 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import tensorflow as tf 9 | 10 | from datetime import datetime 11 | import configuration 12 | from ShowAndTellModel import build_model 13 | from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions 14 | from image_utils import image_from_url, write_text_on_image 15 | import numpy as np 16 | import scipy.misc 17 | from scipy.misc import imread 18 | import pandas as pd 19 | import os 20 | from six.moves import urllib 21 | import sys 22 | import tarfile 23 | import json 24 | import argparse 25 | 26 | model_config = configuration.ModelConfig() 27 | training_config = configuration.TrainingConfig() 28 | 29 | FLAGS = None 30 | verbose = True 31 | mode = 'inference' 32 | 33 | pretrain_model_name = 'classify_image_graph_def.pb' 34 | layer_to_extract = 'pool_3:0' 35 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 36 | 37 | def maybe_download_and_extract(): 38 | """Download and extract model tar 
file.""" 39 | dest_directory = FLAGS.pretrain_dir 40 | if not os.path.exists(dest_directory): 41 | os.makedirs(dest_directory) 42 | filename = MODEL_URL.split('/')[-1] 43 | filepath = os.path.join(dest_directory, filename) 44 | if not os.path.exists(filepath): 45 | def _progress(count, block_size, total_size): 46 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 47 | filename, float(count * block_size) / float(total_size) * 100.0)) 48 | sys.stdout.flush() 49 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 50 | print() 51 | statinfo = os.stat(filepath) 52 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 53 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 54 | 55 | def create_graph(): 56 | """Creates a graph from saved GraphDef file and returns a saver.""" 57 | # Creates graph from saved graph_def.pb. 58 | with tf.gfile.FastGFile(os.path.join( 59 | FLAGS.pretrain_dir, pretrain_model_name), 'rb') as f: 60 | graph_def = tf.GraphDef() 61 | graph_def.ParseFromString(f.read()) 62 | _ = tf.import_graph_def(graph_def, name='') 63 | 64 | def extract_features(image_dir): 65 | 66 | if not os.path.exists(image_dir): 67 | print("image_dir does not exit!") 68 | return None 69 | 70 | maybe_download_and_extract() 71 | 72 | create_graph() 73 | 74 | with tf.Session() as sess: 75 | # Some useful tensors: 76 | # 'softmax:0': A tensor containing the normalized prediction across 77 | # 1000 labels. 78 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 79 | # float description of the image. 80 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 81 | # encoding of the image. 82 | # Runs the softmax tensor by feeding the image_data as input to the graph. 83 | final_array = [] 84 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 85 | counter = 0 86 | print("There are total " + str(len(os.listdir(image_dir))) + " images to process.") 87 | all_image_names = os.listdir(image_dir) 88 | all_image_names = pd.DataFrame({'file_name':all_image_names}) 89 | 90 | for img in all_image_names['file_name'].values: 91 | 92 | temp_path = os.path.join(image_dir, img) 93 | 94 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 95 | 96 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 97 | predictions = np.squeeze(predictions) 98 | 99 | final_array.append(predictions) 100 | 101 | final_array = np.array(final_array) 102 | return final_array, all_image_names 103 | 104 | 105 | def step_inference(sess, features, model, keep_prob): 106 | 107 | batch_size = features.shape[0] 108 | 109 | captions_in = np.ones((batch_size, 1)) # token index is one 110 | 111 | state = None 112 | final_preds = [] 113 | current_pred = captions_in 114 | mask = np.zeros((batch_size, model_config.padded_length)) 115 | mask[:, 0] = 1 116 | 117 | # get initial state using image feature 118 | feed_dict = {model['image_feature']: features, 119 | model['keep_prob']: keep_prob} 120 | state = sess.run(model['initial_state'], feed_dict=feed_dict) 121 | 122 | # start to generate sentences 123 | for t in range(model_config.padded_length): 124 | feed_dict={model['input_seqs']: current_pred, 125 | model['initial_state']: state, 126 | model['input_mask']: mask, 127 | model['keep_prob']: keep_prob} 128 | 129 | current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict) 130 | 131 | current_pred = current_pred.reshape(-1, 1) 132 | 133 | final_preds.append(current_pred) 134 | 135 | return 
final_preds 136 | 137 | def main(_): 138 | 139 | # load dictionary 140 | data = {} 141 | with open(FLAGS.dict_file, 'r') as f: 142 | dict_data = json.load(f) 143 | for k, v in dict_data.items(): 144 | data[k] = v 145 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 146 | 147 | # extract all features 148 | features, all_image_names = extract_features(FLAGS.test_dir) 149 | 150 | # Build the TensorFlow graph and train it 151 | g = tf.Graph() 152 | with g.as_default(): 153 | num_of_images = len(os.listdir(FLAGS.test_dir)) 154 | print("Inferencing on {} images".format(num_of_images)) 155 | 156 | # Build the model. 157 | model = build_model(model_config, mode, inference_batch = num_of_images) 158 | 159 | # run training 160 | init = tf.global_variables_initializer() 161 | with tf.Session() as sess: 162 | 163 | sess.run(init) 164 | 165 | model['saver'].restore(sess, FLAGS.saved_sess) 166 | 167 | print("Model restored! Last step run: ", sess.run(model['global_step'])) 168 | 169 | # predictions 170 | final_preds = step_inference(sess, features, model, 1.0) 171 | 172 | captions_pred = [unpack.reshape(-1, 1) for unpack in final_preds] 173 | captions_pred = np.concatenate(captions_pred, 1) 174 | captions_deco = decode_captions(captions_pred, data['idx_to_word']) 175 | 176 | # saved the images with captions written on them 177 | if not os.path.exists(FLAGS.results_dir): 178 | os.makedirs(FLAGS.results_dir) 179 | for j in range(len(captions_deco)): 180 | this_image_name = all_image_names['file_name'].values[j] 181 | img_name = os.path.join(FLAGS.results_dir, this_image_name) 182 | img = imread(os.path.join(FLAGS.test_dir, this_image_name)) 183 | write_text_on_image(img, img_name, captions_deco[j]) 184 | print("\ndone.") 185 | 186 | if __name__ == '__main__': 187 | parser = argparse.ArgumentParser() 188 | parser.add_argument( 189 | '--pretrain_dir', 190 | type=str, 191 | default= '/tmp/imagenet/', 192 | help="""\ 193 | Path to pretrained model (if not found, will download from web)\ 194 | """ 195 | ) 196 | parser.add_argument( 197 | '--test_dir', 198 | type=str, 199 | default= '/home/ubuntu/COCO/testImages/', 200 | help="""\ 201 | Path to dir of test images to be predicted\ 202 | """ 203 | ) 204 | parser.add_argument( 205 | '--results_dir', 206 | type=str, 207 | default= '/home/ubuntu/COCO/savedTestImages/', 208 | help="""\ 209 | Path to dir of predicted test images\ 210 | """ 211 | ) 212 | parser.add_argument( 213 | '--saved_sess', 214 | type=str, 215 | default= "/home/ubuntu/COCO/savedSession/model0.ckpt", 216 | help="""\ 217 | Path to saved session\ 218 | """ 219 | ) 220 | parser.add_argument( 221 | '--dict_file', 222 | type=str, 223 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json', 224 | help="""\ 225 | Path to dictionary file\ 226 | """ 227 | ) 228 | FLAGS, unparsed = parser.parse_known_args() 229 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | -------------------------------------------------------------------------------- /image2txt model/prepare_captions.py: -------------------------------------------------------------------------------- 1 | 2 | """Data preparation for training image captioning model 3 | This script will do the followings: 4 | 5 | 1) Come up with a vocab list by pooling all training and val captions 6 | 2) Convert each word from captions to an integer based on the vocab list 7 | 3) Produce image-name-index mapping, that maps an image to an integer based on its name (e.g. 
COCO_train2014_000000417432.jpg -> 1) 8 | 4) Rename all images using the image-name-index mapping above 9 | """ 10 | 11 | import json 12 | import os 13 | import collections 14 | import tensorflow as tf 15 | import re 16 | import h5py 17 | import argparse 18 | import sys 19 | import numpy as np 20 | import pandas as pd 21 | 22 | FLAGS = None 23 | BUFFER_TOKENS = ['', '', '', ''] 24 | 25 | def _parse_sentence(s): 26 | s = s.replace('.', '') 27 | s = s.replace(',', '') 28 | s = s.replace('"', '') 29 | s = s.replace("'", '') 30 | s = s.lower() 31 | s = re.sub("\s\s+", " ", s) 32 | s = s.split(' ') 33 | return s 34 | 35 | def preprocess_json_files(path_to_dir): 36 | """Extract captions from each file and combine into lists, as well as image ids, and returned as dict""" 37 | assert os.path.exists(path_to_dir), 'Path to directory of files does not exist!' 38 | results = {} 39 | for file in os.listdir(path_to_dir): 40 | if 'captions_train2014' not in file and 'captions_val2014' not in file: 41 | print("Skipping file {}".format(file)) 42 | continue 43 | temp_path = os.path.join(path_to_dir, file) 44 | with open(temp_path, 'r') as f: 45 | data = json.load(f) 46 | caps = data['annotations'] 47 | images = [item['image_id'] for item in caps] 48 | urls = {} 49 | for img in data['images']: 50 | urls[img['id']] = img['flickr_url'] 51 | caps = [_parse_sentence(item['caption']) for item in caps] 52 | results[file] = (caps, images, urls) 53 | del data 54 | # return dict of each file, having list of captions and image_ids 55 | """ 56 | results is a dict of two files (train and val), each of which has a caps list (results[file1][0]) and a images list (results[file1][1]), and urls dict 57 | (results[file1][2]). cap list is a list of sentences(list of words), images list is a list of image ids(integers), and urls dict is a dict mapping each 58 | image id to its url 59 | """ 60 | return results 61 | 62 | def rename_images(dir, image_id_to_idx): 63 | image_dict = pd.read_csv(image_id_to_idx) # cols: image_idx, image_id 64 | image_dict = image_dict.set_index('image_id') 65 | image_dict = image_dict['image_index'].to_dict() 66 | for img_name in os.listdir(dir): 67 | original_img_path = os.path.join(dir, img_name) 68 | temp_num = int(re.split('\.|_', img_name)[-2]) 69 | temp_num = image_dict[temp_num] # convert image id to idx 70 | new_img_path = os.path.join(dir, '{0}.jpg'.format(temp_num)) 71 | os.rename(original_img_path, new_img_path) 72 | print("Renaming images for folder {} done. ".format(dir)) 73 | 74 | def main(_): 75 | 76 | ## get the vocaboluary 77 | list_of_all_words = None 78 | results = preprocess_json_files(FLAGS.file_dir) 79 | 80 | for k, v in results.items(): 81 | if list_of_all_words is None: 82 | list_of_all_words = results[k][0].copy() 83 | else: 84 | list_of_all_words += results[k][0] 85 | list_of_all_words = [item for sublist in list_of_all_words for item in sublist] 86 | counter = collections.Counter(list_of_all_words) 87 | vocab = counter.most_common(FLAGS.total_vocab) 88 | print("\nVocab generated! 
Most, median and least frequent words from the vocab are: \n{0}\n{1}\n{2}\n".format(vocab[0], vocab[int(FLAGS.total_vocab/2)], vocab[-1])) 89 | 90 | ## create word_to_idx, and idx_to_word 91 | vocab = [i[0] for i in vocab] 92 | word_to_idx = {} 93 | idx_to_word = {} 94 | # add in BUFFER_TOKENS 95 | for i in range(len(BUFFER_TOKENS)): 96 | idx_to_word[int(i)] = BUFFER_TOKENS[i] 97 | word_to_idx[BUFFER_TOKENS[i]] = i 98 | 99 | for i in range(len(vocab)): 100 | word_to_idx[vocab[i]] = i + len(BUFFER_TOKENS) 101 | idx_to_word[int(i + len(BUFFER_TOKENS))] = vocab[i] 102 | 103 | word_dict = {} 104 | word_dict['idx_to_word'] = idx_to_word 105 | word_dict['word_to_idx'] = word_to_idx 106 | with open(os.path.join(FLAGS.file_dir, 'coco2014_vocab.json'), 'w') as f: 107 | json.dump(word_dict, f) 108 | 109 | ## convert sentences into encoding/integers 110 | # pad all sentence to length of FLAGS.padding_len - 2 111 | def _convert_sentence_to_numbers(s): 112 | """Convert a sentence s (a list of words) to list of numbers using word_to_idx""" 113 | UNK_IDX = BUFFER_TOKENS.index('') 114 | NULL_IDX = BUFFER_TOKENS.index('') 115 | END_IDX = BUFFER_TOKENS.index('') 116 | s_encoded = [word_to_idx.get(w, UNK_IDX) for w in s] 117 | s_encoded += [END_IDX] 118 | s_encoded += [NULL_IDX] * (FLAGS.padding_len - 1 - len(s_encoded)) 119 | return s_encoded 120 | 121 | h = h5py.File(os.path.join(FLAGS.file_dir,'coco2014_captions.h5'), 'w') 122 | for k, _ in results.items(): 123 | results_to_save = {} 124 | all_captions = results[k][0] # list of lists of words 125 | all_images = results[k][1] 126 | all_urls = results[k][2] 127 | all_captions = [_convert_sentence_to_numbers(s) for s in all_captions] # list of numbers 128 | valid_rows = [i for i in range(len(all_captions)) if len(all_captions[i]) == FLAGS.padding_len-1] 129 | all_captions= [row for row in all_captions if len(row) == FLAGS.padding_len-1] 130 | all_captions = np.array(all_captions) 131 | all_images = np.array(all_images) 132 | all_images = all_images[valid_rows] 133 | assert all_images.shape[0] == all_captions.shape[0], "Processing error! all_captions and all_images diff in length." 
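# At this point each caption row holds word ids followed by the END id and
# NULL padding, with total length padding_len - 1. For example (illustrative,
# actual ids depend on the generated vocab), "a dog runs" with padding_len=25
# becomes [id(a), id(dog), id(runs), END, NULL, ..., NULL] with 24 entries;
# the START id is prepended below to make each row padding_len long.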
134 | # concatenate START and END tokens at two sides 135 | START_TOKEN = BUFFER_TOKENS.index('') 136 | #END_TOKEN = BUFFER_TOKENS.index('') 137 | col_start = np.array([START_TOKEN] * all_images.shape[0]).reshape(-1, 1) 138 | #col_end = np.array([END_TOKEN] * all_images.shape[0]).reshape(-1, 1) 139 | all_captions = np.hstack([col_start, all_captions]) 140 | 141 | ## create dicts that maps image ids to 0,...,total_images - image_idx_to_id, image_id_to_idx 142 | image_ids = set(all_images) 143 | image_idx = list(range(len(image_ids))) 144 | image_id_to_idx = {} 145 | image_idx_to_id = {} 146 | for idx, id in enumerate(image_ids): 147 | image_id_to_idx[id] = idx 148 | image_idx_to_id[idx] = id 149 | all_images_idx = np.array([image_id_to_idx.get(id) for id in all_images]) 150 | 151 | ## save all the data 152 | if 'train' in k: 153 | h.create_dataset('train_captions', data=all_captions) 154 | h.create_dataset('train_image_idx', data=all_images_idx) 155 | df = pd.DataFrame.from_dict(image_id_to_idx, 'index') 156 | df['image_id'] = df.index.values 157 | df.columns = ['image_index', 'image_id'] 158 | df.to_csv(os.path.join(FLAGS.file_dir, 'train_image_id_to_idx.csv'), index = False) 159 | 160 | ## write urls file to local as train2014_urls.txt 161 | with open(os.path.join(FLAGS.file_dir, 'train2014_urls.txt'), 'w') as f: 162 | for idx in range(len(image_idx_to_id)): 163 | this_url = all_urls[image_idx_to_id[idx]] 164 | f.write(this_url + '\n') 165 | 166 | elif 'val' in k: 167 | h.create_dataset('val_captions', data=all_captions) 168 | h.create_dataset('val_image_idx', data=all_images_idx) 169 | df = pd.DataFrame.from_dict(image_id_to_idx, 'index') 170 | df['image_id'] = df.index.values 171 | df.columns = ['image_index', 'image_id'] 172 | df.to_csv(os.path.join(FLAGS.file_dir, 'val_image_id_to_idx.csv'), index = False) 173 | 174 | ## write urls file to local as val2014_urls.txt 175 | with open(os.path.join(FLAGS.file_dir, 'val2014_urls.txt'), 'w') as f: 176 | for idx in range(len(image_idx_to_id)): 177 | this_url = all_urls[image_idx_to_id[idx]] 178 | f.write(this_url + '\n') 179 | else: 180 | print("Strange file name found in dir: {0}, \nit does not belong to train nor val, so it is not able to save results!".format(k)) 181 | 182 | h.close() 183 | print("Data generation done.\n Start renaming images in sequence ...") 184 | 185 | if FLAGS.train_image_dir != '': 186 | train_dict = os.path.join(FLAGS.file_dir, 'train_image_id_to_idx.csv') 187 | rename_images(FLAGS.train_image_dir, train_dict) 188 | 189 | if FLAGS.val_image_dir != '': 190 | val_dict = os.path.join(FLAGS.file_dir, 'val_image_id_to_idx.csv') 191 | rename_images(FLAGS.val_image_dir, val_dict) 192 | 193 | print("all done. ") 194 | 195 | if __name__ == '__main__': 196 | parser = argparse.ArgumentParser() 197 | parser.add_argument( 198 | '--file_dir', 199 | type=str, 200 | #default='C:\\Users\\WAWEIMIN\\Google Drive\\ShowAndTellWeimin\\coco_captioning\\original_captioning', 201 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/', 202 | help="""\ 203 | Path to captions_train2014.json, captions_val2014.json\ 204 | """ 205 | ) 206 | parser.add_argument( 207 | '--total_vocab', 208 | type=int, 209 | default=1000, 210 | help='Total number of vacobulary to use.' 211 | ) 212 | parser.add_argument( 213 | '--padding_len', 214 | type=int, 215 | default=17, 216 | help='Total len of padding the sentence.' 
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--file_dir',
        type=str,
        # default='C:\\Users\\WAWEIMIN\\Google Drive\\ShowAndTellWeimin\\coco_captioning\\original_captioning',
        default='/home/ubuntu/COCO/dataset/COCO_captioning/',
        help="""\
        Path to captions_train2014.json, captions_val2014.json\
        """
    )
    parser.add_argument(
        '--total_vocab',
        type=int,
        default=1000,
        help='Total number of vocabulary words to use.'
    )
    parser.add_argument(
        '--padding_len',
        type=int,
        default=17,
        help='Total length to which each sentence is padded.'
    )
    parser.add_argument(
        '--train_image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/train2014',
        help='Absolute path to training dir containing images that are to be renamed.'
    )
    parser.add_argument(
        '--val_image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/val2014',
        help='Absolute path to val dir containing images that are to be renamed.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
-------------------------------------------------------------------------------- /image2txt model/prepare_glove_matrix.py: --------------------------------------------------------------------------------

"""Create the word-vector initialization matrix using GloVe vectors."""

import json
import numpy as np

TOTAL_VOCAB = 5004
EMBED_DIM = 300
INITIALIZER_SCALE = 0.08

dict_file = '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json'
glove_file = '/home/ubuntu/COCO/GloVe/glove.42B.300d.txt'
save_glove_mat = '/home/ubuntu/COCO/dataset/COCO_captioning/glove_vocab'

# words not found in GloVe keep a small random initialization
glove_matrix = np.random.uniform(-INITIALIZER_SCALE, INITIALIZER_SCALE, (TOTAL_VOCAB, EMBED_DIM))

data = {}
with open(dict_file, 'r') as f:
    dict_data = json.load(f)
    for k, v in dict_data.items():
        data[k] = v
# convert string keys to int
data['idx_to_word'] = {int(k): v for k, v in data['idx_to_word'].items()}
word_to_idx = data['word_to_idx']

total_word_replaced = 0
print_every = 100
with open(glove_file, 'r') as f:
    for line in f:
        line = line.strip()
        word = line.split(' ')[0]
        if word in word_to_idx:
            total_word_replaced += 1
            if total_word_replaced % print_every == 0:
                print(total_word_replaced)

            line = line.split(' ')[1:]
            word_vec = np.array([float(i) for i in line])

            glove_matrix[word_to_idx[word]] = word_vec

        if total_word_replaced == TOTAL_VOCAB - 4:
            break

np.save(save_glove_mat, glove_matrix)
-------------------------------------------------------------------------------- /image2txt model/rename_images_in_sequence.py: --------------------------------------------------------------------------------
import argparse
import os.path, os
import re
import sys
import tarfile

import numpy as np
import pandas as pd
from six.moves import urllib
import tensorflow as tf

FLAGS = None

def main(_):
    image_dict = pd.read_csv(FLAGS.dict_dir)  # cols: image_index, image_id
    image_dict = image_dict.set_index('image_id')
    image_dict = image_dict['image_index'].to_dict()
    for img_name in os.listdir(FLAGS.image_dir):
        original_img_path = os.path.join(FLAGS.image_dir, img_name)
        temp_num = int(re.split('\.|_', img_name)[-2])
        temp_num = image_dict[temp_num]  # convert image id to idx
        new_img_path = os.path.join(FLAGS.image_dir, '{0}.jpg'.format(temp_num))
        os.rename(original_img_path, new_img_path)
    print("done.")

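# Illustrative example of the renaming in main() above (hypothetical id): for a file
# named 'COCO_train2014_000000123456.jpg', re.split('\.|_', img_name)[-2] yields
# '000000123456', int() turns it into the COCO image id 123456, and the id -> index
# mapping loaded from the csv renames the file to '<image_index>.jpg'.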
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--dict_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/COCO_captioning/train_image_id_to_idx.csv',
        help="""\
        dir that contains train_image_id_to_idx.csv or val_image_id_to_idx.csv\
        """
    )
    parser.add_argument(
        '--image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/train2014',
        help='Absolute path to directory containing images that are to be renamed.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
-------------------------------------------------------------------------------- /image2txt model/test.py: --------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from datetime import datetime
import configuration
from ShowAndTellModel import build_model
from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from image_utils import image_from_url, write_text_on_image
import numpy as np
import scipy.misc

model_config = configuration.ModelConfig()
training_config = configuration.TrainingConfig()

verbose = True
mode = 'inference'
directory = '/home/ubuntu/COCO/'

def _step_test(sess, data, batch_size, model, keep_prob):
    """
    Generate captions for a minibatch of validation data (no gradient update).
    """
    # Make a minibatch of validation data
    minibatch = sample_coco_minibatch(data,
                                      batch_size=batch_size,
                                      split='val')
    captions, features, urls = minibatch

    # use the ground-truth <START> column as the first input token
    captions_in = captions[:, 0].reshape(-1, 1)

    state = None
    final_preds = []
    current_pred = captions_in
    mask = np.zeros((batch_size, model_config.padded_length))
    mask[:, 0] = 1

    # get initial state using image feature
    feed_dict = {model['image_feature']: features,
                 model['keep_prob']: keep_prob}
    state = sess.run(model['initial_state'], feed_dict=feed_dict)

    # start to generate sentences
    for t in range(model_config.padded_length):
        feed_dict = {model['input_seqs']: current_pred,
                     model['initial_state']: state,
                     model['input_mask']: mask,
                     model['keep_prob']: keep_prob}

        current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict)

        current_pred = current_pred.reshape(-1, 1)

        final_preds.append(current_pred)

    return final_preds, urls

# load data
data = load_coco_data(base_dir='/home/ubuntu/COCO/dataset/COCO_captioning/')

TOTAL_INFERENCE_STEP = 1
BATCH_SIZE_INFERENCE = 32

# Build the TensorFlow graph and run inference
g = tf.Graph()
with g.as_default():
    # Build the model.
    model = build_model(model_config, mode, inference_batch=BATCH_SIZE_INFERENCE)

    # initialize variables and restore the trained session
    init = tf.global_variables_initializer()
    with tf.Session() as sess:

        sess.run(init)

        model['saver'].restore(sess, directory + "savedSession/model0.ckpt")

        print("Model restored! Last step run: ", sess.run(model['global_step']))
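        # Note on _step_test above: decoding starts from the ground-truth <START>
        # column, the image feature is only used to set the initial LSTM state, and
        # at every step the word ids returned in model['preds'] are fed back in as
        # the next input, producing one token per iteration for padded_length steps.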
        for i in range(TOTAL_INFERENCE_STEP):
            captions_pred, urls = _step_test(sess, data, BATCH_SIZE_INFERENCE, model, 1.0)  # list of (batch_size, 1) arrays, one per time step
            captions_pred = [unpack.reshape(-1, 1) for unpack in captions_pred]
            captions_pred = np.concatenate(captions_pred, 1)

            captions_deco = decode_captions(captions_pred, data['idx_to_word'])

            for j in range(len(captions_deco)):
                img_name = directory + 'image_' + str(j) + '.jpg'
                img = image_from_url(urls[j])
                write_text_on_image(img, img_name, captions_deco[j])
-------------------------------------------------------------------------------- /image2txt model/train.py: --------------------------------------------------------------------------------

"""Train the model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from datetime import datetime
import configuration
from ShowAndTellModel import build_model
from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from image_utils import image_from_url, write_text_on_image
import numpy as np
import os
import sys
import argparse

model_config = configuration.ModelConfig()
training_config = configuration.TrainingConfig()

FLAGS = None
savedModelName = 'model1.0.ckpt'
mode = 'train'

def _run_validation(sess, data, batch_size, model, keep_prob):
    """
    Generate captions for a minibatch of validation data (no gradient update).
    """
    # Make a minibatch of validation data
    minibatch = sample_coco_minibatch(data,
                                      batch_size=batch_size,
                                      split='val')
    captions, features, urls = minibatch

    captions_in = captions[:, 0].reshape(-1, 1)

    state = None
    final_preds = []
    current_pred = captions_in
    mask = np.zeros((batch_size, model_config.padded_length))
    mask[:, 0] = 1

    # get initial state using image feature
    feed_dict = {model['image_feature']: features,
                 model['keep_prob']: keep_prob}
    state = sess.run(model['initial_state'], feed_dict=feed_dict)

    # start to generate sentences
    for t in range(model_config.padded_length):
        feed_dict = {model['input_seqs']: current_pred,
                     model['initial_state']: state,
                     model['input_mask']: mask,
                     model['keep_prob']: keep_prob}

        current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict)

        current_pred = current_pred.reshape(-1, 1)

        final_preds.append(current_pred)

    return final_preds, urls

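# Illustrative example of the input/target shift used in _step below
# (hypothetical indices; assumes <NULL> encodes to 0):
#   caption row:  [<START>, 7, 9, 4, <END>, 0, 0]
#   captions_in:  [<START>, 7, 9, 4, <END>, 0]   (all but the last column)
#   captions_out: [7, 9, 4, <END>, 0, 0]         (all but the first column)
#   mask:         [T, T, T, T, F, F]             (captions_out != <NULL>)
# so the loss is only computed on real words and the <END> token.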
68 | """ 69 | # Make a minibatch of training data 70 | minibatch = sample_coco_minibatch(data, 71 | batch_size=model_config.batch_size, 72 | split='train') 73 | captions, features, urls = minibatch 74 | 75 | captions_in = captions[:, :-1] 76 | captions_out = captions[:, 1:] 77 | 78 | mask = (captions_out != model_config._null) 79 | 80 | _, total_loss_value= sess.run([train_op, model['total_loss']], 81 | feed_dict={model['image_feature']: features, 82 | model['input_seqs']: captions_in, 83 | model['target_seqs']: captions_out, 84 | model['input_mask']: mask, 85 | model['keep_prob']: keep_prob}) 86 | 87 | return total_loss_value 88 | 89 | def main(_): 90 | # load data 91 | data = load_coco_data(FLAGS.data_dir) 92 | 93 | # force padded_length equal to padded_length - 1 94 | # model_config.padded_length = len(data['train_captions'][0]) - 1 95 | 96 | tf.reset_default_graph() 97 | 98 | # Build the TensorFlow graph and train it 99 | g = tf.Graph() 100 | with g.as_default(): 101 | 102 | # Build the model. If FLAGS.glove_vocab is null, we do not initialize the model with word vectors; if not, we initialize with glove vectors 103 | if FLAGS.glove_vocab is '': 104 | model = build_model(model_config, mode=mode) 105 | else: 106 | glove_vocab = np.load(FLAGS.glove_vocab) 107 | model = build_model(model_config, mode=mode, glove_vocab=glove_vocab) 108 | 109 | # Set up the learning rate. 110 | learning_rate_decay_fn = None 111 | learning_rate = tf.constant(training_config.initial_learning_rate) 112 | if training_config.learning_rate_decay_factor > 0: 113 | num_batches_per_epoch = (training_config.num_examples_per_epoch / model_config.batch_size) 114 | decay_steps = int(num_batches_per_epoch * 115 | training_config.num_epochs_per_decay) 116 | 117 | def _learning_rate_decay_fn(learning_rate, global_step): 118 | return tf.train.exponential_decay( 119 | learning_rate, 120 | global_step, 121 | decay_steps=decay_steps, 122 | decay_rate=training_config.learning_rate_decay_factor, 123 | staircase=True) 124 | 125 | learning_rate_decay_fn = _learning_rate_decay_fn 126 | 127 | # Set up the training ops. 
        # Set up the training ops.
        train_op = tf.contrib.layers.optimize_loss(
            loss=model['total_loss'],
            global_step=model['global_step'],
            learning_rate=learning_rate,
            optimizer=training_config.optimizer,
            clip_gradients=training_config.clip_gradients,
            learning_rate_decay_fn=learning_rate_decay_fn)

        # initialize all variables
        init = tf.global_variables_initializer()

        with tf.Session() as sess:
            sess.run(init)

            num_epochs = training_config.total_num_epochs

            num_train = data['train_captions'].shape[0]
            iterations_per_epoch = max(num_train / model_config.batch_size, 1)
            num_iterations = int(num_epochs * iterations_per_epoch)

            # Set up some variables for book-keeping
            epoch = 0
            best_val_acc = 0
            best_params = {}
            loss_history = []
            train_acc_history = []
            val_acc_history = []

            print("\n\nTotal training iter: ", num_iterations, "\n\n")
            time_now = datetime.now()
            for t in range(num_iterations):

                total_loss_value = _step(sess, data, train_op, model, model_config.lstm_dropout_keep_prob)  # run each training step

                loss_history.append(total_loss_value)

                # Print out training loss
                if FLAGS.print_every > 0 and t % FLAGS.print_every == 0:
                    print('(Iteration %d / %d) loss: %f, and time elapsed: %.2f minutes' % (
                        t + 1, num_iterations, float(loss_history[-1]), (datetime.now() - time_now).seconds / 60.0))

                # Print out some image sample results
                if FLAGS.sample_every > 0 and (t + 1) % FLAGS.sample_every == 0:
                    temp_dir = os.path.join(FLAGS.sample_dir, 'temp_dir_{}//'.format(t + 1))
                    if not os.path.exists(temp_dir):
                        os.makedirs(temp_dir)
                    captions_pred, urls = _run_validation(sess, data, model_config.batch_size, model, 1.0)  # list of (batch_size, 1) arrays, one per time step
                    captions_pred = [unpack.reshape(-1, 1) for unpack in captions_pred]
                    captions_pred = np.concatenate(captions_pred, 1)

                    captions_deco = decode_captions(captions_pred, data['idx_to_word'])

                    for j in range(len(captions_deco)):
                        img_name = os.path.join(temp_dir, 'image_{}.jpg'.format(j))
                        img = image_from_url(urls[j])
                        write_text_on_image(img, img_name, captions_deco[j])

                # save the model continuously to avoid interruption
                if FLAGS.saveModel_every > 0 and (t + 1) % FLAGS.saveModel_every == 0:
                    if not os.path.exists(FLAGS.savedSession_dir):
                        os.makedirs(FLAGS.savedSession_dir)
                    checkpoint_name = savedModelName[:-5] + '_checkpoint{}.ckpt'.format(t + 1)
                    save_path = model['saver'].save(sess, os.path.join(FLAGS.savedSession_dir, checkpoint_name))

            if not os.path.exists(FLAGS.savedSession_dir):
                os.makedirs(FLAGS.savedSession_dir)
            save_path = model['saver'].save(sess, os.path.join(FLAGS.savedSession_dir, savedModelName))
            print("done. Model saved at: ", os.path.join(FLAGS.savedSession_dir, savedModelName))

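# The checkpoints written by main() above are what the inference scripts restore
# later; e.g. test.py in this repo calls model['saver'].restore(sess, checkpoint_path)
# before generating captions.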
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--savedSession_dir',
        type=str,
        default='/home/ubuntu/COCO/savedSession/',
        help="""\
        Directory where your created model / session will be saved.\
        """
    )
    parser.add_argument(
        '--data_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/COCO_captioning/',
        help='Directory where all your training and validation data can be found.'
    )
    parser.add_argument(
        '--glove_vocab',
        type=str,
        default='',
        help='Path to the GloVe vocab matrix - glove_vocab.npy - for initialization. Empty for not using it.'
    )
    parser.add_argument(
        '--sample_dir',
        type=str,
        default='/home/ubuntu/COCO/progress_sample/',
        help='Directory where all intermediate samples will be saved.'
    )
    parser.add_argument(
        '--print_every',
        type=int,
        default=50,
        help='Num of steps to print your training loss. 0 for not printing.'
    )
    parser.add_argument(
        '--sample_every',
        type=int,
        default=5000,
        help='Num of steps to generate captions on some validation images. 0 for not sampling.'
    )
    parser.add_argument(
        '--saveModel_every',
        type=int,
        default=5000,
        help='Num of steps to save model checkpoint. 0 for not doing so.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
--------------------------------------------------------------------------------