├── README.md
└── image2txt model
    ├── README.md
    ├── ShowAndTellModel.py
    ├── caption_generator.py
    ├── coco_utils.py
    ├── configuration.py
    ├── extract_features.py
    ├── image_utils.py
    ├── inference_on_folder_beam.py
    ├── inference_on_folder_sample.py
    ├── prepare_captions.py
    ├── prepare_glove_matrix.py
    ├── rename_images_in_sequence.py
    ├── test.py
    └── train.py
/README.md:
--------------------------------------------------------------------------------
 1 | # Image Captioning Model in TensorFlow
 2 | 
 3 | This repo accompanies the blog post at:
 4 | https://vanishingcodes.wordpress.com/2017/03/20/using-tensorflow-to-build-image-to-text-deep-learning-application/
 5 | 
 6 | This repo contains an image captioning model implemented in TensorFlow. The model trains on the MSCOCO data set, which can be downloaded from:
 7 | http://mscoco.org/dataset/#download
 8 | 
 9 | The model is a simplified version of Google's ShowAndTell model: https://github.com/tensorflow/models/tree/master/im2txt#prepare-the-training-data
10 | 
11 | The model first extracts all image features and saves them locally as numpy arrays, then builds and trains the LSTM on those cached features. This saves a lot of training time, at the cost of some accuracy, because the image encoder is not fine-tuned while the LSTM trains, unlike Google's approach.
12 | 
13 | In addition, there are some other differences:
14 | 1. Not using ensembling.
15 | 2. Not using partially guided training.
16 | 3. Not using BLEU score to monitor validation.
17 | 
18 | ## How to run the scripts
19 | 
20 | 1. Download captions_train2014.json and captions_val2014.json from the link above, as well as the train2014 (80K) and val2014 (40K) images. Save the json files into a folder named ../COCO_captioning/, and the train and val images into ../train2014/ and ../val2014/ respectively.
21 | 
22 | 2. Run the command below to prepare the data sets.
23 | ```shell
24 | python prepare_captions.py --file_dir /home/ubuntu/COCO/dataset/COCO_captioning/ --total_vocab 2000 --padding_len 25
25 | ```
26 | 3. Run the commands below to extract features using the pretrained Inception V3 model, and save them to train2014_v3_pool_3.npy and val2014_v3_pool_3.npy.
27 | ```shell
28 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/train2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/train2014_v3_pool_3
29 | 
30 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/val2014 --save_dir /home/ubuntu/COCO/dataset/COCO_captioning/val2014_v3_pool_3
31 | ```
32 | 4. Build and train the model.
33 | ```shell
34 | python train.py --savedSession_dir [dir where your sessions will be saved] --data_dir [dir where all training data are saved, which was generated in step 2] --glove_vocab [path to GloVe word vectors, none if not needed] --sample_dir [dir to save all intermediate validation sample images during training] --print_every [num of steps to print training loss; 0 for not printing] --sample_every [num of steps to generate captions on validation images; 0 for not sampling] --saveModel_every [num of steps to save the model checkpoint; 0 for not saving]
35 | ```
36 | 5.
Run inference on a folder of test images, using beam search.
37 | ```shell
38 | python inference_on_folder_beam.py --pretrain_dir [path to the pretrained Inception V3 model; if not found, it will be downloaded from the web] --test_dir [path to the dir of test images you want to run inference on] --results_dir [path to the dir where the test results will be saved] --saved_sess [saved checkpoint] --dict_file [path to the dictionary file generated in step 2, e.g. coco2014_vocab.json]
39 | ```
40 | 
--------------------------------------------------------------------------------
/image2txt model/README.md:
--------------------------------------------------------------------------------
 1 | ## Steps to run
 2 | 
 3 | 1: Run prepare_captions.py to generate coco2014_captions.h5 (which contains train_captions, train_image_idx, val_captions and val_image_idx), the train and val image URL files, and two dict files - train_image_id_to_idx.csv and val_image_id_to_idx.csv.
 4 | 
 5 | ```shell
 6 | python prepare_captions.py --file_dir /home/ubuntu/COCO/dataset/COCO_captioning/ --total_vocab 2000 --padding_len 25
 7 | ```
 8 | 
 9 | 2: Run rename_images_in_sequence.py to rename all images in both train2014 and val2014, using the two dict files generated in the previous step. You only need to do this step once.
10 | 
11 | ```shell
12 | python rename_images_in_sequence.py --dict_dir /home/ubuntu/COCO/dataset/COCO_captioning/train_image_id_to_idx.csv --image_dir /home/ubuntu/COCO/dataset/train2014
13 | python rename_images_in_sequence.py --dict_dir /home/ubuntu/COCO/dataset/COCO_captioning/val_image_id_to_idx.csv --image_dir /home/ubuntu/COCO/dataset/val2014
14 | ```
15 | 
16 | 3: Extract features. Run extract_features.py to extract Inception V3 features for each of the renamed train2014 and val2014 images in sequence. You only need to do this step once.
17 | 
18 | ```shell
19 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/train2014 --save_dir /home/ubuntu/COCO/dataset/train2014_v3_pool_3 --verbose 500
20 | python extract_features.py --model_dir /tmp/imagenet --image_dir /home/ubuntu/COCO/dataset/val2014 --save_dir /home/ubuntu/COCO/dataset/val2014_v3_pool_3 --verbose 500
21 | ```
--------------------------------------------------------------------------------
/image2txt model/ShowAndTellModel.py:
--------------------------------------------------------------------------------
 1 | 
 2 | """Builds the model.
 3 | 
 4 | Inputs:
 5 | image_feature
 6 | input_seqs
 7 | keep_prob
 8 | target_seqs
 9 | input_mask
10 | Outputs:
11 | total_loss
12 | preds
13 | """
14 | 
15 | import tensorflow as tf
16 | 
17 | def build_model(config, mode, inference_batch = None, glove_vocab = None):
18 | 
19 | """Basic setup.
20 | 
21 | Args:
22 | config: Object containing configuration parameters.
23 | mode: "train" or "inference".
24 | inference_batch: if mode is 'inference', the batch size of the input data must be provided. Otherwise, leave it as None.
25 | glove_vocab: optional matrix of shape [config.vocab_size, config.embedding_size] holding GloVe vectors used to initialize the vocab embeddings. If None, the embeddings are learned from scratch.
26 | """
27 | assert mode in ["train", "inference"]
28 | if mode == 'inference' and inference_batch is None:
29 | raise ValueError("In inference mode, inference_batch must be provided!")
30 | config = config
31 | 
32 | # To match the "Show and Tell" paper we initialize all variables with a
33 | # random uniform initializer.
34 | initializer = tf.random_uniform_initializer( 35 | minval=-config.initializer_scale, 36 | maxval=config.initializer_scale) 37 | 38 | # An int32 Tensor with shape [batch_size, padded_length]. 39 | input_seqs = tf.placeholder(tf.int32, [None, None], name='input_seqs') 40 | 41 | # An int32 Tensor with shape [batch_size, padded_length]. 42 | target_seqs = tf.placeholder(tf.int32, [None, None], name='target_seqs') 43 | 44 | # A float32 Tensor with shape [1] 45 | keep_prob = tf.placeholder(tf.float32, name='keep_prob') 46 | 47 | # An int32 0/1 Tensor with shape [batch_size, padded_length]. 48 | input_mask = tf.placeholder(tf.int32, [None, None], name='input_mask') 49 | 50 | # A float32 Tensor with shape [batch_size, image_feature_size]. 51 | image_feature = tf.placeholder(tf.float32, [None, config.image_feature_size], name='image_feature') 52 | 53 | # A float32 Tensor with shape [batch_size, padded_length, embedding_size]. 54 | seq_embedding = None 55 | 56 | # A float32 scalar Tensor; the total loss for the trainer to optimize. 57 | total_loss = None 58 | 59 | # A float32 Tensor with shape [batch_size * padded_length]. 60 | target_cross_entropy_losses = None 61 | 62 | # A float32 Tensor with shape [batch_size * padded_length]. 63 | target_cross_entropy_loss_weights = None 64 | 65 | # Collection of variables from the inception submodel. 66 | inception_variables = [] 67 | 68 | # Global step Tensor. 69 | global_step = None 70 | 71 | """Sets up the global step Tensor.""" 72 | global_step = tf.Variable( 73 | initial_value=0, 74 | name="global_step", 75 | trainable=False, 76 | collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) 77 | 78 | ### Builds the input sequence embeddings ### 79 | # Inputs: 80 | # self.input_seqs 81 | # Outputs: 82 | # self.seq_embeddings 83 | ############################################ 84 | 85 | with tf.variable_scope("seq_embedding"), tf.device("/cpu:0"): 86 | if glove_vocab is None: 87 | embedding_map = tf.get_variable( 88 | name="map", 89 | shape=[config.vocab_size, config.embedding_size], 90 | initializer=initializer) 91 | else: 92 | init = tf.constant(glove_vocab.astype('float32')) 93 | embedding_map = tf.get_variable( 94 | name="map", 95 | initializer=init) 96 | seq_embedding = tf.nn.embedding_lookup(embedding_map, input_seqs) 97 | 98 | ############ Builds the model ############## 99 | # Inputs: 100 | # self.image_feature 101 | # self.seq_embeddings 102 | # self.target_seqs (training and eval only) 103 | # self.input_mask (training and eval only) 104 | # Outputs: 105 | # self.total_loss (training and eval only) 106 | # self.target_cross_entropy_losses (training and eval only) 107 | # self.target_cross_entropy_loss_weights (training and eval only) 108 | ############################################ 109 | 110 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 111 | num_units=config.num_lstm_units, state_is_tuple=True) 112 | 113 | lstm_cell = tf.nn.rnn_cell.DropoutWrapper( 114 | lstm_cell, 115 | input_keep_prob=keep_prob, 116 | output_keep_prob=keep_prob) 117 | 118 | with tf.variable_scope("lstm", initializer=initializer) as lstm_scope: 119 | 120 | # Feed the image embeddings to set the initial LSTM state. 
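# Note on how the image conditions the LSTM below: the 2048-d Inception v3
# feature is projected to embedding_size by a fully connected layer, fed
# through the LSTM once starting from the zero state, and the resulting state
# is used as the initial_state for the word sequence. The image is therefore
# only "seen" at step 0, as in the Show and Tell setup.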
121 | if mode == 'train': 122 | zero_state = lstm_cell.zero_state( 123 | batch_size=config.batch_size, dtype=tf.float32) 124 | elif mode == 'inference': 125 | zero_state = lstm_cell.zero_state( 126 | batch_size=inference_batch, dtype=tf.float32) 127 | 128 | with tf.variable_scope('image_embeddings'): 129 | image_embeddings = tf.contrib.layers.fully_connected( 130 | inputs=image_feature, 131 | num_outputs=config.embedding_size, 132 | activation_fn=None, 133 | weights_initializer=initializer, 134 | biases_initializer=None) 135 | 136 | _, initial_state = lstm_cell(image_embeddings, zero_state) 137 | 138 | # Allow the LSTM variables to be reused. 139 | lstm_scope.reuse_variables() 140 | 141 | # Run the batch of sequence embeddings through the LSTM. 142 | sequence_length = tf.reduce_sum(input_mask, 1) 143 | lstm_outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell, 144 | inputs=seq_embedding, 145 | sequence_length=sequence_length, 146 | initial_state=initial_state, 147 | dtype=tf.float32, 148 | scope=lstm_scope) 149 | 150 | # Stack batches vertically. 151 | lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size]) # output_size == 256 152 | 153 | with tf.variable_scope('logits'): 154 | W = tf.get_variable('W', [lstm_cell.output_size, config.vocab_size], initializer=initializer) 155 | b = tf.get_variable('b', [config.vocab_size], initializer=tf.constant_initializer(0.0)) 156 | 157 | logits = tf.matmul(lstm_outputs, W) + b # logits: [batch_size * padded_length, config.vocab_size] 158 | 159 | ###### for inference & validation only ####### 160 | softmax = tf.nn.softmax(logits) 161 | preds = tf.argmax(softmax, 1) 162 | ############################################## 163 | 164 | # for training only below 165 | targets = tf.reshape(target_seqs, [-1]) 166 | weights = tf.to_float(tf.reshape(input_mask, [-1])) 167 | 168 | # Compute losses. 169 | losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, 170 | logits=logits) 171 | batch_loss = tf.div(tf.reduce_sum(tf.multiply(losses, weights)), 172 | tf.reduce_sum(weights), 173 | name="batch_loss") 174 | tf.contrib.losses.add_loss(batch_loss) 175 | total_loss = tf.contrib.losses.get_total_loss() 176 | 177 | # target_cross_entropy_losses = losses # Used in evaluation. 178 | # target_cross_entropy_loss_weights = weights # Used in evaluation. 179 | 180 | return dict( 181 | total_loss = total_loss, 182 | global_step = global_step, 183 | image_feature = image_feature, 184 | input_mask = input_mask, 185 | target_seqs = target_seqs, 186 | input_seqs = input_seqs, 187 | final_state = final_state, 188 | initial_state = initial_state, 189 | softmax = softmax, 190 | preds = preds, 191 | keep_prob = keep_prob, 192 | saver = tf.train.Saver() 193 | ) 194 | 195 | -------------------------------------------------------------------------------- /image2txt model/caption_generator.py: -------------------------------------------------------------------------------- 1 | """Class for generating captions from an image-to-text model. 2 | This is based on Google's https://github.com/tensorflow/models/blob/master/im2txt/im2txt/inference_utils/caption_generator.py 3 | """ 4 | 5 | from __future__ import absolute_import 6 | from __future__ import division 7 | from __future__ import print_function 8 | 9 | import heapq 10 | import math 11 | 12 | import numpy as np 13 | 14 | class Caption(object): 15 | """Represents a complete or partial caption.""" 16 | 17 | def __init__(self, sentence, state, logprob, score, metadata=None): 18 | """Initializes the Caption. 
19 | Args: 20 | sentence: List of word ids in the caption. 21 | state: Model state after generating the previous word. 22 | logprob: Log-probability of the caption. 23 | score: Score of the caption. 24 | metadata: Optional metadata associated with the partial sentence. If not 25 | None, a list of strings with the same length as 'sentence'. 26 | """ 27 | self.sentence = sentence 28 | self.state = state 29 | self.logprob = logprob 30 | self.score = score 31 | self.metadata = metadata 32 | 33 | def __cmp__(self, other): 34 | """Compares Captions by score.""" 35 | assert isinstance(other, Caption) 36 | if self.score == other.score: 37 | return 0 38 | elif self.score < other.score: 39 | return -1 40 | else: 41 | return 1 42 | 43 | # For Python 3 compatibility (__cmp__ is deprecated). 44 | def __lt__(self, other): 45 | assert isinstance(other, Caption) 46 | return self.score < other.score 47 | 48 | # Also for Python 3 compatibility. 49 | def __eq__(self, other): 50 | assert isinstance(other, Caption) 51 | return self.score == other.score 52 | 53 | 54 | class TopN(object): 55 | """Maintains the top n elements of an incrementally provided set.""" 56 | 57 | def __init__(self, n): 58 | self._n = n 59 | self._data = [] 60 | 61 | def size(self): 62 | assert self._data is not None 63 | return len(self._data) 64 | 65 | def push(self, x): 66 | """Pushes a new element.""" 67 | assert self._data is not None 68 | if len(self._data) < self._n: 69 | heapq.heappush(self._data, x) 70 | else: 71 | heapq.heappushpop(self._data, x) 72 | 73 | def extract(self, sort=False): 74 | """Extracts all elements from the TopN. This is a destructive operation. 75 | The only method that can be called immediately after extract() is reset(). 76 | Args: 77 | sort: Whether to return the elements in descending sorted order. 78 | Returns: 79 | A list of data; the top n elements provided to the set. 80 | """ 81 | assert self._data is not None 82 | data = self._data 83 | self._data = None 84 | if sort: 85 | data.sort(reverse=True) 86 | return data 87 | 88 | def reset(self): 89 | """Returns the TopN to an empty state.""" 90 | self._data = [] 91 | 92 | 93 | class CaptionGenerator(object): 94 | """Class to generate captions from an image-to-text model.""" 95 | 96 | def __init__(self, 97 | model, 98 | vocab, 99 | beam_size=3, 100 | max_caption_length=24, 101 | length_normalization_factor=0.0): 102 | """Initializes the generator. 103 | Args: 104 | model: Object encapsulating a trained image-to-text model. Must have 105 | methods feed_image() and inference_step(). For example, an instance of 106 | InferenceWrapperBase. 107 | vocab: A Vocabulary object. 108 | beam_size: Beam size to use when generating captions. 109 | max_caption_length: The maximum caption length before stopping the search. 110 | length_normalization_factor: If != 0, a number x such that captions are 111 | scored by logprob/length^x, rather than logprob. This changes the 112 | relative scores of captions depending on their lengths. For example, if 113 | x > 0 then longer captions will be favored. 
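For example (illustrative numbers only): with beam_size=3 the generator keeps
the 3 highest-scoring partial captions at every step, and with
length_normalization_factor=0.7 a completed caption of 12 tokens would be
re-scored as logprob / 12**0.7.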
114 | """ 115 | self.vocab = vocab 116 | self.model = model 117 | 118 | self.beam_size = beam_size 119 | self.max_caption_length = max_caption_length 120 | self.length_normalization_factor = length_normalization_factor 121 | 122 | def _feed_image(self, sess, feature): 123 | # get initial state using image feature 124 | feed_dict = {self.model['image_feature']: feature, 125 | self.model['keep_prob']: 1.0} 126 | state = sess.run(self.model['initial_state'], feed_dict=feed_dict) 127 | return state 128 | 129 | def _inference_step(self, sess, input_feed_list, state_feed_list, max_caption_length): 130 | 131 | mask = np.zeros((1, max_caption_length)) 132 | mask[:, 0] = 1 133 | softmax_outputs = [] 134 | new_state_outputs = [] 135 | 136 | for input, state in zip(input_feed_list, state_feed_list): 137 | feed_dict={self.model['input_seqs']: input, 138 | self.model['initial_state']: state, 139 | self.model['input_mask']: mask, 140 | self.model['keep_prob']: 1.0} 141 | softmax, new_state = sess.run([self.model['softmax'], self.model['final_state']], feed_dict=feed_dict) 142 | softmax_outputs.append(softmax) 143 | new_state_outputs.append(new_state) 144 | 145 | return softmax_outputs, new_state_outputs, None 146 | 147 | def beam_search(self, sess, feature): 148 | """Runs beam search caption generation on a single image. 149 | Args: 150 | sess: TensorFlow Session object. 151 | feature: extracted V3 feature of one image. 152 | Returns: 153 | A list of Caption sorted by descending score. 154 | """ 155 | # Feed in the image to get the initial state. 156 | initial_state = self._feed_image(sess, feature) 157 | 158 | initial_beam = Caption( 159 | sentence=[self.vocab['']], 160 | state=initial_state, 161 | logprob=0.0, 162 | score=0.0, 163 | metadata=[""]) 164 | partial_captions = TopN(self.beam_size) 165 | partial_captions.push(initial_beam) 166 | complete_captions = TopN(self.beam_size) 167 | 168 | # Run beam search. 169 | for _ in range(self.max_caption_length - 1): 170 | partial_captions_list = partial_captions.extract() 171 | partial_captions.reset() 172 | input_feed = [np.array([c.sentence[-1]]).reshape(1, 1) for c in partial_captions_list] 173 | state_feed = [c.state for c in partial_captions_list] 174 | 175 | softmax, new_states, metadata = self._inference_step(sess, 176 | input_feed, 177 | state_feed, 178 | self.max_caption_length) 179 | 180 | for i, partial_caption in enumerate(partial_captions_list): 181 | word_probabilities = softmax[i][0] 182 | state = new_states[i] 183 | # For this partial caption, get the beam_size most probable next words. 184 | words_and_probs = list(enumerate(word_probabilities)) 185 | words_and_probs.sort(key=lambda x: -x[1]) 186 | words_and_probs = words_and_probs[0:self.beam_size] 187 | # Each next word gives a new partial caption. 188 | for w, p in words_and_probs: 189 | if p < 1e-12: 190 | continue # Avoid log(0). 
191 | sentence = partial_caption.sentence + [w] 192 | logprob = partial_caption.logprob + math.log(p) 193 | score = logprob 194 | if metadata: 195 | metadata_list = partial_caption.metadata + [metadata[i]] 196 | else: 197 | metadata_list = None 198 | if w == self.vocab['']: 199 | if self.length_normalization_factor > 0: 200 | score /= len(sentence)**self.length_normalization_factor 201 | beam = Caption(sentence, state, logprob, score, metadata_list) 202 | complete_captions.push(beam) 203 | else: 204 | beam = Caption(sentence, state, logprob, score, metadata_list) 205 | partial_captions.push(beam) 206 | if partial_captions.size() == 0: 207 | # We have run out of partial candidates; happens when beam_size = 1. 208 | break 209 | 210 | # If we have no complete captions then fall back to the partial captions. 211 | # But never output a mixture of complete and partial captions because a 212 | # partial caption could have a higher score than all the complete captions. 213 | if not complete_captions.size(): 214 | complete_captions = partial_captions 215 | 216 | return complete_captions.extract(sort=True) -------------------------------------------------------------------------------- /image2txt model/coco_utils.py: -------------------------------------------------------------------------------- 1 | 2 | """Util functions for handling caption data""" 3 | 4 | import os, json 5 | import numpy as np 6 | import h5py 7 | 8 | 9 | def load_coco_data(base_dir='/home/ubuntu/COCO/dataset/COCO_captioning/', 10 | max_train=None): 11 | data = {} 12 | 13 | # loading train&val captions, and train&val image index 14 | caption_file = os.path.join(base_dir, 'coco2014_captions.h5') 15 | with h5py.File(caption_file, 'r') as f: # keys are: train_captions, val_captions, train_image_idxs, val_image_idxs 16 | for k, v in f.items(): 17 | data[k] = np.asarray(v) 18 | 19 | train_feat_file = os.path.join(base_dir, 'train2014_v3_pool_3.npy') 20 | data['train_features'] = np.load(train_feat_file) 21 | 22 | val_feat_file = os.path.join(base_dir, 'val2014_v3_pool_3.npy') 23 | data['val_features'] = np.load(val_feat_file) 24 | 25 | dict_file = os.path.join(base_dir, 'coco2014_vocab.json') 26 | with open(dict_file, 'r') as f: 27 | dict_data = json.load(f) 28 | for k, v in dict_data.items(): 29 | data[k] = v 30 | # convert string to int for the keys 31 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 32 | 33 | train_url_file = os.path.join(base_dir, 'train2014_urls.txt') 34 | with open(train_url_file, 'r') as f: 35 | train_urls = np.asarray([line.strip() for line in f]) 36 | data['train_urls'] = train_urls 37 | 38 | val_url_file = os.path.join(base_dir, 'val2014_urls.txt') 39 | with open(val_url_file, 'r') as f: 40 | val_urls = np.asarray([line.strip() for line in f]) 41 | data['val_urls'] = val_urls 42 | 43 | # Maybe subsample the training data 44 | if max_train is not None: 45 | num_train = data['train_captions'].shape[0] 46 | mask = np.random.randint(num_train, size=max_train) 47 | data['train_captions'] = data['train_captions'][mask] 48 | data['train_image_idx'] = data['train_image_idx'][mask] 49 | 50 | return data 51 | 52 | 53 | def decode_captions(captions, idx_to_word): 54 | singleton = False 55 | if captions.ndim == 1: 56 | singleton = True 57 | captions = captions[None] 58 | decoded = [] 59 | N, T = captions.shape 60 | for i in range(N): 61 | words = [] 62 | for t in range(T): 63 | word = idx_to_word[captions[i, t]] 64 | if word != '': 65 | words.append(word) 66 | if word == '': 67 | break 68 | 
decoded.append(' '.join(words)) 69 | if singleton: 70 | decoded = decoded[0] 71 | return decoded 72 | 73 | 74 | def sample_coco_minibatch(data, batch_size=100, split='train'): 75 | split_size = data['%s_captions' % split].shape[0] 76 | mask = np.random.choice(split_size, batch_size) 77 | captions = data['%s_captions' % split][mask] 78 | image_idxs = data['%s_image_idx' % split][mask] 79 | image_features = data['%s_features' % split][image_idxs] 80 | urls = data['%s_urls' % split][image_idxs] 81 | return captions, image_features, urls 82 | 83 | -------------------------------------------------------------------------------- /image2txt model/configuration.py: -------------------------------------------------------------------------------- 1 | 2 | """Image-to-text model and training configurations.""" 3 | 4 | class ModelConfig(object): 5 | """Wrapper class for model hyperparameters.""" 6 | 7 | def __init__(self): 8 | """Sets the default model hyperparameters.""" 9 | 10 | # Number of unique words in the vocab (plus 4, for , , , ) 11 | # This one depends on your chosen vocab size in the preprocessing steps. Normally 12 | # 5,000 might be a good choice since top 5,000 have covered most of the common words 13 | # appear in the data set. The rest not included in the vocab will be used as 14 | self.vocab_size = 5004 15 | 16 | # Batch size. 17 | self.batch_size = 32 18 | 19 | # Scale used to initialize model variables. 20 | self.initializer_scale = 0.08 21 | 22 | # LSTM input and output dimensionality, respectively. 23 | self.image_feature_size = 2048 # equal to output layer size from inception v3 24 | self.num_lstm_units = 512 25 | self.embedding_size = 512 26 | 27 | # If < 1.0, the dropout keep probability applied to LSTM variables. 28 | self.lstm_dropout_keep_prob = 0.7 29 | 30 | # length of each caption after padding 31 | self.padded_length = 25 32 | 33 | # special wording 34 | self._null = 0 35 | self._start = 1 36 | self._end = 2 37 | 38 | class TrainingConfig(object): 39 | """Wrapper class for training hyperparameters.""" 40 | 41 | def __init__(self): 42 | """Sets the default training hyperparameters.""" 43 | # Number of examples per epoch of training data. 44 | #self.num_examples_per_epoch = 586363 45 | self.num_examples_per_epoch = 400000 46 | 47 | # Optimizer for training the model. 48 | self.optimizer = "SGD" # "SGD" 49 | 50 | # Learning rate for the initial phase of training. 51 | self.initial_learning_rate = 2.0 52 | self.learning_rate_decay_factor = 0.5 53 | self.num_epochs_per_decay = 8.0 54 | 55 | # If not None, clip gradients to this value. 
56 | self.clip_gradients = 5.0 57 | 58 | self.total_num_epochs = 5 59 | -------------------------------------------------------------------------------- /image2txt model/extract_features.py: -------------------------------------------------------------------------------- 1 | 2 | """Extraction image features using pretrained Inception V3, and save as numpy arrays in local""" 3 | 4 | import argparse 5 | import os.path, os 6 | import re 7 | import sys 8 | import tarfile 9 | 10 | import numpy as np 11 | from six.moves import urllib 12 | import tensorflow as tf 13 | 14 | FLAGS = None 15 | pretrain_model_name = 'classify_image_graph_def.pb' 16 | layer_to_extract = 'pool_3:0' 17 | save_dir = '/home/ubuntu/COCO/dataset/train2014_v3_pool_3' 18 | 19 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 20 | #MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz' 21 | 22 | def maybe_download_and_extract(): 23 | """Download and extract model tar file.""" 24 | dest_directory = FLAGS.model_dir 25 | if not os.path.exists(dest_directory): 26 | os.makedirs(dest_directory) 27 | filename = MODEL_URL.split('/')[-1] 28 | filepath = os.path.join(dest_directory, filename) 29 | if not os.path.exists(filepath): 30 | def _progress(count, block_size, total_size): 31 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 32 | filename, float(count * block_size) / float(total_size) * 100.0)) 33 | sys.stdout.flush() 34 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 35 | print() 36 | statinfo = os.stat(filepath) 37 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 38 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 39 | 40 | def create_graph(): 41 | """Creates a graph from saved GraphDef file and returns a saver.""" 42 | # Creates graph from saved graph_def.pb. 43 | with tf.gfile.FastGFile(os.path.join( 44 | FLAGS.model_dir, pretrain_model_name), 'rb') as f: 45 | graph_def = tf.GraphDef() 46 | graph_def.ParseFromString(f.read()) 47 | _ = tf.import_graph_def(graph_def, name='') 48 | 49 | def main(_): 50 | """Extract features for all images in image_dir. 51 | Args: 52 | FLAGS.image_dir: The directory where all images are stored. 53 | FLAGS.model_dir: The directory where model file is located. 54 | FLAGS.save_dir: File name of the final array 55 | FLAGS.verbose: Verbose frequency (0 for non-verbose) 56 | Returns: 57 | None 58 | """ 59 | if not os.path.exists(FLAGS.image_dir): 60 | print("image_dir does not exit!") 61 | return None 62 | 63 | # download graph if not exists 64 | maybe_download_and_extract() 65 | 66 | # Creates graph from saved GraphDef. 67 | create_graph() 68 | 69 | with tf.Session() as sess: 70 | # Some useful tensors: 71 | # 'softmax:0': A tensor containing the normalized prediction across 72 | # 1000 labels. 73 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 74 | # float description of the image. 75 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 76 | # encoding of the image. 77 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
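# Note: although the comment above mentions the softmax, this script actually
# fetches the 'pool_3:0' activations (a 2048-float descriptor per image); each
# JPEG is fed through the 'DecodeJpeg/contents:0' placeholder and the stacked
# [num_images, 2048] array is saved with np.save at the end.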
78 | final_array = [] 79 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 80 | counter = 0 81 | print("There are total " + str(len(os.listdir(FLAGS.image_dir))) + " images to process.") 82 | for img_idx in range(len(os.listdir(FLAGS.image_dir))): 83 | if FLAGS.verbose > 0: 84 | counter += 1 85 | if counter % FLAGS.verbose == 0: 86 | print("Processing images : {0}.jpg".format(img_idx)) 87 | 88 | temp_path = os.path.join(FLAGS.image_dir, '{0}.jpg'.format(img_idx)) 89 | 90 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 91 | 92 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 93 | predictions = np.squeeze(predictions) 94 | 95 | final_array.append(predictions) 96 | 97 | final_array = np.array(final_array) 98 | 99 | np.save(FLAGS.save_dir, final_array) 100 | 101 | print("\n\ndone. Extracted features saved in: ", FLAGS.save_dir) 102 | 103 | if __name__ == '__main__': 104 | parser = argparse.ArgumentParser() 105 | # classify_image_graph_def.pb: 106 | # Binary representation of the GraphDef protocol buffer. 107 | parser.add_argument( 108 | '--model_dir', 109 | type=str, 110 | default='/tmp/imagenet/', 111 | help="""\ 112 | Path to classify_image_graph_def.pb\ 113 | """ 114 | ) 115 | parser.add_argument( 116 | '--image_dir', 117 | type=str, 118 | default='/home/ubuntu/COCO/dataset/train2014/', 119 | help='Absolute path to directory containing images that are to be extracted.' 120 | ) 121 | parser.add_argument( 122 | '--save_dir', 123 | type=str, 124 | default=save_dir, 125 | help='Absolute path where the final array will be saved.' 126 | ) 127 | parser.add_argument( 128 | '--verbose', 129 | type=int, 130 | default=1000, 131 | help='Verbose of processing steps.' 132 | ) 133 | FLAGS, unparsed = parser.parse_known_args() 134 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | -------------------------------------------------------------------------------- /image2txt model/image_utils.py: -------------------------------------------------------------------------------- 1 | 2 | """utils functions for image preprocessing""" 3 | 4 | import urllib.request, urllib.error, urllib.parse, os, tempfile 5 | 6 | import numpy as np 7 | from scipy.misc import imread 8 | 9 | from matplotlib import image 10 | import matplotlib.pyplot as plt 11 | import numpy as np 12 | 13 | #from fast_layers import conv_forward_fast 14 | 15 | 16 | """ 17 | Utility functions used for viewing and processing images. 18 | """ 19 | 20 | 21 | def blur_image(X): 22 | """ 23 | A very gentle image blurring operation, to be used as a regularizer for image 24 | generation. 
25 | 26 | Inputs: 27 | - X: Image data of shape (N, 3, H, W) 28 | 29 | Returns: 30 | - X_blur: Blurred version of X, of shape (N, 3, H, W) 31 | """ 32 | w_blur = np.zeros((3, 3, 3, 3)) 33 | b_blur = np.zeros(3) 34 | blur_param = {'stride': 1, 'pad': 1} 35 | for i in range(3): 36 | w_blur[i, i] = np.asarray([[1, 2, 1], [2, 188, 2], [1, 2, 1]], dtype=np.float32) 37 | w_blur /= 200.0 38 | return conv_forward_fast(X, w_blur, b_blur, blur_param)[0] 39 | 40 | 41 | def preprocess_image(img, mean_img, mean='image'): 42 | """ 43 | Convert to float, transepose, and subtract mean pixel 44 | 45 | Input: 46 | - img: (H, W, 3) 47 | 48 | Returns: 49 | - (1, 3, H, 3) 50 | """ 51 | if mean == 'image': 52 | mean = mean_img 53 | elif mean == 'pixel': 54 | mean = mean_img.mean(axis=(1, 2), keepdims=True) 55 | elif mean == 'none': 56 | mean = 0 57 | else: 58 | raise ValueError('mean must be image or pixel or none') 59 | return img.astype(np.float32).transpose(2, 0, 1)[None] - mean 60 | 61 | 62 | def deprocess_image(img, mean_img, mean='image', renorm=False): 63 | """ 64 | Add mean pixel, transpose, and convert to uint8 65 | 66 | Input: 67 | - (1, 3, H, W) or (3, H, W) 68 | 69 | Returns: 70 | - (H, W, 3) 71 | """ 72 | if mean == 'image': 73 | mean = mean_img 74 | elif mean == 'pixel': 75 | mean = mean_img.mean(axis=(1, 2), keepdims=True) 76 | elif mean == 'none': 77 | mean = 0 78 | else: 79 | raise ValueError('mean must be image or pixel or none') 80 | if img.ndim == 3: 81 | img = img[None] 82 | img = (img + mean)[0].transpose(1, 2, 0) 83 | if renorm: 84 | low, high = img.min(), img.max() 85 | img = 255.0 * (img - low) / (high - low) 86 | return img.astype(np.uint8) 87 | 88 | 89 | def image_from_url(url): 90 | """ 91 | Read an image from a URL. Returns a numpy array with the pixel data. 92 | We write the image to a temporary file then read it back. Kinda gross. 93 | """ 94 | try: 95 | f = urllib.request.urlopen(url) 96 | _, fname = tempfile.mkstemp() 97 | with open(fname, 'wb') as ff: 98 | ff.write(f.read()) 99 | img = imread(fname) 100 | #os.remove(fname) 101 | return img 102 | except urllib.error.URLError as e: 103 | print('URL Error: ', e.reason, url) 104 | except urllib.error.HTTPError as e: 105 | print('HTTP Error: ', e.code, url) 106 | 107 | def write_text_on_image(image, image_name, caption): 108 | """ 109 | Write caption onto an image 110 | """ 111 | assert isinstance(image, np.ndarray), "input image must be numpy.ndarray!" 
112 | 113 | plt.imshow(image) 114 | plt.axis("off") 115 | plt.title(caption) 116 | plt.savefig(image_name) 117 | plt.close() 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | -------------------------------------------------------------------------------- /image2txt model/inference_on_folder_beam.py: -------------------------------------------------------------------------------- 1 | 2 | """Predict captions on test images using trained model, with beam search method""" 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import tensorflow as tf 9 | 10 | from datetime import datetime 11 | import configuration 12 | from ShowAndTellModel import build_model 13 | from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions 14 | from image_utils import image_from_url, write_text_on_image 15 | import numpy as np 16 | import scipy.misc 17 | from scipy.misc import imread 18 | import pandas as pd 19 | import os 20 | from six.moves import urllib 21 | import sys 22 | import tarfile 23 | import json 24 | import argparse 25 | from caption_generator import * 26 | 27 | model_config = configuration.ModelConfig() 28 | training_config = configuration.TrainingConfig() 29 | 30 | FLAGS = None 31 | verbose = True 32 | mode = 'inference' 33 | 34 | pretrain_model_name = 'classify_image_graph_def.pb' 35 | layer_to_extract = 'pool_3:0' 36 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 37 | 38 | def maybe_download_and_extract(): 39 | """Download and extract model tar file.""" 40 | dest_directory = FLAGS.pretrain_dir 41 | if not os.path.exists(dest_directory): 42 | os.makedirs(dest_directory) 43 | filename = MODEL_URL.split('/')[-1] 44 | filepath = os.path.join(dest_directory, filename) 45 | if not os.path.exists(filepath): 46 | def _progress(count, block_size, total_size): 47 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 48 | filename, float(count * block_size) / float(total_size) * 100.0)) 49 | sys.stdout.flush() 50 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 51 | print() 52 | statinfo = os.stat(filepath) 53 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 54 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 55 | 56 | def create_graph(): 57 | """Creates a graph from saved GraphDef file and returns a saver.""" 58 | # Creates graph from saved graph_def.pb. 59 | with tf.gfile.FastGFile(os.path.join( 60 | FLAGS.pretrain_dir, pretrain_model_name), 'rb') as f: 61 | graph_def = tf.GraphDef() 62 | graph_def.ParseFromString(f.read()) 63 | _ = tf.import_graph_def(graph_def, name='') 64 | 65 | def extract_features(image_dir): 66 | 67 | if not os.path.exists(image_dir): 68 | print("image_dir does not exit!") 69 | return None 70 | 71 | maybe_download_and_extract() 72 | 73 | create_graph() 74 | 75 | with tf.Session() as sess: 76 | # Some useful tensors: 77 | # 'softmax:0': A tensor containing the normalized prediction across 78 | # 1000 labels. 79 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 80 | # float description of the image. 81 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 82 | # encoding of the image. 83 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
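# Features are extracted in os.listdir order and the corresponding file names
# are kept in the all_image_names DataFrame, so row i of final_array matches
# all_image_names['file_name'].values[i] when the captions are written onto
# the images later in main().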
84 | final_array = [] 85 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 86 | counter = 0 87 | print("There are total " + str(len(os.listdir(image_dir))) + " images to process.") 88 | all_image_names = os.listdir(image_dir) 89 | all_image_names = pd.DataFrame({'file_name':all_image_names}) 90 | 91 | for img in all_image_names['file_name'].values: 92 | 93 | temp_path = os.path.join(image_dir, img) 94 | 95 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 96 | 97 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 98 | predictions = np.squeeze(predictions) 99 | 100 | final_array.append(predictions) 101 | 102 | final_array = np.array(final_array) 103 | return final_array, all_image_names 104 | 105 | 106 | def run_inference(sess, features, generator, keep_prob): 107 | 108 | batch_size = features.shape[0] 109 | 110 | final_preds = [] 111 | 112 | for i in range(batch_size): 113 | feature = features[i].reshape(1, -1) 114 | pred = generator.beam_search(sess, feature) 115 | pred = pred[0].sentence 116 | final_preds.append(np.array(pred)) 117 | 118 | return final_preds 119 | 120 | def main(_): 121 | 122 | # load dictionary 123 | data = {} 124 | with open(FLAGS.dict_file, 'r') as f: 125 | dict_data = json.load(f) 126 | for k, v in dict_data.items(): 127 | data[k] = v 128 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 129 | 130 | # extract all features 131 | features, all_image_names = extract_features(FLAGS.test_dir) 132 | 133 | # Build the TensorFlow graph and train it 134 | g = tf.Graph() 135 | with g.as_default(): 136 | num_of_images = len(os.listdir(FLAGS.test_dir)) 137 | print("Inferencing on {} images".format(num_of_images)) 138 | 139 | # Build the model. 140 | model = build_model(model_config, mode, inference_batch = 1) 141 | 142 | # Initialize beam search Caption Generator 143 | generator = CaptionGenerator(model, data['word_to_idx'], max_caption_length = model_config.padded_length-1) 144 | 145 | # run training 146 | init = tf.global_variables_initializer() 147 | with tf.Session() as sess: 148 | 149 | sess.run(init) 150 | 151 | model['saver'].restore(sess, FLAGS.saved_sess) 152 | 153 | print("Model restored! 
Last step run: ", sess.run(model['global_step'])) 154 | 155 | # predictions 156 | final_preds = run_inference(sess, features, generator, 1.0) 157 | captions_pred = [unpack.reshape(-1, 1) for unpack in final_preds] 158 | #captions_pred = np.concatenate(captions_pred, 1) 159 | captions_deco= [] 160 | for cap in captions_pred: 161 | dec = decode_captions(cap.reshape(-1, 1), data['idx_to_word']) 162 | dec = ' '.join(dec) 163 | captions_deco.append(dec) 164 | 165 | # saved the images with captions written on them 166 | if not os.path.exists(FLAGS.results_dir): 167 | os.makedirs(FLAGS.results_dir) 168 | for j in range(len(captions_deco)): 169 | this_image_name = all_image_names['file_name'].values[j] 170 | img_name = os.path.join(FLAGS.results_dir, this_image_name) 171 | img = imread(os.path.join(FLAGS.test_dir, this_image_name)) 172 | write_text_on_image(img, img_name, captions_deco[j]) 173 | print("\ndone.") 174 | 175 | if __name__ == '__main__': 176 | parser = argparse.ArgumentParser() 177 | parser.add_argument( 178 | '--pretrain_dir', 179 | type=str, 180 | default= '/tmp/imagenet/', 181 | help="""\ 182 | Path to pretrained model (if not found, will download from web)\ 183 | """ 184 | ) 185 | parser.add_argument( 186 | '--test_dir', 187 | type=str, 188 | default= '/home/ubuntu/COCO/testImages/', 189 | help="""\ 190 | Path to dir of test images to be predicted\ 191 | """ 192 | ) 193 | parser.add_argument( 194 | '--results_dir', 195 | type=str, 196 | default= '/home/ubuntu/COCO/savedTestImages/', 197 | help="""\ 198 | Path to dir of predicted test images\ 199 | """ 200 | ) 201 | parser.add_argument( 202 | '--saved_sess', 203 | type=str, 204 | default= "/home/ubuntu/COCO/savedSession/model0.ckpt", 205 | help="""\ 206 | Path to saved session\ 207 | """ 208 | ) 209 | parser.add_argument( 210 | '--dict_file', 211 | type=str, 212 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json', 213 | help="""\ 214 | Path to dictionary file\ 215 | """ 216 | ) 217 | FLAGS, unparsed = parser.parse_known_args() 218 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | -------------------------------------------------------------------------------- /image2txt model/inference_on_folder_sample.py: -------------------------------------------------------------------------------- 1 | 2 | """Predict captions on test images using trained model, with greedy sample method""" 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import tensorflow as tf 9 | 10 | from datetime import datetime 11 | import configuration 12 | from ShowAndTellModel import build_model 13 | from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions 14 | from image_utils import image_from_url, write_text_on_image 15 | import numpy as np 16 | import scipy.misc 17 | from scipy.misc import imread 18 | import pandas as pd 19 | import os 20 | from six.moves import urllib 21 | import sys 22 | import tarfile 23 | import json 24 | import argparse 25 | 26 | model_config = configuration.ModelConfig() 27 | training_config = configuration.TrainingConfig() 28 | 29 | FLAGS = None 30 | verbose = True 31 | mode = 'inference' 32 | 33 | pretrain_model_name = 'classify_image_graph_def.pb' 34 | layer_to_extract = 'pool_3:0' 35 | MODEL_URL = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 36 | 37 | def maybe_download_and_extract(): 38 | """Download and extract model tar 
file.""" 39 | dest_directory = FLAGS.pretrain_dir 40 | if not os.path.exists(dest_directory): 41 | os.makedirs(dest_directory) 42 | filename = MODEL_URL.split('/')[-1] 43 | filepath = os.path.join(dest_directory, filename) 44 | if not os.path.exists(filepath): 45 | def _progress(count, block_size, total_size): 46 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 47 | filename, float(count * block_size) / float(total_size) * 100.0)) 48 | sys.stdout.flush() 49 | filepath, _ = urllib.request.urlretrieve(MODEL_URL, filepath, _progress) 50 | print() 51 | statinfo = os.stat(filepath) 52 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 53 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 54 | 55 | def create_graph(): 56 | """Creates a graph from saved GraphDef file and returns a saver.""" 57 | # Creates graph from saved graph_def.pb. 58 | with tf.gfile.FastGFile(os.path.join( 59 | FLAGS.pretrain_dir, pretrain_model_name), 'rb') as f: 60 | graph_def = tf.GraphDef() 61 | graph_def.ParseFromString(f.read()) 62 | _ = tf.import_graph_def(graph_def, name='') 63 | 64 | def extract_features(image_dir): 65 | 66 | if not os.path.exists(image_dir): 67 | print("image_dir does not exit!") 68 | return None 69 | 70 | maybe_download_and_extract() 71 | 72 | create_graph() 73 | 74 | with tf.Session() as sess: 75 | # Some useful tensors: 76 | # 'softmax:0': A tensor containing the normalized prediction across 77 | # 1000 labels. 78 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 79 | # float description of the image. 80 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 81 | # encoding of the image. 82 | # Runs the softmax tensor by feeding the image_data as input to the graph. 83 | final_array = [] 84 | extract_tensor = sess.graph.get_tensor_by_name(layer_to_extract) 85 | counter = 0 86 | print("There are total " + str(len(os.listdir(image_dir))) + " images to process.") 87 | all_image_names = os.listdir(image_dir) 88 | all_image_names = pd.DataFrame({'file_name':all_image_names}) 89 | 90 | for img in all_image_names['file_name'].values: 91 | 92 | temp_path = os.path.join(image_dir, img) 93 | 94 | image_data = tf.gfile.FastGFile(temp_path, 'rb').read() 95 | 96 | predictions = sess.run(extract_tensor, {'DecodeJpeg/contents:0': image_data}) 97 | predictions = np.squeeze(predictions) 98 | 99 | final_array.append(predictions) 100 | 101 | final_array = np.array(final_array) 102 | return final_array, all_image_names 103 | 104 | 105 | def step_inference(sess, features, model, keep_prob): 106 | 107 | batch_size = features.shape[0] 108 | 109 | captions_in = np.ones((batch_size, 1)) # token index is one 110 | 111 | state = None 112 | final_preds = [] 113 | current_pred = captions_in 114 | mask = np.zeros((batch_size, model_config.padded_length)) 115 | mask[:, 0] = 1 116 | 117 | # get initial state using image feature 118 | feed_dict = {model['image_feature']: features, 119 | model['keep_prob']: keep_prob} 120 | state = sess.run(model['initial_state'], feed_dict=feed_dict) 121 | 122 | # start to generate sentences 123 | for t in range(model_config.padded_length): 124 | feed_dict={model['input_seqs']: current_pred, 125 | model['initial_state']: state, 126 | model['input_mask']: mask, 127 | model['keep_prob']: keep_prob} 128 | 129 | current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict) 130 | 131 | current_pred = current_pred.reshape(-1, 1) 132 | 133 | final_preds.append(current_pred) 134 | 135 | return 
final_preds 136 | 137 | def main(_): 138 | 139 | # load dictionary 140 | data = {} 141 | with open(FLAGS.dict_file, 'r') as f: 142 | dict_data = json.load(f) 143 | for k, v in dict_data.items(): 144 | data[k] = v 145 | data['idx_to_word'] = {int(k):v for k, v in data['idx_to_word'].items()} 146 | 147 | # extract all features 148 | features, all_image_names = extract_features(FLAGS.test_dir) 149 | 150 | # Build the TensorFlow graph and train it 151 | g = tf.Graph() 152 | with g.as_default(): 153 | num_of_images = len(os.listdir(FLAGS.test_dir)) 154 | print("Inferencing on {} images".format(num_of_images)) 155 | 156 | # Build the model. 157 | model = build_model(model_config, mode, inference_batch = num_of_images) 158 | 159 | # run training 160 | init = tf.global_variables_initializer() 161 | with tf.Session() as sess: 162 | 163 | sess.run(init) 164 | 165 | model['saver'].restore(sess, FLAGS.saved_sess) 166 | 167 | print("Model restored! Last step run: ", sess.run(model['global_step'])) 168 | 169 | # predictions 170 | final_preds = step_inference(sess, features, model, 1.0) 171 | 172 | captions_pred = [unpack.reshape(-1, 1) for unpack in final_preds] 173 | captions_pred = np.concatenate(captions_pred, 1) 174 | captions_deco = decode_captions(captions_pred, data['idx_to_word']) 175 | 176 | # saved the images with captions written on them 177 | if not os.path.exists(FLAGS.results_dir): 178 | os.makedirs(FLAGS.results_dir) 179 | for j in range(len(captions_deco)): 180 | this_image_name = all_image_names['file_name'].values[j] 181 | img_name = os.path.join(FLAGS.results_dir, this_image_name) 182 | img = imread(os.path.join(FLAGS.test_dir, this_image_name)) 183 | write_text_on_image(img, img_name, captions_deco[j]) 184 | print("\ndone.") 185 | 186 | if __name__ == '__main__': 187 | parser = argparse.ArgumentParser() 188 | parser.add_argument( 189 | '--pretrain_dir', 190 | type=str, 191 | default= '/tmp/imagenet/', 192 | help="""\ 193 | Path to pretrained model (if not found, will download from web)\ 194 | """ 195 | ) 196 | parser.add_argument( 197 | '--test_dir', 198 | type=str, 199 | default= '/home/ubuntu/COCO/testImages/', 200 | help="""\ 201 | Path to dir of test images to be predicted\ 202 | """ 203 | ) 204 | parser.add_argument( 205 | '--results_dir', 206 | type=str, 207 | default= '/home/ubuntu/COCO/savedTestImages/', 208 | help="""\ 209 | Path to dir of predicted test images\ 210 | """ 211 | ) 212 | parser.add_argument( 213 | '--saved_sess', 214 | type=str, 215 | default= "/home/ubuntu/COCO/savedSession/model0.ckpt", 216 | help="""\ 217 | Path to saved session\ 218 | """ 219 | ) 220 | parser.add_argument( 221 | '--dict_file', 222 | type=str, 223 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json', 224 | help="""\ 225 | Path to dictionary file\ 226 | """ 227 | ) 228 | FLAGS, unparsed = parser.parse_known_args() 229 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | -------------------------------------------------------------------------------- /image2txt model/prepare_captions.py: -------------------------------------------------------------------------------- 1 | 2 | """Data preparation for training image captioning model 3 | This script will do the followings: 4 | 5 | 1) Come up with a vocab list by pooling all training and val captions 6 | 2) Convert each word from captions to an integer based on the vocab list 7 | 3) Produce image-name-index mapping, that maps an image to an integer based on its name (e.g. 
COCO_train2014_000000417432.jpg -> 1) 8 | 4) Rename all images using the image-name-index mapping above 9 | """ 10 | 11 | import json 12 | import os 13 | import collections 14 | import tensorflow as tf 15 | import re 16 | import h5py 17 | import argparse 18 | import sys 19 | import numpy as np 20 | import pandas as pd 21 | 22 | FLAGS = None 23 | BUFFER_TOKENS = ['', '', '', ''] 24 | 25 | def _parse_sentence(s): 26 | s = s.replace('.', '') 27 | s = s.replace(',', '') 28 | s = s.replace('"', '') 29 | s = s.replace("'", '') 30 | s = s.lower() 31 | s = re.sub("\s\s+", " ", s) 32 | s = s.split(' ') 33 | return s 34 | 35 | def preprocess_json_files(path_to_dir): 36 | """Extract captions from each file and combine into lists, as well as image ids, and returned as dict""" 37 | assert os.path.exists(path_to_dir), 'Path to directory of files does not exist!' 38 | results = {} 39 | for file in os.listdir(path_to_dir): 40 | if 'captions_train2014' not in file and 'captions_val2014' not in file: 41 | print("Skipping file {}".format(file)) 42 | continue 43 | temp_path = os.path.join(path_to_dir, file) 44 | with open(temp_path, 'r') as f: 45 | data = json.load(f) 46 | caps = data['annotations'] 47 | images = [item['image_id'] for item in caps] 48 | urls = {} 49 | for img in data['images']: 50 | urls[img['id']] = img['flickr_url'] 51 | caps = [_parse_sentence(item['caption']) for item in caps] 52 | results[file] = (caps, images, urls) 53 | del data 54 | # return dict of each file, having list of captions and image_ids 55 | """ 56 | results is a dict of two files (train and val), each of which has a caps list (results[file1][0]) and a images list (results[file1][1]), and urls dict 57 | (results[file1][2]). cap list is a list of sentences(list of words), images list is a list of image ids(integers), and urls dict is a dict mapping each 58 | image id to its url 59 | """ 60 | return results 61 | 62 | def rename_images(dir, image_id_to_idx): 63 | image_dict = pd.read_csv(image_id_to_idx) # cols: image_idx, image_id 64 | image_dict = image_dict.set_index('image_id') 65 | image_dict = image_dict['image_index'].to_dict() 66 | for img_name in os.listdir(dir): 67 | original_img_path = os.path.join(dir, img_name) 68 | temp_num = int(re.split('\.|_', img_name)[-2]) 69 | temp_num = image_dict[temp_num] # convert image id to idx 70 | new_img_path = os.path.join(dir, '{0}.jpg'.format(temp_num)) 71 | os.rename(original_img_path, new_img_path) 72 | print("Renaming images for folder {} done. ".format(dir)) 73 | 74 | def main(_): 75 | 76 | ## get the vocaboluary 77 | list_of_all_words = None 78 | results = preprocess_json_files(FLAGS.file_dir) 79 | 80 | for k, v in results.items(): 81 | if list_of_all_words is None: 82 | list_of_all_words = results[k][0].copy() 83 | else: 84 | list_of_all_words += results[k][0] 85 | list_of_all_words = [item for sublist in list_of_all_words for item in sublist] 86 | counter = collections.Counter(list_of_all_words) 87 | vocab = counter.most_common(FLAGS.total_vocab) 88 | print("\nVocab generated! 
Most, median and least frequent words from the vocab are: \n{0}\n{1}\n{2}\n".format(vocab[0], vocab[int(FLAGS.total_vocab/2)], vocab[-1])) 89 | 90 | ## create word_to_idx, and idx_to_word 91 | vocab = [i[0] for i in vocab] 92 | word_to_idx = {} 93 | idx_to_word = {} 94 | # add in BUFFER_TOKENS 95 | for i in range(len(BUFFER_TOKENS)): 96 | idx_to_word[int(i)] = BUFFER_TOKENS[i] 97 | word_to_idx[BUFFER_TOKENS[i]] = i 98 | 99 | for i in range(len(vocab)): 100 | word_to_idx[vocab[i]] = i + len(BUFFER_TOKENS) 101 | idx_to_word[int(i + len(BUFFER_TOKENS))] = vocab[i] 102 | 103 | word_dict = {} 104 | word_dict['idx_to_word'] = idx_to_word 105 | word_dict['word_to_idx'] = word_to_idx 106 | with open(os.path.join(FLAGS.file_dir, 'coco2014_vocab.json'), 'w') as f: 107 | json.dump(word_dict, f) 108 | 109 | ## convert sentences into encoding/integers 110 | # pad all sentence to length of FLAGS.padding_len - 2 111 | def _convert_sentence_to_numbers(s): 112 | """Convert a sentence s (a list of words) to list of numbers using word_to_idx""" 113 | UNK_IDX = BUFFER_TOKENS.index('') 114 | NULL_IDX = BUFFER_TOKENS.index('') 115 | END_IDX = BUFFER_TOKENS.index('') 116 | s_encoded = [word_to_idx.get(w, UNK_IDX) for w in s] 117 | s_encoded += [END_IDX] 118 | s_encoded += [NULL_IDX] * (FLAGS.padding_len - 1 - len(s_encoded)) 119 | return s_encoded 120 | 121 | h = h5py.File(os.path.join(FLAGS.file_dir,'coco2014_captions.h5'), 'w') 122 | for k, _ in results.items(): 123 | results_to_save = {} 124 | all_captions = results[k][0] # list of lists of words 125 | all_images = results[k][1] 126 | all_urls = results[k][2] 127 | all_captions = [_convert_sentence_to_numbers(s) for s in all_captions] # list of numbers 128 | valid_rows = [i for i in range(len(all_captions)) if len(all_captions[i]) == FLAGS.padding_len-1] 129 | all_captions= [row for row in all_captions if len(row) == FLAGS.padding_len-1] 130 | all_captions = np.array(all_captions) 131 | all_images = np.array(all_images) 132 | all_images = all_images[valid_rows] 133 | assert all_images.shape[0] == all_captions.shape[0], "Processing error! all_captions and all_images diff in length." 
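# At this point each caption row holds word ids followed by the END id and
# NULL padding, with total length padding_len - 1. For example (illustrative,
# actual ids depend on the generated vocab), "a dog runs" with padding_len=25
# becomes [id(a), id(dog), id(runs), END, NULL, ..., NULL] with 24 entries;
# the START id is prepended below to make each row padding_len long.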
134 | # concatenate START and END tokens at two sides 135 | START_TOKEN = BUFFER_TOKENS.index('') 136 | #END_TOKEN = BUFFER_TOKENS.index('') 137 | col_start = np.array([START_TOKEN] * all_images.shape[0]).reshape(-1, 1) 138 | #col_end = np.array([END_TOKEN] * all_images.shape[0]).reshape(-1, 1) 139 | all_captions = np.hstack([col_start, all_captions]) 140 | 141 | ## create dicts that maps image ids to 0,...,total_images - image_idx_to_id, image_id_to_idx 142 | image_ids = set(all_images) 143 | image_idx = list(range(len(image_ids))) 144 | image_id_to_idx = {} 145 | image_idx_to_id = {} 146 | for idx, id in enumerate(image_ids): 147 | image_id_to_idx[id] = idx 148 | image_idx_to_id[idx] = id 149 | all_images_idx = np.array([image_id_to_idx.get(id) for id in all_images]) 150 | 151 | ## save all the data 152 | if 'train' in k: 153 | h.create_dataset('train_captions', data=all_captions) 154 | h.create_dataset('train_image_idx', data=all_images_idx) 155 | df = pd.DataFrame.from_dict(image_id_to_idx, 'index') 156 | df['image_id'] = df.index.values 157 | df.columns = ['image_index', 'image_id'] 158 | df.to_csv(os.path.join(FLAGS.file_dir, 'train_image_id_to_idx.csv'), index = False) 159 | 160 | ## write urls file to local as train2014_urls.txt 161 | with open(os.path.join(FLAGS.file_dir, 'train2014_urls.txt'), 'w') as f: 162 | for idx in range(len(image_idx_to_id)): 163 | this_url = all_urls[image_idx_to_id[idx]] 164 | f.write(this_url + '\n') 165 | 166 | elif 'val' in k: 167 | h.create_dataset('val_captions', data=all_captions) 168 | h.create_dataset('val_image_idx', data=all_images_idx) 169 | df = pd.DataFrame.from_dict(image_id_to_idx, 'index') 170 | df['image_id'] = df.index.values 171 | df.columns = ['image_index', 'image_id'] 172 | df.to_csv(os.path.join(FLAGS.file_dir, 'val_image_id_to_idx.csv'), index = False) 173 | 174 | ## write urls file to local as val2014_urls.txt 175 | with open(os.path.join(FLAGS.file_dir, 'val2014_urls.txt'), 'w') as f: 176 | for idx in range(len(image_idx_to_id)): 177 | this_url = all_urls[image_idx_to_id[idx]] 178 | f.write(this_url + '\n') 179 | else: 180 | print("Strange file name found in dir: {0}, \nit does not belong to train nor val, so it is not able to save results!".format(k)) 181 | 182 | h.close() 183 | print("Data generation done.\n Start renaming images in sequence ...") 184 | 185 | if FLAGS.train_image_dir != '': 186 | train_dict = os.path.join(FLAGS.file_dir, 'train_image_id_to_idx.csv') 187 | rename_images(FLAGS.train_image_dir, train_dict) 188 | 189 | if FLAGS.val_image_dir != '': 190 | val_dict = os.path.join(FLAGS.file_dir, 'val_image_id_to_idx.csv') 191 | rename_images(FLAGS.val_image_dir, val_dict) 192 | 193 | print("all done. ") 194 | 195 | if __name__ == '__main__': 196 | parser = argparse.ArgumentParser() 197 | parser.add_argument( 198 | '--file_dir', 199 | type=str, 200 | #default='C:\\Users\\WAWEIMIN\\Google Drive\\ShowAndTellWeimin\\coco_captioning\\original_captioning', 201 | default= '/home/ubuntu/COCO/dataset/COCO_captioning/', 202 | help="""\ 203 | Path to captions_train2014.json, captions_val2014.json\ 204 | """ 205 | ) 206 | parser.add_argument( 207 | '--total_vocab', 208 | type=int, 209 | default=1000, 210 | help='Total number of vacobulary to use.' 211 | ) 212 | parser.add_argument( 213 | '--padding_len', 214 | type=int, 215 | default=17, 216 | help='Total len of padding the sentence.' 
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--file_dir',
        type=str,
        # default='C:\\Users\\WAWEIMIN\\Google Drive\\ShowAndTellWeimin\\coco_captioning\\original_captioning',
        default='/home/ubuntu/COCO/dataset/COCO_captioning/',
        help="""\
        Path to captions_train2014.json, captions_val2014.json\
        """
    )
    parser.add_argument(
        '--total_vocab',
        type=int,
        default=1000,
        help='Total number of vocabulary words to use.'
    )
    parser.add_argument(
        '--padding_len',
        type=int,
        default=17,
        help='Total length to which each sentence is padded.'
    )
    parser.add_argument(
        '--train_image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/train2014',
        help='Absolute path to training dir containing images that are to be renamed.'
    )
    parser.add_argument(
        '--val_image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/val2014',
        help='Absolute path to val dir containing images that are to be renamed.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
-------------------------------------------------------------------------------- /image2txt model/prepare_glove_matrix.py: --------------------------------------------------------------------------------

"""Create the word-vector initialization matrix using GloVe vectors."""

import json
import numpy as np

TOTAL_VOCAB = 5004
EMBED_DIM = 300
INITIALIZER_SCALE = 0.08

dict_file = '/home/ubuntu/COCO/dataset/COCO_captioning/coco2014_vocab.json'
glove_file = '/home/ubuntu/COCO/GloVe/glove.42B.300d.txt'
save_glove_mat = '/home/ubuntu/COCO/dataset/COCO_captioning/glove_vocab'

# words not found in GloVe keep a small random initialization
glove_matrix = np.random.uniform(-INITIALIZER_SCALE, INITIALIZER_SCALE, (TOTAL_VOCAB, EMBED_DIM))

data = {}
with open(dict_file, 'r') as f:
    dict_data = json.load(f)
    for k, v in dict_data.items():
        data[k] = v
# convert string keys to int
data['idx_to_word'] = {int(k): v for k, v in data['idx_to_word'].items()}
word_to_idx = data['word_to_idx']

total_word_replaced = 0
print_every = 100
with open(glove_file, 'r') as f:
    for line in f:
        line = line.strip()
        word = line.split(' ')[0]
        if word in word_to_idx:
            total_word_replaced += 1
            if total_word_replaced % print_every == 0:
                print(total_word_replaced)

            line = line.split(' ')[1:]
            word_vec = np.array([float(i) for i in line])

            glove_matrix[word_to_idx[word]] = word_vec

        if total_word_replaced == TOTAL_VOCAB - 4:
            break

np.save(save_glove_mat, glove_matrix)
-------------------------------------------------------------------------------- /image2txt model/rename_images_in_sequence.py: --------------------------------------------------------------------------------
import argparse
import os.path, os
import re
import sys
import tarfile

import numpy as np
import pandas as pd
from six.moves import urllib
import tensorflow as tf

FLAGS = None

def main(_):
    image_dict = pd.read_csv(FLAGS.dict_dir)  # cols: image_index, image_id
    image_dict = image_dict.set_index('image_id')
    image_dict = image_dict['image_index'].to_dict()
    for img_name in os.listdir(FLAGS.image_dir):
        original_img_path = os.path.join(FLAGS.image_dir, img_name)
        temp_num = int(re.split('\.|_', img_name)[-2])
        temp_num = image_dict[temp_num]  # convert image id to idx
        new_img_path = os.path.join(FLAGS.image_dir, '{0}.jpg'.format(temp_num))
        os.rename(original_img_path, new_img_path)
    print("done.")

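# Illustrative example of the renaming in main() above (hypothetical id): for a file
# named 'COCO_train2014_000000123456.jpg', re.split('\.|_', img_name)[-2] yields
# '000000123456', int() turns it into the COCO image id 123456, and the id -> index
# mapping loaded from the csv renames the file to '<image_index>.jpg'.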
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--dict_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/COCO_captioning/train_image_id_to_idx.csv',
        help="""\
        dir that contains train_image_id_to_idx.csv or val_image_id_to_idx.csv\
        """
    )
    parser.add_argument(
        '--image_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/train2014',
        help='Absolute path to directory containing images that are to be renamed.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
-------------------------------------------------------------------------------- /image2txt model/test.py: --------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from datetime import datetime
import configuration
from ShowAndTellModel import build_model
from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from image_utils import image_from_url, write_text_on_image
import numpy as np
import scipy.misc

model_config = configuration.ModelConfig()
training_config = configuration.TrainingConfig()

verbose = True
mode = 'inference'
directory = '/home/ubuntu/COCO/'

def _step_test(sess, data, batch_size, model, keep_prob):
    """
    Generate captions for a minibatch of validation data (no gradient update).
    """
    # Make a minibatch of validation data
    minibatch = sample_coco_minibatch(data,
                                      batch_size=batch_size,
                                      split='val')
    captions, features, urls = minibatch

    # use the ground-truth <START> column as the first input token
    captions_in = captions[:, 0].reshape(-1, 1)

    state = None
    final_preds = []
    current_pred = captions_in
    mask = np.zeros((batch_size, model_config.padded_length))
    mask[:, 0] = 1

    # get initial state using image feature
    feed_dict = {model['image_feature']: features,
                 model['keep_prob']: keep_prob}
    state = sess.run(model['initial_state'], feed_dict=feed_dict)

    # start to generate sentences
    for t in range(model_config.padded_length):
        feed_dict = {model['input_seqs']: current_pred,
                     model['initial_state']: state,
                     model['input_mask']: mask,
                     model['keep_prob']: keep_prob}

        current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict)

        current_pred = current_pred.reshape(-1, 1)

        final_preds.append(current_pred)

    return final_preds, urls

# load data
data = load_coco_data(base_dir='/home/ubuntu/COCO/dataset/COCO_captioning/')

TOTAL_INFERENCE_STEP = 1
BATCH_SIZE_INFERENCE = 32

# Build the TensorFlow graph and run inference
g = tf.Graph()
with g.as_default():
    # Build the model.
    model = build_model(model_config, mode, inference_batch=BATCH_SIZE_INFERENCE)

    # initialize variables and restore the trained session
    init = tf.global_variables_initializer()
    with tf.Session() as sess:

        sess.run(init)

        model['saver'].restore(sess, directory + "savedSession/model0.ckpt")

        print("Model restored! Last step run: ", sess.run(model['global_step']))
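        # Note on _step_test above: decoding starts from the ground-truth <START>
        # column, the image feature is only used to set the initial LSTM state, and
        # at every step the word ids returned in model['preds'] are fed back in as
        # the next input, producing one token per iteration for padded_length steps.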
        for i in range(TOTAL_INFERENCE_STEP):
            captions_pred, urls = _step_test(sess, data, BATCH_SIZE_INFERENCE, model, 1.0)  # list of (batch_size, 1) arrays, one per time step
            captions_pred = [unpack.reshape(-1, 1) for unpack in captions_pred]
            captions_pred = np.concatenate(captions_pred, 1)

            captions_deco = decode_captions(captions_pred, data['idx_to_word'])

            for j in range(len(captions_deco)):
                img_name = directory + 'image_' + str(j) + '.jpg'
                img = image_from_url(urls[j])
                write_text_on_image(img, img_name, captions_deco[j])
-------------------------------------------------------------------------------- /image2txt model/train.py: --------------------------------------------------------------------------------

"""Train the model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from datetime import datetime
import configuration
from ShowAndTellModel import build_model
from coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from image_utils import image_from_url, write_text_on_image
import numpy as np
import os
import sys
import argparse

model_config = configuration.ModelConfig()
training_config = configuration.TrainingConfig()

FLAGS = None
savedModelName = 'model1.0.ckpt'
mode = 'train'

def _run_validation(sess, data, batch_size, model, keep_prob):
    """
    Generate captions for a minibatch of validation data (no gradient update).
    """
    # Make a minibatch of validation data
    minibatch = sample_coco_minibatch(data,
                                      batch_size=batch_size,
                                      split='val')
    captions, features, urls = minibatch

    captions_in = captions[:, 0].reshape(-1, 1)

    state = None
    final_preds = []
    current_pred = captions_in
    mask = np.zeros((batch_size, model_config.padded_length))
    mask[:, 0] = 1

    # get initial state using image feature
    feed_dict = {model['image_feature']: features,
                 model['keep_prob']: keep_prob}
    state = sess.run(model['initial_state'], feed_dict=feed_dict)

    # start to generate sentences
    for t in range(model_config.padded_length):
        feed_dict = {model['input_seqs']: current_pred,
                     model['initial_state']: state,
                     model['input_mask']: mask,
                     model['keep_prob']: keep_prob}

        current_pred, state = sess.run([model['preds'], model['final_state']], feed_dict=feed_dict)

        current_pred = current_pred.reshape(-1, 1)

        final_preds.append(current_pred)

    return final_preds, urls

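# Illustrative example of the input/target shift used in _step below
# (hypothetical indices; assumes <NULL> encodes to 0):
#   caption row:  [<START>, 7, 9, 4, <END>, 0, 0]
#   captions_in:  [<START>, 7, 9, 4, <END>, 0]   (all but the last column)
#   captions_out: [7, 9, 4, <END>, 0, 0]         (all but the first column)
#   mask:         [T, T, T, T, F, F]             (captions_out != <NULL>)
# so the loss is only computed on real words and the <END> token.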
68 | """ 69 | # Make a minibatch of training data 70 | minibatch = sample_coco_minibatch(data, 71 | batch_size=model_config.batch_size, 72 | split='train') 73 | captions, features, urls = minibatch 74 | 75 | captions_in = captions[:, :-1] 76 | captions_out = captions[:, 1:] 77 | 78 | mask = (captions_out != model_config._null) 79 | 80 | _, total_loss_value= sess.run([train_op, model['total_loss']], 81 | feed_dict={model['image_feature']: features, 82 | model['input_seqs']: captions_in, 83 | model['target_seqs']: captions_out, 84 | model['input_mask']: mask, 85 | model['keep_prob']: keep_prob}) 86 | 87 | return total_loss_value 88 | 89 | def main(_): 90 | # load data 91 | data = load_coco_data(FLAGS.data_dir) 92 | 93 | # force padded_length equal to padded_length - 1 94 | # model_config.padded_length = len(data['train_captions'][0]) - 1 95 | 96 | tf.reset_default_graph() 97 | 98 | # Build the TensorFlow graph and train it 99 | g = tf.Graph() 100 | with g.as_default(): 101 | 102 | # Build the model. If FLAGS.glove_vocab is null, we do not initialize the model with word vectors; if not, we initialize with glove vectors 103 | if FLAGS.glove_vocab is '': 104 | model = build_model(model_config, mode=mode) 105 | else: 106 | glove_vocab = np.load(FLAGS.glove_vocab) 107 | model = build_model(model_config, mode=mode, glove_vocab=glove_vocab) 108 | 109 | # Set up the learning rate. 110 | learning_rate_decay_fn = None 111 | learning_rate = tf.constant(training_config.initial_learning_rate) 112 | if training_config.learning_rate_decay_factor > 0: 113 | num_batches_per_epoch = (training_config.num_examples_per_epoch / model_config.batch_size) 114 | decay_steps = int(num_batches_per_epoch * 115 | training_config.num_epochs_per_decay) 116 | 117 | def _learning_rate_decay_fn(learning_rate, global_step): 118 | return tf.train.exponential_decay( 119 | learning_rate, 120 | global_step, 121 | decay_steps=decay_steps, 122 | decay_rate=training_config.learning_rate_decay_factor, 123 | staircase=True) 124 | 125 | learning_rate_decay_fn = _learning_rate_decay_fn 126 | 127 | # Set up the training ops. 
        # Set up the training ops.
        train_op = tf.contrib.layers.optimize_loss(
            loss=model['total_loss'],
            global_step=model['global_step'],
            learning_rate=learning_rate,
            optimizer=training_config.optimizer,
            clip_gradients=training_config.clip_gradients,
            learning_rate_decay_fn=learning_rate_decay_fn)

        # initialize all variables
        init = tf.global_variables_initializer()

        with tf.Session() as sess:
            sess.run(init)

            num_epochs = training_config.total_num_epochs

            num_train = data['train_captions'].shape[0]
            iterations_per_epoch = max(num_train / model_config.batch_size, 1)
            num_iterations = int(num_epochs * iterations_per_epoch)

            # Set up some variables for book-keeping
            epoch = 0
            best_val_acc = 0
            best_params = {}
            loss_history = []
            train_acc_history = []
            val_acc_history = []

            print("\n\nTotal training iter: ", num_iterations, "\n\n")
            time_now = datetime.now()
            for t in range(num_iterations):

                total_loss_value = _step(sess, data, train_op, model, model_config.lstm_dropout_keep_prob)  # run each training step

                loss_history.append(total_loss_value)

                # Print out training loss
                if FLAGS.print_every > 0 and t % FLAGS.print_every == 0:
                    print('(Iteration %d / %d) loss: %f, and time elapsed: %.2f minutes' % (
                        t + 1, num_iterations, float(loss_history[-1]), (datetime.now() - time_now).seconds / 60.0))

                # Print out some image sample results
                if FLAGS.sample_every > 0 and (t + 1) % FLAGS.sample_every == 0:
                    temp_dir = os.path.join(FLAGS.sample_dir, 'temp_dir_{}//'.format(t + 1))
                    if not os.path.exists(temp_dir):
                        os.makedirs(temp_dir)
                    captions_pred, urls = _run_validation(sess, data, model_config.batch_size, model, 1.0)  # list of (batch_size, 1) arrays, one per time step
                    captions_pred = [unpack.reshape(-1, 1) for unpack in captions_pred]
                    captions_pred = np.concatenate(captions_pred, 1)

                    captions_deco = decode_captions(captions_pred, data['idx_to_word'])

                    for j in range(len(captions_deco)):
                        img_name = os.path.join(temp_dir, 'image_{}.jpg'.format(j))
                        img = image_from_url(urls[j])
                        write_text_on_image(img, img_name, captions_deco[j])

                # save the model continuously to avoid interruption
                if FLAGS.saveModel_every > 0 and (t + 1) % FLAGS.saveModel_every == 0:
                    if not os.path.exists(FLAGS.savedSession_dir):
                        os.makedirs(FLAGS.savedSession_dir)
                    checkpoint_name = savedModelName[:-5] + '_checkpoint{}.ckpt'.format(t + 1)
                    save_path = model['saver'].save(sess, os.path.join(FLAGS.savedSession_dir, checkpoint_name))

            if not os.path.exists(FLAGS.savedSession_dir):
                os.makedirs(FLAGS.savedSession_dir)
            save_path = model['saver'].save(sess, os.path.join(FLAGS.savedSession_dir, savedModelName))
            print("done. Model saved at: ", os.path.join(FLAGS.savedSession_dir, savedModelName))

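# The checkpoints written by main() above are what the inference scripts restore
# later; e.g. test.py in this repo calls model['saver'].restore(sess, checkpoint_path)
# before generating captions.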
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--savedSession_dir',
        type=str,
        default='/home/ubuntu/COCO/savedSession/',
        help="""\
        Directory where your created model / session will be saved.\
        """
    )
    parser.add_argument(
        '--data_dir',
        type=str,
        default='/home/ubuntu/COCO/dataset/COCO_captioning/',
        help='Directory where all your training and validation data can be found.'
    )
    parser.add_argument(
        '--glove_vocab',
        type=str,
        default='',
        help='Path to the GloVe vocab matrix - glove_vocab.npy - for initialization. Empty for not using it.'
    )
    parser.add_argument(
        '--sample_dir',
        type=str,
        default='/home/ubuntu/COCO/progress_sample/',
        help='Directory where all intermediate samples will be saved.'
    )
    parser.add_argument(
        '--print_every',
        type=int,
        default=50,
        help='Num of steps to print your training loss. 0 for not printing.'
    )
    parser.add_argument(
        '--sample_every',
        type=int,
        default=5000,
        help='Num of steps to generate captions on some validation images. 0 for not sampling.'
    )
    parser.add_argument(
        '--saveModel_every',
        type=int,
        default=5000,
        help='Num of steps to save model checkpoint. 0 for not doing so.'
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
--------------------------------------------------------------------------------