├── .gitignore ├── README.md ├── __init__.py ├── codes ├── MSCOCO.py ├── __init__.py ├── caption_generator.py ├── evaluate_model.py ├── generate_caption.py ├── generate_caption_beam.py ├── image_reader.py ├── pre_extract_googlenet_features.py ├── prepocess_captions.py ├── sample_code.ipynb ├── sample_code.py ├── sample_code_jp.ipynb └── train_caption_model.py ├── data └── .gitignore ├── download.sh ├── download_jp.sh ├── evalutation_script ├── README.md ├── evalutate_caption_val.py └── generate_caption_val.py ├── experiment1 └── .gitignore ├── images ├── COCO_val2014_000000185546.jpg ├── COCO_val2014_000000192091.jpg ├── COCO_val2014_000000229948.jpg ├── COCO_val2014_000000241747.jpg ├── COCO_val2014_000000250790.jpg ├── COCO_val2014_000000277533.jpg ├── COCO_val2014_000000285505.jpg ├── COCO_val2014_000000323758.jpg ├── COCO_val2014_000000326128.jpg ├── COCO_val2014_000000397427.jpg ├── COCO_val2014_000000553761.jpg └── test_image.jpg ├── models └── .gitignore └── work └── .gitignore /.gitignore: -------------------------------------------------------------------------------- 1 | #gtignore 以外のファイルを全部無視する。 2 | .* 3 | !.gitignore 4 | 5 | codes/sample_code_work.ipynb 6 | 7 | *.pyc 8 | 9 | *.pyc 10 | 11 | codes/image_reader.pyc 12 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### I no longer maintain this repository. This implementation is not that clean and hard to use if you want to train on your own data. I re-implemented from scratch. The new one is much faster, accurate, and clean. It can even generate Chinese captions. Please see the [better implementation] (https://github.com/apple2373/chainer-caption). 2 | 3 | 4 | # image caption generation by chainer 5 | This codes are trying to reproduce the image captioning by google in CVPR 2015. 6 | Show and Tell: A Neural Image Caption Generator 7 | http://arxiv.org/abs/1411.4555 8 | 9 | The training data is MSCOCO. I used GoogleNet to extract images feature in advance (preprocessed them before training), and then trained language model to generate caption. 10 | 11 | I made pre-trained model available. The model achieves CIDEr of 0.66 for the MSCOCO validation dataset. To achieve the better score, the use of beam search is first step (not implemented yet). Also, I think the CNN has to be fine-tuned. 12 | Update: I implemented a beam search. Check the usage below. 13 | 14 | More information including some sample captions are in my blog post. 15 | http://t-satoshi.blogspot.com/2015/12/image-caption-generation-by-cnn-and-lstm.html 16 | 17 | ## requirement 18 | chainer 1.6 http://chainer.org 19 | and some more packages. 20 | !!Warning ** Be sure to use chainer 1.6.** Not the latest version. If you have another version, no guarantee to work. 21 | If you are new, I suggest you to install Anaconda (https://www.continuum.io/downloads) and then install chainer. You can watch the video below. 22 | 23 | ## I have a problem to prepare environment 24 | I prepared a video to show how you prepare environment and generate captions on ubuntu. I used a virtual machine just after installing ubuntu 14.04. If you imitate as in the video, you can generate captions. The process is almost the same for Mac. Windows is not suported because I cannot use it (Acutually chainer does not officialy support windows). 
25 | https://drive.google.com/file/d/0B046sNk0DhCDUkpwblZPME1vQzg/edit 26 | Or, some commands that might help: 27 | ``` 28 | #get and install anaconda. you might want to check the latest link. 29 | wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.1-Linux-x86_64.sh 30 | bash Anaconda2-2.4.1-Linux-x86_64.sh -b 31 | echo 'export PATH=$HOME/anaconda/bin:$PATH' >> .bashrc 32 | echo 'export PYTHONPATH=$HOME/anaconda/lib/python2.7/site-packages:$PYTHONPATH' >> .bashrc 33 | source .bashrc 34 | conda update conda -y 35 | # install chainer 36 | pip install chainer==1.6 37 | ``` 38 | 39 | ## I just want to generate caption! 40 | OK, first, you need to download the models and other preprocessed files. 41 | Then you can generate caption. 42 | 43 | IMPORTANT NOTE: 44 | Google Drive suddenly shut down the hosting service and the file downlaod no longer works. 45 | Ref: https://gsuiteupdates.googleblog.com/2015/08/deprecating-web-hosting-support-in.html 46 | 47 | I don't have time to uplaod somewhere else, but all files are here: 48 | https://drive.google.com/open?id=0B046sNk0DhCDeEczcm1vaWlCTFk 49 | 50 | ``` 51 | bash download.sh 52 | cd codes 53 | python generate_caption.py -i ../images/test_image.jpg 54 | ``` 55 | This generate a caption for ../images/test_image.jpg. If you want to use your image, you just have to indicate -i option to image that you want to generate captions. 56 | 57 | Once you set up environment, you can use it as a module.Check the ipython notebooks. This includes beam search. 58 | English:https://github.com/apple2373/chainer_caption_generation/blob/master/codes/sample_code.ipynb 59 | 60 | Also, you can try beam search as: 61 | ``` 62 | cd codes 63 | python generate_caption_beam.py -b 3 -i ../images/test_image.jpg 64 | ``` 65 | -b option indicates beam size. Default is 3. 66 | 67 | ## I want to train the model by myself. 68 | I extracted the GoogleNet features and pickled, so you use it for training. 69 | ``` 70 | cd codes 71 | python train_caption_model.py 72 | python train_caption_model.py -g 0 # to use gpu. change the number to gpu_id 73 | ``` 74 | The log and trained model will be saved to a directory (experiment1 is defalt) 75 | If you want to change, use -d option. 76 | ``` 77 | python train_caption_model.py -d ./yourdirectory 78 | ``` 79 | 80 | ## I want to train from other data. 81 | Sorry, current implementation does not support it. You need to preprocess the data. Maybe you can read and modify the code. 82 | 83 | ## I want to fine-tune CNN part. 84 | Sorry, current implementation does not support it. Maybe you can read and modify the code. 85 | 86 | ## I want to generate Japanese caption. 87 | I made pre-trained Japanese caption model available. You can download Japanese caption model with the following script. 
88 | ``` 89 | bash download.sh 90 | bash download_jp.sh 91 | ``` 92 | ``` 93 | cd codes 94 | python generate_caption.py -v ../work/index2token_jp.pkl -m ../models/caption_model_jp.chainer -i ../images/test_image.jpg 95 | ``` 96 | Japnese Notebook: https://github.com/apple2373/chainer_caption_generation/blob/master/codes/sample_code_jp.ipynb 97 | Japnese Blogpost: http://t-satoshi.blogspot.com/2016/01/blog-post_1.html 98 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/__init__.py -------------------------------------------------------------------------------- /codes/MSCOCO.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import json 5 | import nltk 6 | 7 | def read_MSCOCO_json(file_place): 8 | 9 | f = open(file_place, 'r') 10 | jsonData = json.load(f) 11 | f.close() 12 | 13 | captions={}#key is sentence_length. 14 | caption_id2tokens={} 15 | caption_id2image_id={} 16 | 17 | for caption_data in jsonData['annotations']: 18 | caption_id=caption_data['id'] 19 | image_id=caption_data['image_id'] 20 | caption=caption_data['caption'] 21 | 22 | caption=caption.replace('\n', '').strip().lower() 23 | if caption[-1]=='.':#to delete the last period. 24 | caption=caption[0:-1] 25 | 26 | caption_tokens=[''] 27 | caption_tokens += nltk.word_tokenize(caption) 28 | caption_tokens.append("") 29 | caption_length=len(caption_tokens) 30 | 31 | if caption_length in captions: 32 | captions[caption_length].add(caption_id) 33 | else: 34 | captions[caption_length]=set([caption_id]) 35 | 36 | caption_id2tokens[caption_id]=caption_tokens 37 | caption_id2image_id[caption_id]=image_id 38 | 39 | return captions,caption_id2tokens,caption_id2image_id 40 | -------------------------------------------------------------------------------- /codes/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/codes/__init__.py -------------------------------------------------------------------------------- /codes/caption_generator.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python 3 | 4 | ''' 5 | If you want to integrate caption generation system for your system, you can import this module. 6 | ''' 7 | 8 | import os 9 | #comment out the below if you want to do type check. Remeber this have to be done BEFORE import chainer 10 | #os.environ["CHAINER_TYPE_CHECK"] = "0" 11 | import chainer 12 | #If the below is false, the type check is disabled. 13 | #print(chainer.functions.Linear(1,1).type_check_enable) 14 | 15 | import numpy as np 16 | import math 17 | from chainer import cuda 18 | import chainer.functions as F 19 | from chainer import cuda, Function, FunctionSet, gradient_check, Variable, optimizers 20 | from chainer import serializers 21 | import pickle 22 | import copy 23 | from image_reader import Image_reader 24 | 25 | class Caption_generator(object): 26 | def __init__(self,caption_model_place,cnn_model_place,index2word_place,gpu_id=-1,beamsize=3): 27 | #basic paramaters you need to modify 28 | self.gpu_id=gpu_id# GPU ID. 
if you want to use cpu, -1 29 | self.beamsize=beamsize 30 | 31 | #Gpu Setting 32 | global xp 33 | if self.gpu_id >= 0: 34 | xp = cuda.cupy 35 | cuda.get_device(gpu_id).use() 36 | else: 37 | xp=np 38 | 39 | # Prepare dataset 40 | with open(index2word_place, 'r') as f: 41 | self.index2word = pickle.load(f) 42 | vocab=self.index2word 43 | 44 | #Load Caffe Model 45 | with open(cnn_model_place, 'r') as f: 46 | self.func = pickle.load(f) 47 | 48 | #Model Preparation 49 | image_feature_dim=1024#dimension of image feature 50 | self.n_units = 512 #number of units per layer 51 | n_units = 512 52 | self.model = FunctionSet() 53 | self.model.img_feature2vec=F.Linear(image_feature_dim, n_units)#CNN(I)の最後のレイヤーに相当。#parameter W,b 54 | self.model.embed=F.EmbedID(len(vocab), n_units)#W_e*S_tに相当 #parameter W 55 | self.model.l1_x=F.Linear(n_units, 4 * n_units)#parameter W,b 56 | self.model.l1_h=F.Linear(n_units, 4 * n_units)#parameter W,b 57 | self.model.out=F.Linear(n_units, len(vocab))#parameter W,b 58 | serializers.load_hdf5(caption_model_place, self.model)#read pre-trained model 59 | 60 | #To GPU 61 | if gpu_id >= 0: 62 | self.model.to_gpu() 63 | self.func.to_gpu() 64 | 65 | #to avoid overflow. 66 | #I don't know why, but this model overflows at the first time only with CPU. 67 | #So I intentionally make overflow so that it never happns after that. 68 | if gpu_id < 0: 69 | numpy_image = np.ones((3, 224,224), dtype=np.float32) 70 | self.generate(numpy_image) 71 | 72 | def feature_exractor(self,x_chainer_variable): #to extract image feature by CNN. 73 | y, = self.func(inputs={'data': x_chainer_variable}, outputs=['pool5/7x7_s1'], 74 | disable=['loss1/ave_pool', 'loss2/ave_pool','loss3/classifier'], 75 | train=False) 76 | return y 77 | 78 | def forward_one_step_for_image(self,img_feature, state, volatile='on'): 79 | x = img_feature#img_feature is chainer.variable. 80 | h0 = self.model.img_feature2vec(x) 81 | h1_in = self.model.l1_x(F.dropout(h0,train=False)) + self.model.l1_h(state['h1']) 82 | c1, h1 = F.lstm(state['c1'], h1_in) 83 | y = self.model.out(F.dropout(h1,train=False))#don't forget to change drop out into non train mode. 
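#pack the updated LSTM state and return it with a softmax over the vocabulary; the image feature is fed only at this first step, and later steps go through forward_one_step with the embedding of the previously generated word.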
84 | state = {'c1': c1, 'h1': h1} 85 | return state, F.softmax(y) 86 | 87 | #forward_one_step is after the CNN layer, 88 | #h0 is n_units dimensional vector (embedding) 89 | def forward_one_step(self,cur_word, state, volatile='on'): 90 | x = chainer.Variable(cur_word, volatile) 91 | h0 = self.model.embed(x) 92 | h1_in = self.model.l1_x(F.dropout(h0,train=False)) + self.model.l1_h(state['h1']) 93 | c1, h1 = F.lstm(state['c1'], h1_in) 94 | y = self.model.out(F.dropout(h1,train=False)) 95 | state = {'c1': c1, 'h1': h1} 96 | return state, F.softmax(y) 97 | 98 | def beam_search(self,sentence_candidates,final_sentences,depth=1,beamsize=3): 99 | volatile=True 100 | next_sentence_candidates_temp=list() 101 | for sentence_tuple in sentence_candidates: 102 | cur_sentence=sentence_tuple[0] 103 | cur_index=sentence_tuple[0][-1] 104 | cur_index_xp=xp.array([cur_index],dtype=np.int32) 105 | cur_state=sentence_tuple[1] 106 | cur_log_likely=sentence_tuple[2] 107 | 108 | state, predicted_word = self.forward_one_step(cur_index_xp,cur_state, volatile=volatile) 109 | predicted_word_np=cuda.to_cpu(predicted_word.data) 110 | top_indexes=(-predicted_word_np).argsort()[0][:beamsize] 111 | 112 | for index in np.nditer(top_indexes): 113 | index=int(index) 114 | probability=predicted_word_np[0][index] 115 | next_sentence=copy.deepcopy(cur_sentence) 116 | next_sentence.append(index) 117 | log_likely=math.log(probability) 118 | next_log_likely=cur_log_likely+log_likely 119 | next_sentence_candidates_temp.append((next_sentence,state,next_log_likely))# make each sentence tuple 120 | 121 | prob_np_array=np.array([sentence_tuple[2] for sentence_tuple in next_sentence_candidates_temp]) 122 | top_candidates_indexes=(-prob_np_array).argsort()[:beamsize] 123 | next_sentence_candidates=list() 124 | for i in top_candidates_indexes: 125 | sentence_tuple=next_sentence_candidates_temp[i] 126 | index=sentence_tuple[0][-1] 127 | if self.index2word[index]=='': 128 | final_sentence=sentence_tuple[0] 129 | final_likely=sentence_tuple[2] 130 | final_probability=math.exp(final_likely) 131 | final_sentences.append((final_sentence,final_probability,final_likely)) 132 | else: 133 | next_sentence_candidates.append(sentence_tuple) 134 | 135 | if len(final_sentences)>=beamsize: 136 | return final_sentences 137 | elif depth==50: 138 | return final_sentences 139 | else: 140 | depth+=1 141 | return self.beam_search(next_sentence_candidates,final_sentences,depth,beamsize) 142 | 143 | def generate(self,numpy_image): 144 | '''Generate Caption for an Numpy Image array 145 | 146 | Args: 147 | numpy_image: numpy image 148 | 149 | Returns: 150 | list of generated captions. The structure is [caption,caption,caption,...] 151 | Where caption = {"sentence":This is a generated sentence, "probability": The probability of the generated sentence} 152 | 153 | ''' 154 | 155 | #initial step 156 | x_batch = np.ndarray((1, 3, 224,224), dtype=np.float32) 157 | x_batch[0]=numpy_image 158 | 159 | volatile=True 160 | if self.gpu_id >=0: 161 | x_batch_chainer = Variable(cuda.to_gpu(x_batch),volatile=volatile) 162 | else: 163 | x_batch_chainer = Variable(x_batch,volatile=volatile) 164 | 165 | batchsize=1 166 | #image is chainer.variable. 
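#beam-search initialization: zero the LSTM state, run the CNN once on the image, take a single LSTM step conditioned on the feature, and seed the candidate list with the most likely first word.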
167 | state = {name: chainer.Variable(xp.zeros((batchsize, self.n_units),dtype=np.float32),volatile) for name in ('c1', 'h1')} 168 | img_feature=self.feature_exractor(x_batch_chainer) 169 | state, predicted_word = self.forward_one_step_for_image(img_feature,state, volatile=volatile) 170 | 171 | if self.gpu_id >=0: 172 | index=cuda.to_cpu(predicted_word.data.argmax(1))[0] 173 | else: 174 | index=predicted_word.data.argmax(1)[0] 175 | 176 | probability=predicted_word.data[0][index] 177 | initial_sentence_candidates=[([index],state,probability)] 178 | 179 | final_sentences=list() 180 | generated_sentence_candidates=self.beam_search(initial_sentence_candidates,final_sentences,beamsize=self.beamsize) 181 | 182 | #convert to index to strings 183 | 184 | generated_string_sentence_candidates=[] 185 | for sentence_tuple in generated_sentence_candidates: 186 | sentence=[self.index2word[index] for index in sentence_tuple[0]][1:-1] 187 | probability=sentence_tuple[1] 188 | final_likely=sentence_tuple[2] 189 | 190 | a_candidate={'sentence':sentence,'probability':probability,'log_probability':final_likely} 191 | 192 | generated_string_sentence_candidates.append(a_candidate) 193 | 194 | 195 | return generated_string_sentence_candidates 196 | 197 | def generate_temp(self,numpy_image): 198 | 199 | '''Simple Generate Caption for an Numpy Image array 200 | 201 | Args: 202 | numpy_image: numpy image 203 | 204 | Returns: 205 | string of generated capiton 206 | ''' 207 | 208 | genrated_sentence_string='' 209 | x_batch = np.ndarray((1, 3, 224,224), dtype=np.float32) 210 | x_batch[0]=numpy_image 211 | 212 | volatile=True 213 | if self.gpu_id >=0: 214 | x_batch_chainer = Variable(cuda.to_gpu(x_batch),volatile=volatile) 215 | else: 216 | x_batch_chainer = Variable(x_batch,volatile=volatile) 217 | 218 | batchsize=1 219 | 220 | #image is chainer.variable. 221 | state = {name: chainer.Variable(xp.zeros((batchsize, self.n_units),dtype=np.float32),volatile) for name in ('c1', 'h1')} 222 | img_feature=self.feature_exractor(x_batch_chainer) 223 | #img_feature_chainer is chainer.variable of extarcted feature. 
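#greedy decoding: condition the LSTM on the image feature once, then repeatedly feed back the most likely word for up to 50 steps, stopping at the end-of-sentence token.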
224 | state = {name: chainer.Variable(xp.zeros((batchsize, self.n_units),dtype=np.float32),volatile) for name in ('c1', 'h1')} 225 | state, predicted_word = self.forward_one_step_for_image(img_feature,state, volatile=volatile) 226 | index=predicted_word.data.argmax(1) 227 | index=cuda.to_cpu(index)[0] 228 | #genrated_sentence_string+=index2word[index] #dont's add it because this is 229 | 230 | for i in xrange(50): 231 | state, predicted_word = self.forward_one_step(predicted_word.data.argmax(1).astype(np.int32),state, volatile=volatile) 232 | index=predicted_word.data.argmax(1) 233 | index=cuda.to_cpu(index)[0] 234 | if self.index2word[index]=='': 235 | genrated_sentence_string=genrated_sentence_string.strip() 236 | break; 237 | genrated_sentence_string+=self.index2word[index]+" " 238 | 239 | return genrated_sentence_string 240 | 241 | def get_top_sentence(self,numpy_image): 242 | ''' 243 | just get a top sentence as string 244 | 245 | Args: 246 | numpy_image: numpy image 247 | 248 | Returns: 249 | string of generated capiton 250 | ''' 251 | candidates=self.generate(numpy_image) 252 | scores=[caption['log_probability'] for caption in candidates] 253 | argmax=np.argmax(scores) 254 | top_caption=candidates[argmax]['sentence'] 255 | 256 | sentence = '' 257 | for word in top_caption: 258 | sentence+=word+' ' 259 | 260 | return sentence.strip() 261 | 262 | 263 | 264 | -------------------------------------------------------------------------------- /codes/evaluate_model.py: -------------------------------------------------------------------------------- 1 | #under construction. 2 | #I do not use this. 3 | 4 | 5 | file_place = '../data/MSCOCO/annotations/captions_val2014.json' 6 | val_captions,val_caption_id2tokens,val_caption_id2image_id = read_MSCOCO_json(file_place) 7 | 8 | #Validiation Set 9 | print "testing" 10 | num_val_data=len(val_caption_id2image_id) 11 | caption_ids_batches=[] 12 | for caption_length in val_captions.keys(): 13 | caption_ids_set=val_captions[caption_length] 14 | caption_ids=list(caption_ids_set) 15 | caption_ids_batches+=[caption_ids[x:x + batchsize] for x in xrange(0, len(caption_ids), batchsize)] 16 | 17 | sum_loss = 0 18 | file_base='../data/MSCOCO/val2014/COCO_val2014_' 19 | for i, caption_ids_batch in enumerate(caption_ids_batches): 20 | captions_batch=[val_caption_id2sentence[caption_id] for caption_id in caption_ids_batch] 21 | sentences=xp.array(captions_batch,dtype=np.int32) 22 | image_ids_batch=[val_caption_id2image_id[caption_id] for caption_id in caption_ids_batch] 23 | 24 | try: 25 | images=images_read(image_ids_batch,file_base,volatile=True) 26 | except Exception as e: 27 | print 'image reading error' 28 | print 'type:' + str(type(e)) 29 | print 'args:' + str(e.args) 30 | print 'message:' + e.message 31 | print image_ids_batch 32 | continue 33 | 34 | batchsize=normal_batchsize#becasue I am adusting batch size depending on sentence length, I need to rechange it. 35 | if len(caption_ids_batch) != batchsize: 36 | batchsize=len(caption_ids_batch) 37 | #last batch may be less than batchsize. 
Or depend on caption_length 38 | 39 | loss = forward(images,sentences,volatile=True) 40 | 41 | sum_loss += loss.data * batchsize 42 | 43 | mean_loss = sum_loss / num_val_data 44 | print mean_loss 45 | with open(savedir+"test_mean_loss.txt", "a") as f: 46 | f.write(str(mean_loss)+'\n') -------------------------------------------------------------------------------- /codes/generate_caption.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python 3 | #compatible chiner 1.5 4 | 5 | 6 | import os 7 | #comment out the below if you want to do type check. Remeber this have to be done BEFORE import chainer 8 | #os.environ["CHAINER_TYPE_CHECK"] = "0" 9 | import chainer 10 | #If the below is false, the type check is disabled. 11 | #print(chainer.functions.Linear(1,1).type_check_enable) 12 | 13 | import argparse 14 | import os 15 | import numpy as np 16 | from chainer import cuda 17 | import chainer.functions as F 18 | from chainer import cuda, Function, FunctionSet, gradient_check, Variable, optimizers 19 | #import matplotlib.pyplot as plt 20 | from chainer import serializers 21 | 22 | from scipy.misc import imread, imresize, imsave 23 | import json 24 | import random 25 | import pickle 26 | import math 27 | import skimage.transform 28 | 29 | #Settings can be changed by command line arguments 30 | gpu_id=-1# GPU ID. if you want to use cpu, -1 31 | model_place='../models/caption_model.chainer' 32 | caffe_model_place='../data/bvlc_googlenet_caffe_chainer.pkl' 33 | index2word_file = '../work/index2token.pkl' 34 | image_file_name='../images/test_image.jpg' 35 | 36 | 37 | 38 | #Override Settings by argument 39 | parser = argparse.ArgumentParser(description=u"caption generation") 40 | parser.add_argument("-g", "--gpu",default=gpu_id, type=int, help=u"GPU ID.CPU is -1") 41 | parser.add_argument("-m", "--model",default=model_place, type=str, help=u" caption generation model") 42 | parser.add_argument("-c", "--caffe",default=caffe_model_place, type=str, help=u" pre trained caffe model pickled after imported to chainer") 43 | parser.add_argument("-v", "--vocab",default=index2word_file, type=str, help=u" vocaburary file") 44 | parser.add_argument("-i", "--image",default=image_file_name, type=str, help=u"a image that you want to generate capiton ") 45 | 46 | args = parser.parse_args() 47 | gpu_id=args.gpu 48 | model_place= args.model 49 | index2word_file = args.vocab 50 | image_file_name = args.image 51 | caffe_model_place = args.caffe 52 | 53 | #Gpu Setting 54 | if gpu_id >= 0: 55 | xp = cuda.cupy 56 | cuda.get_device(gpu_id).use() 57 | else: 58 | xp=np 59 | 60 | #Basic Setting 61 | image_feature_dim=1024#dimension of image feature 62 | n_units = 512 #number of units per layer 63 | 64 | 65 | # Prepare dataset 66 | print "loading vocab" 67 | with open(index2word_file, 'r') as f: 68 | index2word = pickle.load(f) 69 | 70 | vocab=index2word 71 | 72 | 73 | #Load Caffe Model 74 | print "loading caffe models" 75 | with open(caffe_model_place, 'r') as f: 76 | func = pickle.load(f) 77 | 78 | if gpu_id>= 0: 79 | func.to_gpu() 80 | print "done" 81 | 82 | def feature_exractor(x_chainer_variable): #to extract image feature by CNN. 83 | y, = func(inputs={'data': x_chainer_variable}, outputs=['pool5/7x7_s1'], 84 | disable=['loss1/ave_pool', 'loss2/ave_pool','loss3/classifier'], 85 | train=False) 86 | return y 87 | 88 | #Read image from file into numpy. 
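#The function below resizes the image so its shortest side is 224, centre-crops it to 224x224, and returns a (3, 224, 224) float32 array in channel-first order.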
89 | #several codes are copied from here: https://github.com/ebenolson/Recipes/blob/master/examples/imagecaption/COCO%20Preprocessing.ipynb 90 | #see also https://groups.google.com/forum/#!toself.pic/lasagne-users/cCFVeT5rw-o 91 | MEAN_VALUES = np.array([104, 117, 123]).reshape((3,1,1)) 92 | def image_read_np(file_place): 93 | im = imread(file_place) 94 | if len(im.shape) == 2: 95 | im = im[:, :, np.newaxis] 96 | im = np.repeat(im, 3, axis=2) 97 | # Resize so smallest dim = 224, preserving aspect ratio 98 | h, w, _ = im.shape 99 | if h < w: 100 | im = skimage.transform.resize(im, (224, w*224/h), preserve_range=True) 101 | else: 102 | im = skimage.transform.resize(im, (h*224/w, 224), preserve_range=True) 103 | 104 | # Central crop to 224x224 105 | h, w, _ = im.shape 106 | im = im[h//2-112:h//2+112, w//2-112:w//2+112] 107 | 108 | rawim = np.copy(im).astype('uint8') 109 | 110 | # Shuffle axes to c01 111 | im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1) 112 | 113 | # Convert to BGR 114 | im = im[::-1, :, :] 115 | 116 | im = im - MEAN_VALUES 117 | return rawim.transpose(2, 0, 1).astype(np.float32) 118 | 119 | #Model Preparation 120 | print "preparing caption generation models" 121 | model = FunctionSet() 122 | model.img_feature2vec=F.Linear(image_feature_dim, n_units)#CNN(I)の最後のレイヤーに相当。#parameter W,b 123 | model.embed=F.EmbedID(len(vocab), n_units)#W_e*S_tに相当 #parameter W 124 | model.l1_x=F.Linear(n_units, 4 * n_units)#parameter W,b 125 | model.l1_h=F.Linear(n_units, 4 * n_units)#parameter W,b 126 | model.out=F.Linear(n_units, len(vocab))#parameter W,b 127 | 128 | serializers.load_hdf5(model_place, model) 129 | 130 | #To GPU 131 | if gpu_id >= 0: 132 | model.to_gpu() 133 | print "done" 134 | 135 | #Define Newtowork (Forward) 136 | 137 | #forward_one_step is after the CNN layer, 138 | #h0 is n_units dimensional vector (embedding) 139 | def forward_one_step(cur_word, state, volatile='on'): 140 | x = chainer.Variable(cur_word, volatile) 141 | h0 = model.embed(x) 142 | h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1']) 143 | c1, h1 = F.lstm(state['c1'], h1_in) 144 | y = model.out(F.dropout(h1,train=False)) 145 | state = {'c1': c1, 'h1': h1} 146 | return state, y 147 | 148 | def forward_one_step_for_image(img_feature, state, volatile='on'): 149 | x = img_feature#img_feature is chainer.variable. 150 | h0 = model.img_feature2vec(x) 151 | h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1']) 152 | c1, h1 = F.lstm(state['c1'], h1_in) 153 | y = model.out(F.dropout(h1,train=False))#don't forget to change drop out into non train mode. 154 | state = {'c1': c1, 'h1': h1} 155 | return state, y 156 | 157 | #to avoid overflow. 158 | #I don't know why, but this model overflows only at the first time. 159 | #So I intentionally make overflow so that it never happns after that. 
160 | if gpu_id < 0: 161 | x_batch = np.ones((1, 3, 224,224), dtype=np.float32) 162 | x_batch_chainer = Variable(x_batch) 163 | img_feature=feature_exractor(x_batch_chainer) 164 | state = {name: chainer.Variable(xp.zeros((1, n_units),dtype=np.float32)) for name in ('c1', 'h1')} 165 | state, predicted_word = forward_one_step_for_image(img_feature,state) 166 | 167 | def caption_generate(image_file_name): 168 | print('sentence generation started') 169 | 170 | genrated_sentence=[] 171 | volatile=True 172 | 173 | image=image_read_np(image_file_name) 174 | x_batch = np.ndarray((1, 3, 224,224), dtype=np.float32) 175 | x_batch[0]=image 176 | 177 | if gpu_id >=0: 178 | x_batch_chainer = Variable(cuda.to_gpu(x_batch),volatile=volatile) 179 | else: 180 | x_batch_chainer = Variable(x_batch,volatile=volatile) 181 | 182 | batchsize=1 183 | 184 | #image is chainer.variable. 185 | state = {name: chainer.Variable(xp.zeros((batchsize, n_units),dtype=np.float32),volatile) for name in ('c1', 'h1')} 186 | img_feature=feature_exractor(x_batch_chainer) 187 | state, predicted_word = forward_one_step_for_image(img_feature,state, volatile=volatile) 188 | genrated_sentence.append(predicted_word.data) 189 | 190 | for i in xrange(50): 191 | state, predicted_word = forward_one_step(predicted_word.data.argmax(1).astype(np.int32),state, volatile=volatile) 192 | genrated_sentence.append(predicted_word.data) 193 | 194 | print("---genrated_sentence--") 195 | 196 | for predicted_word in genrated_sentence: 197 | if gpu_id >=0: 198 | index=cuda.to_cpu(predicted_word.argmax(1))[0] 199 | else: 200 | index=predicted_word.argmax(1)[0] 201 | print index2word[index] 202 | if index2word[index]=='': 203 | xp.max(predicted_word) 204 | x_batch_chainer = Variable(predicted_word,volatile=volatile) 205 | print xp.max(F.softmax(x_batch_chainer).data) 206 | break 207 | 208 | caption_generate(image_file_name) -------------------------------------------------------------------------------- /codes/generate_caption_beam.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #!/usr/bin/env python 3 | 4 | import numpy as np 5 | import argparse 6 | from image_reader import Image_reader 7 | from caption_generator import Caption_generator 8 | 9 | #Settings can be changed by command line arguments 10 | gpu_id=-1# GPU ID. 
if you want to use cpu, -1 11 | model_place='../models/caption_model.chainer' 12 | caffe_model_place='../data/bvlc_googlenet_caffe_chainer.pkl' 13 | index2word_file = '../work/index2token.pkl' 14 | image_file_name='../images/test_image.jpg' 15 | beamsize=3 16 | 17 | #Override Settings by argument 18 | parser = argparse.ArgumentParser(description=u"caption generation") 19 | parser.add_argument("-g", "--gpu",default=gpu_id, type=int, help=u"GPU ID.CPU is -1") 20 | parser.add_argument("-m", "--model",default=model_place, type=str, help=u" caption generation model") 21 | parser.add_argument("-c", "--caffe",default=caffe_model_place, type=str, help=u" pre trained caffe model pickled after imported to chainer") 22 | parser.add_argument("-v", "--vocab",default=index2word_file, type=str, help=u" vocaburary file") 23 | parser.add_argument("-i", "--image",default=image_file_name, type=str, help=u"a image that you want to generate capiton ") 24 | parser.add_argument("-b", "--beam",default=beamsize, type=int, help=u"a image that you want to generate capiton ") 25 | 26 | args = parser.parse_args() 27 | gpu_id=args.gpu 28 | model_place= args.model 29 | index2word_file = args.vocab 30 | image_file_name = args.image 31 | caffe_model_place = args.caffe 32 | beamsize = args.beam 33 | 34 | 35 | #Instantiate image_reader with GoogleNet mean image 36 | mean_image = np.array([104, 117, 123]).reshape((3,1,1))#GoogleNet Mean 37 | image_reader=Image_reader(mean=mean_image) 38 | 39 | #Instantiate caption generator 40 | caption_generator=Caption_generator(caption_model_place=model_place,cnn_model_place=caffe_model_place,index2word_place=index2word_file,beamsize=beamsize,gpu_id=gpu_id) 41 | 42 | #Read Image 43 | image=image_reader.read(image_file_name) 44 | 45 | #Generate Catpion 46 | captions=caption_generator.generate(image) 47 | 48 | #print it 49 | for caption in captions: 50 | sentence=caption['sentence'] 51 | probability=caption['probability'] 52 | print " ".join(sentence),probability 53 | 54 | -------------------------------------------------------------------------------- /codes/image_reader.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' 5 | The class to read an image as numpy array. 6 | This is particurary designed for ImageNet related task. 
7 | So, whatever size the input image have, the output will be centor-croped image of 224*224 8 | Also, you can specify the mean image for CNNs like GoogleNet or VGG 9 | ''' 10 | 11 | import numpy as np 12 | from scipy.misc import imread, imresize 13 | import skimage.transform 14 | 15 | class Image_reader(object): 16 | def __init__(self,mean=np.zeros((3,1,1))): 17 | self.mean_image = mean 18 | 19 | #taken from https://github.com/ebenolson/Recipes/blob/master/examples/imagecaption/COCO%20Preprocessing.ipynb 20 | #see also https://groups.google.com/forum/#!toself.pic/lasagne-users/cCFVeT5rw-o 21 | def read(self,file_place): 22 | im = imread(file_place) 23 | if len(im.shape) == 2: 24 | im = im[:, :, np.newaxis] 25 | im = np.repeat(im, 3, axis=2) 26 | 27 | # Resize so smallest dim = 224, preserving aspect ratio 28 | h, w, _ = im.shape 29 | if h < w: 30 | im = skimage.transform.resize(im, (224, w*224/h), preserve_range=True) 31 | else: 32 | im = skimage.transform.resize(im, (h*224/w, 224), preserve_range=True) 33 | 34 | # Central crop to 224x224 35 | h, w, _ = im.shape 36 | im = im[h//2-112:h//2+112, w//2-112:w//2+112] 37 | 38 | rawim = np.copy(im).astype('uint8') 39 | 40 | # Shuffle axes to c01 41 | im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1) 42 | 43 | # Convert to BGR 44 | # We should know OpenCV's default is BGR instead of RGB 45 | im = im[::-1, :, :] 46 | 47 | im = im - self.mean_image 48 | return rawim.transpose(2, 0, 1).astype(np.float32) 49 | 50 | def crop_for_plot(self,file_place): 51 | im = imread(file_place) 52 | if len(im.shape) == 2: 53 | im = im[:, :, np.newaxis] 54 | im = np.repeat(im, 3, axis=2) 55 | # Resize so smallest dim = 224, preserving aspect ratio 56 | h, w, _ = im.shape 57 | if h < w: 58 | im = skimage.transform.resize(im, (224, w*224/h), preserve_range=True) 59 | else: 60 | im = skimage.transform.resize(im, (h*224/w, 224), preserve_range=True) 61 | 62 | # Central crop to 224x224 63 | h, w, _ = im.shape 64 | im = im[h//2-112:h//2+112, w//2-112:w//2+112] 65 | 66 | rawim = np.copy(im).astype('uint8') 67 | 68 | # Shuffle axes to c01 69 | im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1) 70 | 71 | # Convert to BGR 72 | im = im[::-1, :, :] 73 | 74 | im = im - MEAN_VALUES 75 | return rawim -------------------------------------------------------------------------------- /codes/pre_extract_googlenet_features.py: -------------------------------------------------------------------------------- 1 | ''' 2 | To extarct CNN features. 3 | 4 | This code could be messy. 5 | I did not assume others use this, but decided to make avaiable, 6 | because I saw many people who wants to use VGG insetad of GoogleNet. 7 | But remember that this is for GoogleNet. 8 | ''' 9 | #!/usr/bin/env python 10 | # -*- coding: utf-8 -*- 11 | 12 | 13 | # import os 14 | # os.environ["CHAINER_TYPE_CHECK"] = "0" #to disable type check 15 | import chainer 16 | 17 | import argparse 18 | import os 19 | import numpy as np 20 | from chainer import cuda 21 | import chainer.functions as F 22 | from chainer.functions import caffe 23 | from chainer import cuda, Function, FunctionSet, gradient_check, Variable, optimizers 24 | #import matplotlib.pyplot as plt 25 | from scipy.misc import imread, imresize, imsave 26 | import json 27 | import nltk 28 | import random 29 | import pickle 30 | import math 31 | import skimage.transform 32 | 33 | 34 | #Settings can be changed by command line arguments 35 | gpu_id=-1# GPU ID. 
if you want to use cpu, -1 36 | #gpu_id=0 37 | savedir='../work/img_features/'# name of log and results image saving directory 38 | image_feature_dim=1024#特徴の次元数。 39 | 40 | #Functions 41 | def get_image_ids(file_place): 42 | 43 | f = open(file_place, 'r') 44 | jsonData = json.load(f) 45 | f.close() 46 | 47 | image_id2feature={} 48 | for caption_data in jsonData['annotations']: 49 | image_id=caption_data['image_id'] 50 | image_id2feature[image_id]=np.array([image_feature_dim,]) 51 | 52 | return image_id2feature 53 | 54 | #Gpu Setting 55 | if gpu_id >= 0: 56 | xp = cuda.cupy 57 | cuda.get_device(gpu_id).use() 58 | else: 59 | xp=np 60 | 61 | #画像読み込み関数 62 | #ただ読むだけ 63 | MEAN_VALUES = np.array([104, 117, 123]).reshape((3,1,1)) 64 | def image_read_np(file_place): 65 | im = imread(file_place) 66 | if len(im.shape) == 2: 67 | im = im[:, :, np.newaxis] 68 | im = np.repeat(im, 3, axis=2) 69 | # Resize so smallest dim = 224, preserving aspect ratio 70 | h, w, _ = im.shape 71 | if h < w: 72 | im = skimage.transform.resize(im, (224, w*224/h), preserve_range=True) 73 | else: 74 | im = skimage.transform.resize(im, (h*224/w, 224), preserve_range=True) 75 | 76 | # Central crop to 224x224 77 | h, w, _ = im.shape 78 | im = im[h//2-112:h//2+112, w//2-112:w//2+112] 79 | 80 | rawim = np.copy(im).astype('uint8') 81 | 82 | # Shuffle axes to c01 83 | im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1) 84 | 85 | # Convert to BGR 86 | im = im[::-1, :, :] 87 | 88 | im = im - MEAN_VALUES 89 | return rawim.transpose(2, 0, 1).astype(np.float32) 90 | 91 | #main 92 | 93 | # Prepare dataset 94 | file_place = '../data/MSCOCO/annotations/captions_train2014.json' 95 | train_image_id2feature=get_image_ids(file_place) 96 | file_place = '../data/MSCOCO/annotations/captions_val2014.json' 97 | val_image_id2feature=get_image_ids(file_place) 98 | 99 | 100 | #Caffeモデルをロード 101 | print "loading caffe models" 102 | func = caffe.CaffeFunction('../data/bvlc_googlenet.caffemodel') 103 | if gpu_id>= 0: 104 | func.to_gpu() 105 | print "done" 106 | 107 | 108 | 109 | print 'feature_exractor' 110 | file_base='../data/MSCOCO/train2014/COCO_train2014_' 111 | for i, image_id in enumerate(train_image_id2feature.keys()): 112 | 113 | if i%5000==0: 114 | print i 115 | 116 | try: 117 | image=image_read_np(file_base+str("{0:012d}".format(image_id)+'.jpg')) 118 | except Exception as e: 119 | print 'image reading error' 120 | print 'type:' + str(type(e)) 121 | print 'args:' + str(e.args) 122 | print 'message:' + e.message 123 | print image_id 124 | continue 125 | 126 | x_batch = np.ndarray((1, 3, 224,224), dtype=np.float32) 127 | x_batch[0]=image 128 | if gpu_id >=0: 129 | x = Variable(cuda.to_gpu(x_batch), volatile=True) 130 | else: 131 | x = Variable(x_batch, volatile=True) 132 | image_feature_chainer, = func(inputs={'data': x}, outputs=['pool5/7x7_s1'], 133 | disable=['loss1/ave_pool', 'loss2/ave_pool','loss3/classifier'], 134 | train=False) 135 | 136 | image_feature_np=image_feature_chainer.data.reshape(1024) 137 | train_image_id2feature[image_id]=cuda.to_cpu(image_feature_np) 138 | 139 | 140 | pickle.dump(train_image_id2feature, open(savedir+"train_image_id2feature.pkl", 'wb'), -1) 141 | 142 | print "for test" 143 | file_base='../data/MSCOCO/val2014/COCO_val2014_' 144 | for i, image_id in enumerate(val_image_id2feature.keys()): 145 | 146 | if i%5000==0: 147 | print i 148 | 149 | try: 150 | image=image_read_np(file_base+str("{0:012d}".format(image_id)+'.jpg')) 151 | except Exception as e: 152 | print 'image reading error' 153 | print 'type:' + str(type(e)) 
154 | print 'args:' + str(e.args) 155 | print 'message:' + e.message 156 | print image_id 157 | continue 158 | 159 | x_batch = np.ndarray((1, 3, 224,224), dtype=np.float32) 160 | x_batch[0]=image 161 | if gpu_id >=0: 162 | x = Variable(cuda.to_gpu(x_batch), volatile=True) 163 | else: 164 | x = Variable(x_batch, volatile=True) 165 | image_feature_chainer, = func(inputs={'data': x}, outputs=['pool5/7x7_s1'], 166 | disable=['loss1/ave_pool', 'loss2/ave_pool','loss3/classifier'], 167 | train=False) 168 | 169 | image_feature_np=image_feature_chainer.data.reshape(1024) 170 | val_image_id2feature[image_id]=cuda.to_cpu(image_feature_np) 171 | 172 | pickle.dump(val_image_id2feature, open(savedir+"val_image_id2feature.pkl", 'wb'), -1) -------------------------------------------------------------------------------- /codes/prepocess_captions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | This program preprocesses the caption into picke file. 6 | Main purpose is to tokenize, make lower case, and filter out low frequent vocaburaries. 7 | Note tokenize and make lower case is done by a function (read_MSCOCO_json) in another file MSCOCO.py. 8 | """ 9 | 10 | from MSCOCO import read_MSCOCO_json #to read MSCOCO json file. 11 | from gensim import corpora 12 | import pickle 13 | 14 | file_place = '../data/MSCOCO/annotations/captions_train2014.json' 15 | train_captions,train_caption_id2tokens,train_caption_id2image_id = read_MSCOCO_json(file_place) 16 | 17 | texts=train_caption_id2tokens.values() 18 | dictionary = corpora.Dictionary(texts) 19 | dictionary.filter_extremes(no_below=5, no_above=1.0) 20 | dictionary.compactify() # remove gaps in id sequence after words that were removed 21 | index2token = dict((v, k) for k, v in dictionary.token2id.iteritems()) 22 | ukn_id=len(dictionary.token2id) 23 | index2token[ukn_id]='' 24 | 25 | #just save the map from index to token (word) 26 | #that means this is vocaburary file 27 | with open('../work/index2token.pkl', 'w') as f: 28 | pickle.dump(index2token,f) 29 | 30 | 31 | train_caption_id2sentence={} 32 | for (caption_id,tokens) in train_caption_id2tokens.iteritems(): 33 | sentence=[] 34 | for token in tokens: 35 | if token in dictionary.token2id: 36 | sentence.append(dictionary.token2id[token]) 37 | else: 38 | sentence.append(ukn_id) 39 | 40 | train_caption_id2sentence[caption_id]=sentence 41 | 42 | 43 | #Save preprocessed captions. 
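#The pickle stores a tuple of three dicts: captions grouped by sentence length (length -> set of caption ids), caption_id -> list of word indices, and caption_id -> image_id.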
44 | with open('../work/preprocessed_train_captions.pkl', 'w') as f: 45 | pickle.dump((train_captions,train_caption_id2sentence,train_caption_id2image_id),f) -------------------------------------------------------------------------------- /codes/sample_code.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | ''' 5 | Sample code to generate caption 6 | ''' 7 | import numpy as np 8 | from image_reader import Image_reader 9 | from caption_generator import Caption_generator 10 | 11 | #Instantiate image_reader with GoogleNet mean image 12 | mean_image = np.array([104, 117, 123]).reshape((3,1,1)) 13 | image_reader=Image_reader(mean=mean_image) 14 | 15 | #Instantiate caption generator 16 | caption_model_place='../models/caption_model.chainer' 17 | cnn_model_place='../data/bvlc_googlenet_caffe_chainer.pkl' 18 | index2word_place='../work/index2token.pkl' 19 | caption_generator=Caption_generator(caption_model_place=caption_model_place,cnn_model_place=cnn_model_place,index2word_place=index2word_place) 20 | 21 | 22 | #The preparation is done 23 | #Let's ganarate caption for a image 24 | 25 | #First, read an image as numpy array 26 | image_file_path='../images/test_image.jpg' 27 | image=image_reader.read(image_file_path) 28 | 29 | 30 | #Next, put the image into caption generator 31 | #The output structure is 32 | # [caption,caption,caption,...] 33 | # caption = {"sentence":This is a generated sentence, "probability": The probability of the generated sentence} 34 | captions=caption_generator.generate(image) 35 | 36 | #For example, if you want to print all captions 37 | for caption in captions: 38 | sentence=caption['sentence'] 39 | probability=caption['probability'] 40 | print " ".join(sentence),probability 41 | 42 | #Let's do for another image 43 | image_file_path='../images/COCO_val2014_000000241747.jpg' 44 | image=image_reader.read(image_file_path) 45 | captions=caption_generator.generate(image) 46 | for caption in captions: 47 | sentence=caption['sentence'] 48 | probability=caption['probability'] 49 | print " ".join(sentence),probability 50 | -------------------------------------------------------------------------------- /codes/train_caption_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | 5 | # import os 6 | #os.environ["CHAINER_TYPE_CHECK"] = "0" #to disable type check. 7 | import chainer 8 | #Check che below is False if you disabled type check 9 | #print(chainer.functions.Linear(1,1).type_check_enable) 10 | 11 | import argparse 12 | import numpy as np 13 | import chainer.functions as F 14 | from chainer import cuda 15 | from chainer import Function, FunctionSet, Variable, optimizers, serializers 16 | import pickle 17 | import random 18 | 19 | #Settings can be changed by command line arguments 20 | gpu_id=-1# GPU ID. 
if you want to use cpu, -1 21 | #gpu_id=4 22 | savedir='../experiment1/'# name of log and results image saving directory 23 | 24 | #Override Settings by argument 25 | parser = argparse.ArgumentParser(description=u"caption generation") 26 | parser.add_argument("-g", "--gpu",default=gpu_id, type=int, help=u"GPU ID.CPU is -1") 27 | parser.add_argument("-d", "--savedir",default=savedir, type=str, help=u"The directory to save models and log") 28 | args = parser.parse_args() 29 | gpu_id=args.gpu 30 | savedir=args.savedir 31 | 32 | #Gpu Setting 33 | if gpu_id >= 0: 34 | xp = cuda.cupy 35 | cuda.get_device(gpu_id).use() 36 | else: 37 | xp=np 38 | 39 | #Prepare Data 40 | print("loading preprocessed data") 41 | 42 | with open('../work/index2token.pkl', 'r') as f: 43 | index2token = pickle.load(f) 44 | 45 | with open('../work/preprocessed_train_captions.pkl', 'r') as f: 46 | train_captions,train_caption_id2sentence,train_caption_id2image_id = pickle.load(f) 47 | 48 | with open('../work/img_features/train_image_id2feature.pkl', 'r') as f: 49 | train_image_id2feature = pickle.load(f) 50 | 51 | #Model Preparation 52 | print "preparing caption generation models" 53 | image_feature_dim=1024#特徴の次元数。 54 | n_units = 512 # number of units per layer 55 | vocab_size=len(index2token) 56 | 57 | model = chainer.FunctionSet() 58 | model.img_feature2vec=F.Linear(image_feature_dim, n_units)#CNN(I)の最後のレイヤーに相当。#parameter W,b 59 | model.embed=F.EmbedID(vocab_size, n_units)#W_e*S_tに相当 #parameter W 60 | model.l1_x=F.Linear(n_units, 4 * n_units)#parameter W,b 61 | model.l1_h=F.Linear(n_units, 4 * n_units)#parameter W,b 62 | model.out=F.Linear(n_units, vocab_size)#parameter W,b 63 | 64 | #Parameter Initialization 65 | #Mimicked Chainer Samples 66 | for param in model.params(): 67 | data = param.data 68 | data[:] = np.random.uniform(-0.1, 0.1, data.shape) 69 | 70 | #set forget bias 1 71 | model.l1_x.b.data[2*n_units:3*n_units]=np.ones(model.l1_x.b.data[2*n_units:3*n_units].shape).astype(xp.float32) 72 | model.l1_h.b.data[2*n_units:3*n_units]=np.ones(model.l1_h.b.data[2*n_units:3*n_units].shape).astype(xp.float32) 73 | 74 | #To GPU 75 | if gpu_id >= 0: 76 | model.to_gpu() 77 | 78 | 79 | #Define Newtowork (Forward) 80 | 81 | #forward_one_stepは画像の話は無視。それはforwardの一回目で特別にやる。 82 | #h0はn_units次元のベクトル(embedding) 83 | #cur_wordはその時の単語のone-hot-vector 84 | #next_wordはそこで出力すべきone-hot-vector(つまり次のー単語) 85 | 86 | 87 | def forward_one_step(cur_word, next_word, state, volatile=False): 88 | x = chainer.Variable(cur_word, volatile) 89 | t = chainer.Variable(next_word, volatile) 90 | h0 = model.embed(x) 91 | h1_in = model.l1_x(F.dropout(h0)) + model.l1_h(state['h1']) 92 | c1, h1 = F.lstm(state['c1'], h1_in) 93 | y = model.out(F.dropout(h1)) 94 | state = {'c1': c1, 'h1': h1} 95 | loss = F.softmax_cross_entropy(y, t) 96 | return state, loss 97 | 98 | def forward_one_step_for_image(img_feature, first_word, state, volatile=False): 99 | print img_feature.shape 100 | x = chainer.Variable(img_feature) 101 | t = chainer.Variable(first_word, volatile) 102 | h0 = model.img_feature2vec(x) 103 | h1_in = model.l1_x(F.dropout(h0)) + model.l1_h(state['h1']) 104 | c1, h1 = F.lstm(state['c1'], h1_in) 105 | y = model.out(F.dropout(h1)) 106 | state = {'c1': c1, 'h1': h1} 107 | loss = F.softmax_cross_entropy(y, t) 108 | return state, loss 109 | 110 | #imageは画像 111 | #x_listはある画像(image)に対応する文章(単語の集まり+EOS) 112 | #つまりx_list=[word1,word2,....,EOS] 113 | def forward(img_feature,sentences, volatile=False): 114 | #imageはすでにchinaer variableである。 115 | state = {name: 
chainer.Variable(xp.zeros((batchsize, n_units),dtype=xp.float32),volatile) for name in ('c1', 'h1')} 116 | loss = 0 117 | 118 | first_word=sentences.T[0] 119 | #[[w11,w12,...],[w21,w22...]]から[w11,w21]と最初の単語たちを取り出す. 120 | #バッチサイズの数だけ文があって、それぞれの最初の単語だけを取ってきた、一次元の配列を作るということ。 121 | 122 | state, new_loss = forward_one_step_for_image(img_feature, first_word,state, volatile=volatile) 123 | loss += new_loss 124 | 125 | #cur_wordに今の単語のnp.array(1次元) 126 | #next_wordに次の単語のnp.array(1次元) 127 | for cur_word, next_word in zip(sentences.T, sentences.T[1:]): 128 | state, new_loss = forward_one_step(cur_word, next_word,state, volatile=volatile) 129 | loss += new_loss 130 | return loss 131 | 132 | optimizer = optimizers.Adam() 133 | optimizer.setup(model) 134 | 135 | #Trining Setting 136 | normal_batchsize=256 137 | grad_clip = 1.0 138 | num_train_data=len(train_caption_id2image_id) 139 | 140 | #Begin Training 141 | print 'training started' 142 | for epoch in xrange(200): 143 | 144 | print 'epoch %d' %epoch 145 | 146 | batchsize=normal_batchsize 147 | caption_ids_batches=[] 148 | for caption_length in train_captions.keys(): 149 | caption_ids_set=train_captions[caption_length] 150 | caption_ids=list(caption_ids_set) 151 | random.shuffle(caption_ids) 152 | caption_ids_batches+=[caption_ids[x:x + batchsize] for x in xrange(0, len(caption_ids), batchsize)] 153 | random.shuffle(caption_ids_batches) 154 | 155 | # training_bacthes={} 156 | # for i, caption_ids_batch in enumerate(caption_ids_batches): 157 | # images = xp.array([train_image_id2feature[train_caption_id2image_id[caption_id]] for caption_id in caption_ids_batch],dtype=xp.float32) 158 | # sentences = xp.array([train_caption_id2sentence[caption_id] for caption_id in caption_ids_batch],dtype=xp.int32) 159 | # training_bacthes[i]= (images,sentences) 160 | 161 | #This is equivalent for above and hard to read, but I inteitionally did for faster calculation 162 | training_bacthes = \ 163 | { i:\ 164 | (\ 165 | xp.array([train_image_id2feature[train_caption_id2image_id[caption_id]] for caption_id in caption_ids_batch],dtype=xp.float32),\ 166 | xp.array([train_caption_id2sentence[caption_id] for caption_id in caption_ids_batch],dtype=xp.int32)\ 167 | )\ 168 | for i, caption_ids_batch in enumerate(caption_ids_batches)\ 169 | } 170 | 171 | sum_loss = 0 172 | for i, batch in training_bacthes.iteritems(): 173 | images=batch[0] 174 | sentences=batch[1] 175 | 176 | sentence_length=len(sentences[0]) 177 | batchsize=normal_batchsize#reverse batchsize if it is changed due to sentence length. 178 | if len(images) != batchsize: 179 | batchsize=len(images) 180 | #last batch may be less than batchsize. 
Or depend on caption_length 181 | 182 | optimizer.zero_grads() 183 | loss = forward(images,sentences) 184 | print loss.data 185 | with open(savedir+"real_loss.txt", "a") as f: 186 | f.write(str(loss.data)+'\n') 187 | with open(savedir+"real_loss_per_word.txt", "a") as f: 188 | f.write(str(loss.data/sentence_length)+'\n') 189 | 190 | loss.backward() 191 | #optimizer.clip_grads(grad_clip) 192 | optimizer.update() 193 | 194 | sum_loss += loss.data * batchsize 195 | 196 | serializers.save_hdf5(savedir+"/caption_model"+str(epoch)+'.chainer', model) 197 | serializers.save_hdf5(savedir+"/optimizer"+str(epoch)+'.chainer', optimizer) 198 | 199 | mean_loss = sum_loss / num_train_data 200 | with open(savedir+"mean_loss.txt", "a") as f: 201 | f.write(str(loss.data)+'\n') 202 | 203 | -------------------------------------------------------------------------------- /data/.gitignore: -------------------------------------------------------------------------------- 1 | #gtignore 以外のファイルを全部無視する。 2 | * 3 | !.gitignore 4 | -------------------------------------------------------------------------------- /download.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | cd data 3 | if [ ! -f bvlc_googlenet_caffe_chainer.pkl ]; then 4 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/data/bvlc_googlenet_caffe_chainer.pkl 5 | fi 6 | cd .. 7 | cd work 8 | if [ ! -f index2token.pkl ]; then 9 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/work/index2token.pkl 10 | fi 11 | if [ ! -f preprocessed_train_captions.pkl ]; then 12 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/work/preprocessed_train_captions.pkl 13 | fi 14 | if [ ! -d img_features ]; then 15 | mkdir img_features 16 | fi 17 | cd img_features 18 | if [ ! -f train_image_id2feature.pkl ]; then 19 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/work/img_features/train_image_id2feature.pkl 20 | fi 21 | if [ ! -f val_image_id2feature.pkl ]; then 22 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/work/img_features/val_image_id2feature.pkl 23 | fi 24 | cd ../../ 25 | cd models 26 | if [ ! -f caption_model.chainer ]; then 27 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/models/caption_model.chainer 28 | fi -------------------------------------------------------------------------------- /download_jp.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | cd work 3 | if [ ! -f index2token_jp.pkl ]; then 4 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/work/index2token_jp.pkl 5 | fi 6 | cd .. 7 | cd models 8 | if [ ! -f caption_model_jp.chainer ]; then 9 | wget https://googledrive.com/host/0B046sNk0DhCDeEczcm1vaWlCTFk/models/caption_model_jp.chainer 10 | fi 11 | -------------------------------------------------------------------------------- /evalutation_script/README.md: -------------------------------------------------------------------------------- 1 | # Evaluation Script for MSCOCO 2 | This code is based on the the follwoing repository. 3 | https://github.com/tylin/coco-caption 4 | To use the scripts here, please copy the three folders and thier contents to this place. 5 | annotations 6 | pycocoevalcap 7 | pycocotools 8 | 9 | 10 | ## How to do evaluation? 11 | Prepare the directory that contains several json files for evaluation. 
12 | The json file should be: 13 | [{"image_id": 404464, "caption": "black and white photo of a man standing in front of a building"}, {"image_id": 380932, "caption": "group of people are on the side of a snowy field"},...] 14 | Then, it will save json file into results folder by the file name. 15 | -------------------------------------------------------------------------------- /evalutation_script/evalutate_caption_val.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This is a script to evaluate generated captions for validiation files. 3 | Most of the script are from https://github.com/tylin/coco-caption 4 | ''' 5 | 6 | # -*- coding: utf-8 -*- 7 | #!/usr/bin/env python 8 | #compatible chiner 1.5 9 | 10 | from pycocotools.coco import COCO 11 | from pycocoevalcap.eval import COCOEvalCap 12 | import matplotlib.pyplot as plt 13 | import skimage.io as io 14 | import pylab 15 | pylab.rcParams['figure.figsize'] = (10.0, 8.0) 16 | 17 | import json 18 | from json import encoder 19 | encoder.FLOAT_REPR = lambda o: format(o, '.3f') 20 | 21 | model_dir='../experiment1' 22 | 23 | annFile='./annotations/captions_val2014.json' 24 | 25 | # create coco object and cocoRes object 26 | coco = COCO(annFile) 27 | 28 | all_results_json=[] 29 | 30 | for i in xrange(50): 31 | resFile=model_dir+'/caption_model%d.json'%i 32 | print resFile 33 | 34 | 35 | cocoRes = coco.loadRes(resFile) 36 | # create cocoEval object by taking coco and cocoRes 37 | cocoEval = COCOEvalCap(coco, cocoRes) 38 | 39 | # evaluate on a subset of images by setting 40 | # cocoEval.params['image_id'] = cocoRes.getImgIds() 41 | # please remove this line when evaluating the full validation set 42 | #cocoEval.params['image_id'] = cocoRes.getImgIds() 43 | 44 | #evaluate results 45 | cocoEval.evaluate() 46 | 47 | # print output evaluation scores 48 | results={} 49 | for metric, score in cocoEval.eval.items(): 50 | results[metric]=score 51 | all_results_json.append(results) 52 | 53 | with open(model_dir+'/evaluation_val.json', 'w') as f: 54 | json.dump(all_results_json, f, sort_keys=True, indent=4) 55 | -------------------------------------------------------------------------------- /evalutation_script/generate_caption_val.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This is a script to generate captions for validiation files. 3 | ''' 4 | 5 | # -*- coding: utf-8 -*- 6 | #!/usr/bin/env python 7 | #compatible chiner 1.5 8 | 9 | 10 | import os 11 | os.environ["CHAINER_TYPE_CHECK"] = "0" #to disable type check. 12 | import chainer 13 | #Check che below is False if you disabled type check 14 | #print(chainer.functions.Linear(1,1).type_check_enable) 15 | 16 | import argparse 17 | import numpy as np 18 | import chainer.functions as F 19 | from chainer import cuda 20 | from chainer import Function, FunctionSet, Variable, optimizers, serializers 21 | import pickle 22 | 23 | import glob 24 | import os 25 | import json 26 | 27 | #Settings can be changed by command line arguments 28 | gpu_id=0# GPU ID. 
if you want to use cpu, -1 29 | model_dir='../experiment1' 30 | 31 | #Override Settings by argument 32 | parser = argparse.ArgumentParser(description=u"caption generation") 33 | parser.add_argument("-g", "--gpu",default=gpu_id, type=int, help=u"GPU ID.CPU is -1") 34 | parser.add_argument("-m", "--modeldir",default=model_dir, type=str, help=u"The directory that have models") 35 | args = parser.parse_args() 36 | gpu_id=args.gpu 37 | model_dir= args.modeldir 38 | 39 | 40 | print('pareparing evaluation') 41 | 42 | 43 | with open('../work/img_features/val_image_id2feature.pkl', 'r') as f: 44 | val_image_id2feature = pickle.load(f) 45 | 46 | #Gpu Setting 47 | if gpu_id >= 0: 48 | xp = cuda.cupy 49 | cuda.get_device(gpu_id).use() 50 | else: 51 | xp=np 52 | 53 | #Basic Setting 54 | image_feature_dim=1024#dimension of image feature 55 | n_units = 512 #number of units per layer 56 | batchsize=1#has to be 1 currently because of implementation. 57 | volatile=False 58 | 59 | 60 | # Prepare dataset 61 | print "loading vocab" 62 | with open('../work/index2token.pkl', 'r') as f: 63 | index2word = pickle.load(f) 64 | 65 | vocab=index2word 66 | 67 | #Model Preparation 68 | print "preparing caption generation models" 69 | model = FunctionSet() 70 | model.img_feature2vec=F.Linear(image_feature_dim, n_units)#CNN(I)の最後のレイヤーに相当。#parameter W,b 71 | model.embed=F.EmbedID(len(vocab), n_units)#W_e*S_tに相当 #parameter W 72 | model.l1_x=F.Linear(n_units, 4 * n_units)#parameter W,b 73 | model.l1_h=F.Linear(n_units, 4 * n_units)#parameter W,b 74 | model.out=F.Linear(n_units, len(vocab))#parameter W,b 75 | 76 | #To GPU 77 | if gpu_id >= 0: 78 | model.to_gpu() 79 | print "done" 80 | 81 | for (image_id,feature) in val_image_id2feature.iteritems(): 82 | x_batch = np.ndarray((1,image_feature_dim), dtype=np.float32) 83 | x_batch[0]=feature 84 | if gpu_id >= 0: 85 | x_batch=cuda.to_gpu(x_batch) 86 | x_batch_chainer = Variable(x_batch,volatile=volatile) 87 | val_image_id2feature[image_id]=x_batch_chainer 88 | 89 | #Define Newtowork (Forward) 90 | 91 | #forward_one_step is after the CNN layer, 92 | #h0 is n_units dimensional vector (embedding) 93 | def forward_one_step(cur_word, state, volatile=True): 94 | x = chainer.Variable(cur_word, volatile) 95 | h0 = model.embed(x) 96 | h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1']) 97 | c1, h1 = F.lstm(state['c1'], h1_in) 98 | y = model.out(F.dropout(h1,train=False)) 99 | state = {'c1': c1, 'h1': h1} 100 | return state, y 101 | 102 | def forward_one_step_for_image(img_feature, state, volatile=True): 103 | x = img_feature#img_feature is chainer.variable. 104 | h0 = model.img_feature2vec(x) 105 | h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1']) 106 | c1, h1 = F.lstm(state['c1'], h1_in) 107 | y = model.out(F.dropout(h1,train=False))#don't forget to change drop out into non train mode. 108 | state = {'c1': c1, 'h1': h1} 109 | return state, y 110 | 111 | print('evaluation started') 112 | 113 | for model_place in glob.glob(os.path.join(model_dir, 'caption_model*.chainer')): 114 | print model_place 115 | 116 | serializers.load_hdf5(model_place, model)#load model 117 | 118 | results_list=[] 119 | 120 | for image_id in val_image_id2feature: 121 | 122 | img_feature_chainer=val_image_id2feature[image_id] 123 | 124 | genrated_sentence_string='' 125 | 126 | #img_feature_chainer is chainer.variable of extarcted feature. 
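#Greedy decoding for each validation image: reset the LSTM state, condition it on the pre-extracted image feature, then feed back the argmax word for up to 50 steps to build the caption string.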
--------------------------------------------------------------------------------
/evalutation_script/generate_caption_val.py:
--------------------------------------------------------------------------------
'''
This is a script to generate captions for validation files.
'''

# -*- coding: utf-8 -*-
#!/usr/bin/env python
#compatible with chainer 1.5


import os
os.environ["CHAINER_TYPE_CHECK"] = "0" #to disable type check.
import chainer
#Check that the line below prints False if you disabled type check
#print(chainer.functions.Linear(1,1).type_check_enable)

import argparse
import numpy as np
import chainer.functions as F
from chainer import cuda
from chainer import Function, FunctionSet, Variable, optimizers, serializers
import pickle

import glob
import os
import json

#Settings can be changed by command line arguments
gpu_id=0# GPU ID. Use -1 for CPU.
model_dir='../experiment1'

#Override Settings by argument
parser = argparse.ArgumentParser(description=u"caption generation")
parser.add_argument("-g", "--gpu",default=gpu_id, type=int, help=u"GPU ID. CPU is -1")
parser.add_argument("-m", "--modeldir",default=model_dir, type=str, help=u"The directory that has the models")
args = parser.parse_args()
gpu_id=args.gpu
model_dir= args.modeldir


print('preparing evaluation')


with open('../work/img_features/val_image_id2feature.pkl', 'r') as f:
    val_image_id2feature = pickle.load(f)

#GPU Setting
if gpu_id >= 0:
    xp = cuda.cupy
    cuda.get_device(gpu_id).use()
else:
    xp=np

#Basic Setting
image_feature_dim=1024#dimension of image feature
n_units = 512 #number of units per layer
batchsize=1#has to be 1 currently because of implementation.
volatile=False


# Prepare dataset
print "loading vocab"
with open('../work/index2token.pkl', 'r') as f:
    index2word = pickle.load(f)

vocab=index2word

#Model Preparation
print "preparing caption generation models"
model = FunctionSet()
model.img_feature2vec=F.Linear(image_feature_dim, n_units)#corresponds to the last layer of CNN(I) #parameter W,b
model.embed=F.EmbedID(len(vocab), n_units)#corresponds to W_e*S_t #parameter W
model.l1_x=F.Linear(n_units, 4 * n_units)#parameter W,b
model.l1_h=F.Linear(n_units, 4 * n_units)#parameter W,b
model.out=F.Linear(n_units, len(vocab))#parameter W,b

#To GPU
if gpu_id >= 0:
    model.to_gpu()
print "done"

for (image_id,feature) in val_image_id2feature.iteritems():
    x_batch = np.ndarray((1,image_feature_dim), dtype=np.float32)
    x_batch[0]=feature
    if gpu_id >= 0:
        x_batch=cuda.to_gpu(x_batch)
    x_batch_chainer = Variable(x_batch,volatile=volatile)
    val_image_id2feature[image_id]=x_batch_chainer

#Define Network (Forward)

#forward_one_step is after the CNN layer,
#h0 is n_units dimensional vector (embedding)
def forward_one_step(cur_word, state, volatile=True):
    x = chainer.Variable(cur_word, volatile)
    h0 = model.embed(x)
    h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1'])
    c1, h1 = F.lstm(state['c1'], h1_in)
    y = model.out(F.dropout(h1,train=False))
    state = {'c1': c1, 'h1': h1}
    return state, y

def forward_one_step_for_image(img_feature, state, volatile=True):
    x = img_feature#img_feature is a chainer.Variable.
    h0 = model.img_feature2vec(x)
    h1_in = model.l1_x(F.dropout(h0,train=False)) + model.l1_h(state['h1'])
    c1, h1 = F.lstm(state['c1'], h1_in)
    y = model.out(F.dropout(h1,train=False))#don't forget to change dropout into non-train mode.
    state = {'c1': c1, 'h1': h1}
    return state, y

print('evaluation started')

for model_place in glob.glob(os.path.join(model_dir, 'caption_model*.chainer')):
    print model_place

    serializers.load_hdf5(model_place, model)#load model

    results_list=[]

    for image_id in val_image_id2feature:

        img_feature_chainer=val_image_id2feature[image_id]

        generated_sentence_string=''

        #img_feature_chainer is a chainer.Variable of the extracted feature.
        state = {name: chainer.Variable(xp.zeros((batchsize, n_units),dtype=np.float32),volatile) for name in ('c1', 'h1')}
        state, predicted_word = forward_one_step_for_image(img_feature_chainer,state, volatile=volatile)
        index=predicted_word.data.argmax(1)
        index=cuda.to_cpu(index)[0]
        #generated_sentence_string+=index2word[index] #don't add it, because this is not an actual word of the caption

        for i in xrange(50):
            state, predicted_word = forward_one_step(predicted_word.data.argmax(1).astype(np.int32),state, volatile=volatile)
            index=predicted_word.data.argmax(1)
            index=cuda.to_cpu(index)[0]
            if index2word[index]=='':
                generated_sentence_string=generated_sentence_string.strip()
                break
            generated_sentence_string+=index2word[index]+" "

        line={}
        line['image_id']=image_id
        line['caption']=generated_sentence_string
        results_list.append(line)

    name, ext = os.path.splitext(model_place)
    with open(name+'.json', 'w') as f:
        json.dump(results_list, f, sort_keys=True, indent=4)
--------------------------------------------------------------------------------
/experiment1/.gitignore:
--------------------------------------------------------------------------------
#Ignore everything except .gitignore.
*
!.gitignore
--------------------------------------------------------------------------------
/images/COCO_val2014_000000185546.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000185546.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000192091.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000192091.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000229948.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000229948.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000241747.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000241747.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000250790.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000250790.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000277533.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000277533.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000285505.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000285505.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000323758.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000323758.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000326128.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000326128.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000397427.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000397427.jpg
--------------------------------------------------------------------------------
/images/COCO_val2014_000000553761.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/COCO_val2014_000000553761.jpg
--------------------------------------------------------------------------------
/images/test_image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apple2373/chainer_caption_generation/ee3a504beec5c0a9a84662c883d68375bc41b2d8/images/test_image.jpg
--------------------------------------------------------------------------------
/models/.gitignore:
--------------------------------------------------------------------------------
#Ignore everything except .gitignore.
*
!.gitignore
--------------------------------------------------------------------------------
/work/.gitignore:
--------------------------------------------------------------------------------
#Ignore everything except .gitignore.
*
!.gitignore
--------------------------------------------------------------------------------