├── README.md
├── coco_features
│   ├── README.md
│   └── coco_vgg_IDMap.txt
├── data.txt
├── data
│   ├── train_qa
│   └── val_qa
├── embedding.py
├── embeddings
│   └── README.md
├── examples
│   ├── COCO_val2014_000000000073.jpg
│   ├── COCO_val2014_000000000136.jpg
│   ├── COCO_val2014_000000000196.jpg
│   ├── COCO_val2014_000000000283.jpg
│   ├── COCO_val2014_000000000357.jpg
│   ├── model1.png
│   └── model2.png
├── models.py
├── prepare_data.py
├── question_answer.py
├── test.py
├── train.py
└── weights
    └── README.md

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Convolutional Neural Network - Image Question Answering

This is a Python and Keras implementation of the VIS+LSTM visual question answering model explained in the paper [Exploring Models and Data for Image Question Answering](https://arxiv.org/abs/1505.02074). A second model is also implemented; it is similar to the 2-VIS+BLSTM model from the paper mentioned above, except that its LSTMs are not bidirectional. This model has two image feature inputs, at the start and the end of the question, each with its own learned linear transformation. We call it 2-VIS+LSTM.

Details about the dataset are explained at the [VisualQA website](http://www.visualqa.org/).

Here is a summary of the performance we obtained with both models.

| Model      | Epochs | Batch Size | Validation Accuracy |
|------------|--------|------------|---------------------|
| VIS+LSTM   | 10     | 200        | 53.27%              |
| 2-VIS+LSTM | 10     | 200        | 54.01%              |

## Requirements

* Python 2.7
* NumPy
* SciPy (for loading the precomputed MS COCO features)
* NLTK (for the tokenizer)
* Keras
* Theano

## Training

* The basic usage is `python train.py`.

* The model to train can be specified using the option `-model`. For example, to train the VIS+LSTM model, enter `python train.py -model=1`. Similarly, the 2-VIS+LSTM model can be trained using `python train.py -model=2`. If no model is specified, model 1 is selected.

* The batch size and the number of epochs can also be specified using the options `-batch_size` and `-num_epochs`. The default batch size and number of epochs are 200 and 25 respectively.

* To train 2-VIS+LSTM with a batch size of 100 for 10 epochs, we would use: `python train.py -model=2 -batch_size=100 -num_epochs=10`.

## Models

### VIS+LSTM

![VIS+LSTM](examples/model1.png)

### 2-VIS+LSTM

![2-VIS+LSTM](examples/model2.png)

## Prediction

* Q&A can be performed on any image using the script `question_answer.py`.

* The options `-question` and `-image` specify the question and the path of the image respectively. The model to use for the prediction can be specified using `-model`. By default, model 2 is selected.

* An example of usage is: `python question_answer.py -image="examples/COCO_val2014_000000000136.jpg" -question="Which animal is this?" -model=2`

Here are some examples of predictions using the 2-VIS+LSTM model (the images are in the `examples` folder).

| Image | Question                    | Top Answers (left to right) |
|-------|-----------------------------|-----------------------------|
|       | Which animal is this?       | giraffe, cat, bear          |
|       | Which vehicle is this?      | motorcycle, taxi, train     |
|       | How many dishes are there?  | 5, 3, 2                     |
|       | What is in the bottle?      | water, beer, wine           |
|       | Which sport is this?        | tennis, baseball, frisbee   |

--------------------------------------------------------------------------------
/coco_features/README.md:
--------------------------------------------------------------------------------

Download the precomputed MS COCO features from http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip and extract them in this folder.
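Once extracted, this folder should contain `vgg_feats.mat` and `coco_vgg_IDMap.txt`, which `prepare_data.py` reads. A minimal sanity-check sketch, assuming the archive extracts to exactly these filenames:

```python
import scipy.io

# 'feats' is a 4096 x num_images matrix of VGG features; each image is a column.
features = scipy.io.loadmat('coco_features/vgg_feats.mat')['feats']
print(features.shape)  # expected: (4096, <number of images>)

# Each line of the ID map is "<coco_image_id> <column_index>".
with open('coco_features/coco_vgg_IDMap.txt') as f:
    coco_id, column = f.readline().split()
    print(coco_id, column)
```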
--------------------------------------------------------------------------------
/data.txt:
--------------------------------------------------------------------------------

'Tue Dec 20 00:00:00 2016 -0400942519' ; git add data.txt; GIT_AUTHOR_DATE='Tue Dec 20 00:00:00 2016 -0400' GIT_COMMITTER_DATE='Tue Dec 20 00:00:00 2016 -0400' git commit -m 'Update CNN'; git push;

--------------------------------------------------------------------------------
/data/train_qa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/data/train_qa

--------------------------------------------------------------------------------
/data/val_qa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/data/val_qa

--------------------------------------------------------------------------------
/embedding.py:
--------------------------------------------------------------------------------

import argparse
import numpy as np
import h5py
import pickle

def load():
    path = 'embeddings/embedding_matrix.h5'
    with h5py.File(path, 'r') as hf:
        data = hf.get('embedding_matrix')
        embedding_matrix = np.array(data)
    return embedding_matrix

def load_idx():
    path = 'embeddings/word_idx'
    with open(path, 'rb') as file:
        word_idx = pickle.load(file)
    return word_idx

def create(glove_path):
    embedding_matrix_path = 'embeddings/embedding_matrix.h5'
    word_idx_path = 'embeddings/word_idx'
    embeddings = {}
    word_idx = {}

    with open(glove_path, 'r') as f:
        for i, line in enumerate(f):
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings[word] = coefs
            word_idx[word] = i + 1  # index 0 is reserved for out-of-vocabulary words

    num_words = len(word_idx)
    embedding_matrix = np.zeros((1 + num_words, 300))

    # Place each word's vector at the row recorded in word_idx, so the matrix
    # rows stay consistent with the stored word indices.
    for word, idx in word_idx.items():
        embedding_matrix[idx] = embeddings[word]

    with h5py.File(embedding_matrix_path, 'w') as hf:
        hf.create_dataset('embedding_matrix', data=embedding_matrix)

    with open(word_idx_path, 'wb') as f:
        pickle.dump(word_idx, f)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-address', type=str, required=True)
    args = parser.parse_args()
    print('Preparing embeddings ...')
    create(args.address)

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/embeddings/README.md:
--------------------------------------------------------------------------------

## Instructions for preparing embeddings

Download and extract the pretrained Common Crawl 300-dimensional GloVe word vectors from http://nlp.stanford.edu/data/glove.840B.300d.zip.
Use the script `embedding.py` to generate the embedding matrix and word indices. The usage is as follows:

```
$ python embedding.py -address address-of-extracted-glove-file
```
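This writes `embedding_matrix.h5` and `word_idx` into this folder; the rest of the code loads them through `embedding.load()` and `embedding.load_idx()`. A minimal sketch of how the two fit together (the word "giraffe" is just an illustrative example):

```python
import embedding

embedding_matrix = embedding.load()  # (vocabulary size + 1) x 300 matrix
word_idx = embedding.load_idx()      # word -> row index; 0 means out-of-vocabulary

vector = embedding_matrix[word_idx.get('giraffe', 0)]
print(vector.shape)  # (300,)
```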
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000073.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000073.jpg

--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000136.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000136.jpg

--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000196.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000196.jpg

--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000283.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000283.jpg

--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000357.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000357.jpg

--------------------------------------------------------------------------------
/examples/model1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/model1.png

--------------------------------------------------------------------------------
/examples/model2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/model2.png

--------------------------------------------------------------------------------
/models.py:
--------------------------------------------------------------------------------

import numpy as np
import embedding
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Merge, Reshape, Dropout, Convolution2D, MaxPooling2D, ZeroPadding2D, Flatten

def vis_lstm():
    # Frozen GloVe embeddings for the question words.
    embedding_matrix = embedding.load()
    embedding_model = Sequential()
    embedding_model.add(Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=False))

    # Learned linear projection of the 4096-d VGG features into the word
    # embedding space, treated as the first element of the sequence.
    image_model = Sequential()
    image_model.add(Dense(
        embedding_matrix.shape[1],
        input_dim=4096,
        activation='linear'))
    image_model.add(Reshape((1, embedding_matrix.shape[1])))

    main_model = Sequential()
    main_model.add(Merge(
        [image_model, embedding_model],
        mode='concat',
        concat_axis=1))
    main_model.add(LSTM(1001))
    main_model.add(Dropout(0.5))
    main_model.add(Dense(1001, activation='softmax'))

    return main_model

def vis_lstm_2():
    embedding_matrix = embedding.load()
    embedding_model = Sequential()
    embedding_model.add(Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=False))

    # Two separately learned projections of the same image features, placed
    # at the start and at the end of the question sequence.
    image_model_1 = Sequential()
    image_model_1.add(Dense(
        embedding_matrix.shape[1],
        input_dim=4096,
        activation='linear'))
    image_model_1.add(Reshape((1, embedding_matrix.shape[1])))

    image_model_2 = Sequential()
    image_model_2.add(Dense(
        embedding_matrix.shape[1],
        input_dim=4096,
        activation='linear'))
    image_model_2.add(Reshape((1, embedding_matrix.shape[1])))

    main_model = Sequential()
    main_model.add(Merge(
        [image_model_1, embedding_model, image_model_2],
        mode='concat',
        concat_axis=1))
    main_model.add(LSTM(1001))
    main_model.add(Dropout(0.5))
    main_model.add(Dense(1001, activation='softmax'))

    return main_model

def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

    if weights_path:
        model.load_weights(weights_path)

    return model
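# A minimal smoke test (an assumption, not part of the original file): run from
# the repo root with the embedding matrix already prepared. The input order
# matches train.py: image features first, then the padded question indices.
if __name__ == '__main__':
    model = vis_lstm()
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    img_features = np.random.rand(2, 4096)          # one 4096-d VGG vector per image
    questions = np.random.randint(1, 100, (2, 10))  # two padded word-index sequences
    print(model.predict([img_features, questions]).shape)  # (2, 1001)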
--------------------------------------------------------------------------------
/prepare_data.py:
--------------------------------------------------------------------------------

import numpy as np
import pandas as pd
import embedding as ebd
import operator
import sys
import scipy.io
from collections import defaultdict
from nltk import word_tokenize
from keras.preprocessing.sequence import pad_sequences

def int_to_answers():
    data_path = 'data/train_qa'
    df = pd.read_pickle(data_path)
    answers = df[['multiple_choice_answer']].values.tolist()
    freq = defaultdict(int)
    for answer in answers:
        freq[answer[0].lower()] += 1
    # Keep the 1000 most frequent answers, ordered by frequency.
    int_to_answer = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)[0:1000]
    int_to_answer = [answer[0] for answer in int_to_answer]
    return int_to_answer

top_answers = int_to_answers()

def answers_to_onehot():
    top_answers = int_to_answers()
    answer_to_onehot = {}
    for i, word in enumerate(top_answers):
        onehot = np.zeros(1001)
        onehot[i] = 1.0
        answer_to_onehot[word] = onehot
    return answer_to_onehot

answer_to_onehot_dict = answers_to_onehot()

def get_answers_matrix(split):
    if split == 'train':
        data_path = 'data/train_qa'
    elif split == 'val':
        data_path = 'data/val_qa'
    else:
        print('Invalid split!')
        sys.exit()

    df = pd.read_pickle(data_path)
    answers = df[['multiple_choice_answer']].values.tolist()
    answer_matrix = np.zeros((len(answers), 1001))
    # Answers outside the top 1000 fall into the final "other" class.
    default_onehot = np.zeros(1001)
    default_onehot[1000] = 1.0

    for i, answer in enumerate(answers):
        answer_matrix[i] = answer_to_onehot_dict.get(answer[0].lower(), default_onehot)

    return answer_matrix

def get_questions_matrix(split):
    if split == 'train':
        data_path = 'data/train_qa'
    elif split == 'val':
        data_path = 'data/val_qa'
    else:
        print('Invalid split!')
        sys.exit()

    df = pd.read_pickle(data_path)
    questions = df[['question']].values.tolist()
    word_idx = ebd.load_idx()
    seq_list = []

    for question in questions:
        words = word_tokenize(question[0])
        seq = []
        for word in words:
            seq.append(word_idx.get(word, 0))
        seq_list.append(seq)
    question_matrix = pad_sequences(seq_list)

    return question_matrix

def get_coco_features(split):
    if split == 'train':
        data_path = 'data/train_qa'
    elif split == 'val':
        data_path = 'data/val_qa'
    else:
        print('Invalid split!')
        sys.exit()

    id_map_path = 'coco_features/coco_vgg_IDMap.txt'
    features_path = 'coco_features/vgg_feats.mat'

    img_labels = pd.read_pickle(data_path)[['image_id']].values.tolist()
    img_ids = open(id_map_path).read().splitlines()
    features_struct = scipy.io.loadmat(features_path)

    # Map each COCO image id to its column in the feature matrix.
    id_map = {}
    for ids in img_ids:
        ids_split = ids.split()
        id_map[int(ids_split[0])] = int(ids_split[1])

    VGGfeatures = features_struct['feats']
    nb_dimensions = VGGfeatures.shape[0]
    nb_images = len(img_labels)
    image_matrix = np.zeros((nb_images, nb_dimensions))

    for i in range(nb_images):
        image_matrix[i, :] = VGGfeatures[:, id_map[img_labels[i][0]]]

    return image_matrix
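# A small illustration (an assumption, not part of the original file): with the
# pickled QA data and the word index in place, the matrices line up row by row.
# pad_sequences left-pads shorter questions with zeros by default.
if __name__ == '__main__':
    questions = get_questions_matrix('val')
    answers = get_answers_matrix('val')
    print(questions.shape)  # (num_questions, length of the longest question)
    print(answers.shape)    # (num_questions, 1001)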
--------------------------------------------------------------------------------
/question_answer.py:
--------------------------------------------------------------------------------

import numpy as np
import embedding as ebd
import prepare_data
import models
import argparse
import sys
import keras.backend as K
from nltk import word_tokenize
from keras.applications.vgg16 import preprocess_input
from keras.preprocessing import image
from keras.models import load_model

def extract_image_features(img_path):
    model = models.VGG_16('weights/vgg16_weights.h5')
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    # Take the input of the final softmax layer, i.e. the 4096-d activations of
    # the last fully connected layer (learning phase 0 = test mode).
    last_layer_output = K.function([model.layers[0].input, K.learning_phase()],
                                   [model.layers[-1].input])
    features = last_layer_output([x, 0])[0]
    return features

def preprocess_question(question):
    word_idx = ebd.load_idx()
    tokens = word_tokenize(question)
    seq = []
    for token in tokens:
        seq.append(word_idx.get(token, 0))
    seq = np.reshape(seq, (1, len(seq)))
    return seq

def generate_answer(img_path, question, model_num):
    # The selector is kept as model_num so that it is not shadowed by the
    # loaded Keras model when choosing the input layout below.
    model_path = 'weights/model_' + str(model_num) + '.h5'
    model = load_model(model_path)
    img_features = extract_image_features(img_path)
    seq = preprocess_question(question)
    if model_num == 1:
        x = [img_features, seq]
    else:
        x = [img_features, seq, img_features]
    probabilities = model.predict(x)[0]
    answers = np.argsort(probabilities[:1000])
    top_answers = [prepare_data.top_answers[answers[-1]],
                   prepare_data.top_answers[answers[-2]],
                   prepare_data.top_answers[answers[-3]]]

    return top_answers

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-image', type=str, required=True)
    parser.add_argument('-question', type=str, required=True)
    parser.add_argument('-model', type=int, default=2)
    args = parser.parse_args()
    if args.model != 1 and args.model != 2:
        print('Invalid model selection.')
        sys.exit()
    top_answers = generate_answer(args.image, args.question, args.model)
    print('Top answers: %s, %s, %s.' % (top_answers[0], top_answers[1], top_answers[2]))

if __name__ == '__main__':
    main()
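# Programmatic usage sketch (an assumption, not part of the original file):
# equivalent to the CLI example in the README, and requires the trained model
# weights and the VGG-16 weights to be present under weights/.
#
#   from question_answer import generate_answer
#   print(generate_answer('examples/COCO_val2014_000000000136.jpg',
#                         'Which animal is this?', 2))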
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------

# Creates a series of backdated dummy commits by repeatedly rewriting data.txt.
from datetime import date, timedelta
from random import randint
from time import sleep
import sys
import subprocess
import os


def get_date_string(n, startdate):
    d = startdate - timedelta(days=n)
    rtn = d.strftime("%a %b %d %X %Y %z -0400")
    return rtn

# main app
def main(argv):
    if len(argv) < 1 or len(argv) > 2:
        print("Error: Bad input.")
        sys.exit(1)
    n = int(argv[0])
    if len(argv) == 1:
        startdate = date.today()
    if len(argv) == 2:
        startdate = date(int(argv[1][0:4]), int(argv[1][5:7]), int(argv[1][8:10]))
    i = 0
    while i <= n:
        curdate = get_date_string(i, startdate)
        num_commits = randint(1, 10)
        for commit in range(0, num_commits):
            subprocess.call("echo '" + curdate + str(randint(0, 1000000)) + "' > data.txt; git add data.txt; GIT_AUTHOR_DATE='" + curdate + "' GIT_COMMITTER_DATE='" + curdate + "' git commit -m 'Update CNN'; git push;", shell=True)
            sleep(.5)
        i += 1
    subprocess.call("git rm data.txt; git commit -m 'Reconfigure Model'; git push;", shell=True)

if __name__ == "__main__":
    main(sys.argv[1:])
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------

import numpy as np
import prepare_data
import models
import argparse
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-num_epochs', type=int, default=25)
    parser.add_argument('-batch_size', type=int, default=200)
    parser.add_argument('-model', type=int, default=1)
    args = parser.parse_args()

    print('Loading questions ...')
    questions_train = prepare_data.get_questions_matrix('train')
    questions_val = prepare_data.get_questions_matrix('val')
    print('Loading answers ...')
    answers_train = prepare_data.get_answers_matrix('train')
    answers_val = prepare_data.get_answers_matrix('val')
    print('Loading image features ...')
    img_features_train = prepare_data.get_coco_features('train')
    img_features_val = prepare_data.get_coco_features('val')
    print('Creating model ...')

    if args.model == 1:
        model = models.vis_lstm()
        X_train = [img_features_train, questions_train]
        X_val = [img_features_val, questions_val]
        model_path = 'weights/model_1.h5'
    elif args.model == 2:
        model = models.vis_lstm_2()
        X_train = [img_features_train, questions_train, img_features_train]
        X_val = [img_features_val, questions_val, img_features_val]
        model_path = 'weights/model_2.h5'
    else:
        print('Invalid model selection!\nAvailable choices: 1 for vis-lstm and 2 for 2-vis-lstm.')
        sys.exit()

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(X_train, answers_train,
              nb_epoch=args.num_epochs,
              batch_size=args.batch_size,
              validation_data=(X_val, answers_val),
              verbose=1)

    model.save(model_path)

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/weights/README.md:
--------------------------------------------------------------------------------

Download the pretrained VGG-16 weights from https://drive.google.com/file/d/0Bz7KyqmuGsilT0J5dmRCM0ROVHc/view and place them in this folder. This is required for making predictions on your own images.

--------------------------------------------------------------------------------