├── README.md
├── coco_features
│   ├── README.md
│   └── coco_vgg_IDMap.txt
├── data
│   ├── train_qa
│   └── val_qa
├── data.txt
├── embedding.py
├── embeddings
│   └── README.md
├── examples
│   ├── COCO_val2014_000000000073.jpg
│   ├── COCO_val2014_000000000136.jpg
│   ├── COCO_val2014_000000000196.jpg
│   ├── COCO_val2014_000000000283.jpg
│   ├── COCO_val2014_000000000357.jpg
│   ├── model1.png
│   └── model2.png
├── models.py
├── prepare_data.py
├── question_answer.py
├── test.py
├── train.py
└── weights
    └── README.md
/README.md:
--------------------------------------------------------------------------------
1 | # Convolutional Neural Network - Image Question Answering
2 | This is a Python and Keras implementation of the VIS+LSTM visual question answering model described in the paper [Exploring Models and Data for Image Question Answering](https://arxiv.org/abs/1505.02074). A second model is also implemented; it is similar to the paper's 2-VIS+BLSTM model, except that its LSTMs are not bidirectional.
3 | This second model takes the image features as two inputs, at the start and at the end of the question, each with its own learned linear transformation. We call it 2-VIS+LSTM.
4 |
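As a quick orientation, here is a minimal sketch of how the two models consume their inputs, based on the definitions in `models.py` and the input assembly in `train.py` (Keras 1.x `Sequential`/`Merge` API):

```python
import models

# VIS+LSTM: the 4096-d VGG image feature is projected to the word-embedding
# size (300) and prepended to the question as a single pseudo-word.
model1 = models.vis_lstm()    # inputs: [image_features, question_sequence]

# 2-VIS+LSTM: two independently learned linear projections of the same image
# feature bracket the question at its start and end.
model2 = models.vis_lstm_2()  # inputs: [image_features, question_sequence, image_features]
```
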
5 | Details about the dataset can be found on the [VisualQA website](http://www.visualqa.org/).
6 |
7 | Here is a summary of the performance we obtained with both models.
8 |
9 | | Model | Epochs | Batch Size | Validation Accuracy |
10 | |------------|--------|------------|---------------------|
11 | | VIS+LSTM | 10 | 200 | 53.27% |
12 | | 2-VIS+LSTM | 10 | 200 | 54.01% |
13 |
14 | ## Requirements
15 |
16 | * Python 2.7
17 | * NumPy
18 | * SciPy (for loading the precomputed MS COCO features)
19 | * NLTK (for tokenization)
20 | * Keras
21 | * Theano
22 |
23 | ## Training
24 |
25 | * The basic usage is `python train.py`.
26 |
27 | * The model to train can be specified using the option `-model`. For example, to train the VIS+LSTM model enter `python train.py -model=1`. Similarly, the 2-VIS+LSTM model can be trained using `python train.py -model=2`. If no model is specified, model 1 is selected.
28 |
29 | * The batch size and the number of epochs can be specified using the options `-batch_size` and `-num_epochs`; the defaults are 200 and 25, respectively.
30 |
31 | * To train 2-VIS+LSTM with a batch size of 100 for 10 epochs, we would use: `python train.py -model=2 -batch_size=100 -num_epochs=10`.
32 |
33 | ## Models
34 |
35 | ### VIS+LSTM
36 |
37 | ![VIS+LSTM architecture](examples/model1.png)
38 |
39 | ### 2-VIS+LSTM
40 |
41 | ![2-VIS+LSTM architecture](examples/model2.png)
42 |
43 | ## Prediction
44 |
45 | * Q&A can be performed on any image using the script `question_answer.py`.
46 |
47 | * The options `-question` and `-image` specify the question and the path of the image, respectively. The model to use for the prediction can be specified with `-model`; by default, model 2 is selected.
48 |
49 | * An example of usage is: `python question_answer.py -image="examples/COCO_val2014_000000000136.jpg" -question="Which animal is this?" -model=2`
50 |
51 | Here are some examples of predictions using the 2-VIS+LSTM model.
52 |
53 | | Image | Question | Top Answers (left to right) |
54 | |---------------------|----------------------------|-----------------------------|
55 | | *(see `examples/`)* | Which animal is this?      | giraffe, cat, bear          |
56 | | *(see `examples/`)* | Which vehicle is this?     | motorcycle, taxi, train     |
57 | | *(see `examples/`)* | How many dishes are there? | 5, 3, 2                     |
58 | | *(see `examples/`)* | What is in the bottle?     | water, beer, wine           |
59 | | *(see `examples/`)* | Which sport is this?       | tennis, baseball, frisbee   |
60 |
--------------------------------------------------------------------------------
/coco_features/README.md:
--------------------------------------------------------------------------------
1 | Download the precomputed MS COCO VGG features from http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip and extract them in this folder. `prepare_data.py` expects to find `vgg_feats.mat` here, alongside the provided `coco_vgg_IDMap.txt`.
2 |
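For reference, this is roughly how `prepare_data.get_coco_features` reads the extracted files (a sketch mirroring that function, assuming the standard `vgg_feats.mat` layout with a `feats` matrix of shape `(4096, num_images)`):

```python
import scipy.io

# Each line of coco_vgg_IDMap.txt maps a COCO image id to a feature column index.
id_map = {}
with open('coco_features/coco_vgg_IDMap.txt') as f:
    for line in f:
        coco_id, feat_idx = line.split()
        id_map[int(coco_id)] = int(feat_idx)

# Columns of 'feats' are the 4096-d VGG features, one per image.
features = scipy.io.loadmat('coco_features/vgg_feats.mat')['feats']
```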
--------------------------------------------------------------------------------
/data.txt:
--------------------------------------------------------------------------------
1 | 'Tue Dec 20 00:00:00 2016 -0400942519' ; git add data.txt; GIT_AUTHOR_DATE='Tue Dec 20 00:00:00 2016 -0400' GIT_COMMITTER_DATE='Tue Dec 20 00:00:00 2016 -0400' git commit -m 'Update CNN'; git push;
2 |
--------------------------------------------------------------------------------
/data/train_qa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/data/train_qa
--------------------------------------------------------------------------------
/data/val_qa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/data/val_qa
--------------------------------------------------------------------------------
/embedding.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import numpy as np
3 | import h5py
4 | import pickle
5 | def load():
6 | path = 'embeddings/embedding_matrix.h5'
7 | with h5py.File(path,'r') as hf:
8 | data = hf.get('embedding_matrix')
9 | embedding_matrix = np.array(data)
10 | return embedding_matrix
11 |
12 | def load_idx():
13 | path = 'embeddings/word_idx'
14 |     with open(path,'rb') as file:
15 | word_idx = pickle.load(file)
16 | return word_idx
17 |
18 | def create(glove_path):
19 | embedding_matrix_path = 'embeddings/embedding_matrix.h5'
20 | word_idx_path = 'embeddings/word_idx'
21 | embeddings = {}
22 | word_idx = {}
23 |
24 | with open(glove_path,'r') as f:
25 | for i, line in enumerate(f):
26 | values = line.split()
27 | word = values[0]
28 | coefs = np.asarray(values[1:],dtype='float32')
29 | embeddings[word] = coefs
30 | word_idx[word] = i+1
31 |
32 | num_words = len(word_idx)
33 | embedding_matrix = np.zeros((1+num_words,300))
34 |
35 |     for word, idx in word_idx.items():
36 |         embedding_matrix[idx] = embeddings[word]
37 |
38 | with h5py.File(embedding_matrix_path, 'w') as hf:
39 | hf.create_dataset('embedding_matrix',data=embedding_matrix)
40 |
41 |     with open(word_idx_path,'wb') as f:
42 | pickle.dump(word_idx,f)
43 |
44 | def main():
45 | parser = argparse.ArgumentParser()
46 | parser.add_argument('-address', type=str, required=True)
47 | args = parser.parse_args()
48 | print('Preparing embeddings ...')
49 | create(args.address)
50 |
51 | if __name__ == '__main__': main()
52 |
--------------------------------------------------------------------------------
/embeddings/README.md:
--------------------------------------------------------------------------------
1 | ## Instructions for preparing embeddings
2 |
3 | Download and extract the pretrained Common Crawl 300-dimensional GloVe word vectors from http://nlp.stanford.edu/data/glove.840B.300d.zip.
4 | Use the script `embedding.py` to generate the embedding matrix and word indices. The usage is as follows:
5 |
6 | ```
7 | $ python embedding.py -address address-of-extracted-glove-file
8 | ```
9 |
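Running the script writes two files into this folder, `embedding_matrix.h5` and `word_idx` (a pickled word-to-index dict), which the rest of the code loads through `embedding.py`:

```python
import embedding

embedding_matrix = embedding.load()  # numpy array, one 300-d row per word (row 0 is all zeros)
word_idx = embedding.load_idx()      # dict mapping each GloVe word to its row index
```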
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000073.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000073.jpg
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000136.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000136.jpg
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000196.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000196.jpg
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000283.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000283.jpg
--------------------------------------------------------------------------------
/examples/COCO_val2014_000000000357.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/COCO_val2014_000000000357.jpg
--------------------------------------------------------------------------------
/examples/model1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/model1.png
--------------------------------------------------------------------------------
/examples/model2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ayushoriginal/NeuralNetwork-ImageQA/ea83adee934b00afef38f4fefc1d89078ba7709e/examples/model2.png
--------------------------------------------------------------------------------
/models.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import embedding
3 | from keras.models import Sequential
4 | from keras.layers import Dense, Embedding, LSTM, Merge, Reshape, Dropout, Convolution2D, MaxPooling2D, ZeroPadding2D, Flatten
5 |
6 | def vis_lstm():
7 | embedding_matrix = embedding.load()
8 | embedding_model = Sequential()
9 | embedding_model.add(Embedding(
10 | embedding_matrix.shape[0],
11 | embedding_matrix.shape[1],
12 | weights = [embedding_matrix],
13 | trainable = False))
14 |
15 | image_model = Sequential()
16 | image_model.add(Dense(
17 | embedding_matrix.shape[1],
18 | input_dim=4096,
19 | activation='linear'))
20 | image_model.add(Reshape((1,embedding_matrix.shape[1])))
21 |
22 | main_model = Sequential()
23 | main_model.add(Merge(
24 | [image_model,embedding_model],
25 | mode = 'concat',
26 | concat_axis = 1))
27 | main_model.add(LSTM(1001))
28 | main_model.add(Dropout(0.5))
29 | main_model.add(Dense(1001,activation='softmax'))
30 |
31 | return main_model
32 |
33 | def vis_lstm_2():
34 | embedding_matrix = embedding.load()
35 | embedding_model = Sequential()
36 | embedding_model.add(Embedding(
37 | embedding_matrix.shape[0],
38 | embedding_matrix.shape[1],
39 | weights = [embedding_matrix],
40 | trainable = False))
41 |
42 | image_model_1 = Sequential()
43 | image_model_1.add(Dense(
44 | embedding_matrix.shape[1],
45 | input_dim=4096,
46 | activation='linear'))
47 | image_model_1.add(Reshape((1,embedding_matrix.shape[1])))
48 |
49 | image_model_2 = Sequential()
50 | image_model_2.add(Dense(
51 | embedding_matrix.shape[1],
52 | input_dim=4096,
53 | activation='linear'))
54 | image_model_2.add(Reshape((1,embedding_matrix.shape[1])))
55 |
56 | main_model = Sequential()
57 | main_model.add(Merge(
58 | [image_model_1,embedding_model,image_model_2],
59 | mode = 'concat',
60 | concat_axis = 1))
61 | main_model.add(LSTM(1001))
62 | main_model.add(Dropout(0.5))
63 | main_model.add(Dense(1001,activation='softmax'))
64 |
65 | return main_model
66 |
67 | def VGG_16(weights_path=None):
68 | model = Sequential()
69 | model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
70 | model.add(Convolution2D(64, 3, 3, activation='relu'))
71 | model.add(ZeroPadding2D((1,1)))
72 | model.add(Convolution2D(64, 3, 3, activation='relu'))
73 | model.add(MaxPooling2D((2,2), strides =(2,2)))
74 |
75 | model.add(ZeroPadding2D((1,1)))
76 | model.add(Convolution2D(128, 3, 3, activation='relu'))
77 | model.add(ZeroPadding2D((1,1)))
78 | model.add(Convolution2D(128, 3, 3, activation='relu'))
79 | model.add(MaxPooling2D((2,2), strides =(2,2)))
80 |
81 | model.add(ZeroPadding2D((1,1)))
82 | model.add(Convolution2D(256, 3, 3, activation='relu'))
83 | model.add(ZeroPadding2D((1,1)))
84 | model.add(Convolution2D(256, 3, 3, activation='relu'))
85 | model.add(ZeroPadding2D((1,1)))
86 | model.add(Convolution2D(256, 3, 3, activation='relu'))
87 | model.add(MaxPooling2D((2,2), strides =(2,2)))
88 |
89 | model.add(ZeroPadding2D((1,1)))
90 | model.add(Convolution2D(512, 3, 3, activation='relu'))
91 | model.add(ZeroPadding2D((1,1)))
92 | model.add(Convolution2D(512, 3, 3, activation='relu'))
93 | model.add(ZeroPadding2D((1,1)))
94 | model.add(Convolution2D(512, 3, 3, activation='relu'))
95 | model.add(MaxPooling2D((2,2), strides =(2,2)))
96 |
97 | model.add(ZeroPadding2D((1,1)))
98 | model.add(Convolution2D(512, 3, 3, activation='relu'))
99 | model.add(ZeroPadding2D((1,1)))
100 | model.add(Convolution2D(512, 3, 3, activation='relu'))
101 | model.add(ZeroPadding2D((1,1)))
102 | model.add(Convolution2D(512, 3, 3, activation='relu'))
103 | model.add(MaxPooling2D((2,2), strides =(2,2)))
104 |
105 | model.add(Flatten())
106 | model.add(Dense(4096, activation='relu'))
107 | model.add(Dropout(0.5))
108 | model.add(Dense(4096, activation='relu'))
109 | model.add(Dropout(0.5))
110 | model.add(Dense(1000, activation='softmax'))
111 |
112 | if weights_path:
113 | model.load_weights(weights_path)
114 |
115 | return model
116 |
--------------------------------------------------------------------------------
/prepare_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import embedding as ebd
4 | import operator
5 | import sys
6 | import scipy.io
7 | from collections import defaultdict
8 | from nltk import word_tokenize
9 | from keras.preprocessing.sequence import pad_sequences
10 |
11 | def int_to_answers():
12 | data_path = 'data/train_qa'
13 | df = pd.read_pickle(data_path)
14 | answers = df[['multiple_choice_answer']].values.tolist()
15 | freq = defaultdict(int)
16 | for answer in answers:
17 | freq[answer[0].lower()] += 1
18 | int_to_answer = sorted(freq.items(),key=operator.itemgetter(1),reverse=True)[0:1000]
19 | int_to_answer = [answer[0] for answer in int_to_answer]
20 | return int_to_answer
21 |
22 | top_answers = int_to_answers()  # the 1000 most frequent training answers
23 |
24 | def answers_to_onehot():
25 | top_answers = int_to_answers()
26 | answer_to_onehot = {}
27 | for i, word in enumerate(top_answers):
28 | onehot = np.zeros(1001)
29 | onehot[i] = 1.0
30 | answer_to_onehot[word] = onehot
31 | return answer_to_onehot
32 |
33 | answer_to_onehot_dict = answers_to_onehot()  # answer string -> one-hot vector of length 1001
34 |
35 | def get_answers_matrix(split):
36 | if split == 'train':
37 | data_path = 'data/train_qa'
38 | elif split == 'val':
39 | data_path = 'data/val_qa'
40 | else:
41 | print('Invalid split!')
42 | sys.exit()
43 |
44 | df = pd.read_pickle(data_path)
45 | answers = df[['multiple_choice_answer']].values.tolist()
46 | answer_matrix = np.zeros((len(answers),1001))
47 | default_onehot = np.zeros(1001)
48 |     default_onehot[1000] = 1.0  # bucket for answers outside the top 1000
49 |
50 | for i, answer in enumerate(answers):
51 | answer_matrix[i] = answer_to_onehot_dict.get(answer[0].lower(),default_onehot)
52 |
53 | return answer_matrix
54 |
55 | def get_questions_matrix(split):
56 | if split == 'train':
57 | data_path = 'data/train_qa'
58 | elif split == 'val':
59 | data_path = 'data/val_qa'
60 | else:
61 | print('Invalid split!')
62 | sys.exit()
63 |
64 | df = pd.read_pickle(data_path)
65 | questions = df[['question']].values.tolist()
66 | word_idx = ebd.load_idx()
67 | seq_list = []
68 |
69 | for question in questions:
70 | words = word_tokenize(question[0])
71 | seq = []
72 | for word in words:
73 | seq.append(word_idx.get(word,0))
74 | seq_list.append(seq)
75 | question_matrix = pad_sequences(seq_list)
76 |
77 | return question_matrix
78 |
79 | def get_coco_features(split):
80 | if split == 'train':
81 | data_path = 'data/train_qa'
82 | elif split == 'val':
83 | data_path = 'data/val_qa'
84 | else:
85 | print('Invalid split!')
86 | sys.exit()
87 |
88 | id_map_path = 'coco_features/coco_vgg_IDMap.txt'
89 | features_path = 'coco_features/vgg_feats.mat'
90 |
91 | img_labels = pd.read_pickle(data_path)[['image_id']].values.tolist()
92 | img_ids = open(id_map_path).read().splitlines()
93 |     features_struct = scipy.io.loadmat(features_path)
94 |
95 | id_map = {}
96 | for ids in img_ids:
97 | ids_split = ids.split()
98 | id_map[int(ids_split[0])] = int(ids_split[1])
99 |
100 |     VGGfeatures = features_struct['feats']  # shape: (num_dimensions, num_images)
101 | nb_dimensions = VGGfeatures.shape[0]
102 | nb_images = len(img_labels)
103 | image_matrix = np.zeros((nb_images,nb_dimensions))
104 |
105 | for i in range(nb_images):
106 | image_matrix[i,:] = VGGfeatures[:,id_map[img_labels[i][0]]]
107 |
108 | return image_matrix
109 |
--------------------------------------------------------------------------------
/question_answer.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import embedding as ebd
3 | import prepare_data
4 | import models
5 | import argparse
6 | import sys
7 | import keras.backend as K
8 | from nltk import word_tokenize
9 | from keras.applications.vgg16 import preprocess_input
10 | from keras.preprocessing import image
11 | from keras.models import load_model
12 |
13 | def extract_image_features(img_path):
14 | model = models.VGG_16('weights/vgg16_weights.h5')
15 | img = image.load_img(img_path,target_size=(224,224))
16 | x = image.img_to_array(img)
17 | x = np.expand_dims(x,axis=0)
18 | x = preprocess_input(x)
19 |     last_layer_output = K.function([model.layers[0].input, K.learning_phase()],
20 |                                    [model.layers[-1].input])  # input to the softmax layer = 4096-d fc2 features
21 |     features = last_layer_output([x, 0])[0]  # 0 = test phase
22 | return features
23 |
24 | def preprocess_question(question):
25 | word_idx = ebd.load_idx()
26 | tokens = word_tokenize(question)
27 | seq = []
28 | for token in tokens:
29 | seq.append(word_idx.get(token,0))
30 | seq = np.reshape(seq,(1,len(seq)))
31 | return seq
32 |
33 | def generate_answer(img_path, question, model_num):
34 |     model_path = 'weights/model_' + str(model_num) + '.h5'
35 |     model = load_model(model_path)
36 |     img_features = extract_image_features(img_path)
37 |     seq = preprocess_question(question)
38 |     if model_num == 1:  # VIS+LSTM takes [image, question]; model 2 appends the image features again
39 | x = [img_features, seq]
40 | else:
41 | x = [img_features, seq, img_features]
42 | probabilities = model.predict(x)[0]
43 | answers = np.argsort(probabilities[:1000])
44 | top_answers = [prepare_data.top_answers[answers[-1]],
45 | prepare_data.top_answers[answers[-2]],
46 | prepare_data.top_answers[answers[-3]]]
47 |
48 | return top_answers
49 |
50 | def main():
51 | parser = argparse.ArgumentParser()
52 | parser.add_argument('-image', type=str, required=True)
53 | parser.add_argument('-question', type=str, required=True)
54 | parser.add_argument('-model', type=int, default=2)
55 | args = parser.parse_args()
56 | if args.model != 1 and args.model != 2:
57 | print('Invalid model selection.')
58 | sys.exit()
59 | top_answers = generate_answer(args.image, args.question, args.model)
60 | print('Top answers: %s, %s, %s.' % (top_answers[0],top_answers[1],top_answers[2]))
61 |
62 | if __name__ == '__main__': main()
63 |
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | from datetime import date, timedelta
2 | from random import randint
3 | from time import sleep
4 | import sys
5 | import subprocess
6 | import os
7 |
8 |
9 | def get_date_string(n, startdate):
10 | d = startdate - timedelta(days=n)
11 |     rtn = d.strftime("%a %b %d %X %Y -0400")
12 | return rtn
13 |
14 | # Create n days of randomized backdated commits, counting back from startdate.
15 | def main(argv):
16 |     if len(argv) < 1 or len(argv) > 2:
17 |         print("Error: Bad input.")
18 |         sys.exit(1)
19 | n = int(argv[0])
20 | if len(argv) == 1:
21 | startdate = date.today()
22 | if len(argv) == 2:
23 | startdate = date(int(argv[1][0:4]), int(argv[1][5:7]), int(argv[1][8:10]))
24 | i = 0
25 | while i <= n:
26 | curdate = get_date_string(i, startdate)
27 | num_commits = randint(1, 10)
28 | for commit in range(0, num_commits):
29 | subprocess.call("echo '" + curdate + str(randint(0, 1000000)) +"' > data.txt; git add data.txt; GIT_AUTHOR_DATE='" + curdate + "' GIT_COMMITTER_DATE='" + curdate + "' git commit -m 'Update CNN'; git push;", shell=True)
30 | sleep(.5)
31 | i += 1
32 | subprocess.call("git rm data.txt; git commit -m 'Reconfigure Model'; git push;", shell=True)
33 |
34 | if __name__ == "__main__":
35 | main(sys.argv[1:])
36 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import prepare_data
3 | import models
4 | import argparse
5 | import sys
6 |
7 | def main():
8 | parser = argparse.ArgumentParser()
9 | parser.add_argument('-num_epochs', type=int, default=25)
10 | parser.add_argument('-batch_size', type=int, default=200)
11 | parser.add_argument('-model', type=int, default=1)
12 | args = parser.parse_args()
13 |
14 | print('Loading questions ...')
15 | questions_train = prepare_data.get_questions_matrix('train')
16 | questions_val = prepare_data.get_questions_matrix('val')
17 | print('Loading answers ...')
18 | answers_train = prepare_data.get_answers_matrix('train')
19 | answers_val = prepare_data.get_answers_matrix('val')
20 | print('Loading image features ...')
21 | img_features_train = prepare_data.get_coco_features('train')
22 | img_features_val = prepare_data.get_coco_features('val')
23 | print('Creating model ...')
24 |
25 | if args.model == 1:
26 | model = models.vis_lstm()
27 | X_train = [img_features_train, questions_train]
28 | X_val = [img_features_val, questions_val]
29 | model_path = 'weights/model_1.h5'
30 | elif args.model == 2:
31 | model = models.vis_lstm_2()
32 | X_train = [img_features_train, questions_train, img_features_train]
33 | X_val = [img_features_val, questions_val, img_features_val]
34 | model_path = 'weights/model_2.h5'
35 | else:
36 | print('Invalid model selection!\nAvailable choices: 1 for vis-lstm and 2 for 2-vis-lstm.')
37 | sys.exit()
38 |
39 | model.compile(optimizer='adam',
40 | loss='categorical_crossentropy',
41 | metrics=['accuracy'])
42 |
43 | model.fit(X_train,answers_train,
44 |               nb_epoch=args.num_epochs,  # Keras 1.x argument name ('epochs' in Keras 2)
45 | batch_size=args.batch_size,
46 | validation_data=(X_val,answers_val),
47 | verbose=1)
48 |
49 | model.save(model_path)
50 |
51 | if __name__ == '__main__': main()
52 |
53 |
--------------------------------------------------------------------------------
/weights/README.md:
--------------------------------------------------------------------------------
1 | Download the pretrained VGG-16 weights from https://drive.google.com/file/d/0Bz7KyqmuGsilT0J5dmRCM0ROVHc/view and place them in this folder as `vgg16_weights.h5` (the path `question_answer.py` expects). This is required for making predictions on your own images. `train.py` also saves the trained VQA models here as `model_1.h5` and `model_2.h5`.
2 |
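Once the file is in place, the feature extractor can be built as in `question_answer.py` (a usage sketch, not a new API):

```python
import models

# Loads the VGG-16 architecture defined in models.py with the downloaded weights;
# question_answer.py uses the input of its final softmax layer as a 4096-d image feature.
vgg = models.VGG_16('weights/vgg16_weights.h5')
```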
3 |
--------------------------------------------------------------------------------