├── .gitignore
├── LICENSE.txt
├── README.md
├── data
│   ├── README.md
│   ├── download.sh
│   └── preprocessed
│       └── README.md
├── experiments
│   ├── generate_submission_test.py
│   └── train_lstm_1_vqa_test.py
├── features
│   ├── README.md
│   ├── coco_vgg_IDMap.txt
│   └── download.sh
├── models
│   ├── README.md
│   ├── lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json
│   └── lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5
├── results
│   └── README.md
└── scripts
    ├── README.md
    ├── demo_batch.py
    ├── dumpText.py
    ├── evaluateLSTM.py
    ├── evaluateMLP.py
    ├── extract_features.py
    ├── features.py
    ├── get_started.sh
    ├── own_image.py
    ├── trainLSTM_1.py
    ├── trainLSTM_language.py
    ├── trainMLP.py
    ├── utils.py
    └── vgg_features.prototxt
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.pyo
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2015 Avi Singh
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep Learning for Visual Question Answering
2 |
3 | [Click here](https://avisingh599.github.io/deeplearning/visual-qa/) to go to the accompanying blog post.
4 |
5 | This project uses Keras to train a variety of **Feedforward** and **Recurrent Neural Networks** for the task of Visual Question Answering. It is designed to work with the [VQA](http://visualqa.org) dataset.
6 |
7 | Models Implemented:
8 |
9 | |BOW+CNN Model | LSTM + CNN Model |
10 | |--------------------------------------|-------------------------|
11 | | *(model diagram)* | *(model diagram)* |
12 |
13 |
14 | ## Requirements
15 | 1. [Keras 0.20](http://keras.io/)
16 | 2. [spaCy 0.94](http://spacy.io/)
17 | 3. [scikit-learn 0.16](http://scikit-learn.org/)
18 | 4. [progressbar](https://pypi.python.org/pypi/progressbar)
19 | 5. Nvidia CUDA 7.5 (optional, for GPU acceleration)
20 | 6. Caffe (Optional)
21 |
22 | Tested with Python 2.7 on Ubuntu 14.04 and CentOS 7.1.
23 |
24 | **Notes**:
25 |
26 | 1. Keras needs the latest Theano, which in turn needs Numpy/Scipy.
27 | 2. spaCy is currently used only for converting questions to a vector (or a sequence of vectors); this dependency can easily be removed if you want to (see the sketch below this list).
28 | 3. spaCy uses Goldberg and Levy's word vectors by default, but I found the performance to be much better with Stanford's [GloVe word vectors](http://nlp.stanford.edu/projects/glove/).
29 | 4. VQA Tools is **not** needed.
30 | 5. Caffe (optional): needed if you want to use the trained models with your own images.
31 |
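As a minimal sketch of point 2 above (the question string is made up; the calls mirror what `scripts/features.py` does), converting a question into a sequence of word vectors with spaCy looks like this:

    from spacy.en import English

    nlp = English()                          # spaCy pipeline with word vectors
    question = u'What is the man holding?'   # hypothetical question
    tokens = nlp(question)
    # one vector per token; stacking these per-token vectors is how the
    # (nb_samples, timesteps, word_vec_dim) question tensor is built during training
    word_vectors = [token.vector for token in tokens]
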
32 | ## Installation Guide
33 | This project has a large number of dependencies, and I have yet to write a comprehensive installation guide. In the meantime, you can use the following guide made by @gajumaru4444:
34 |
35 | 1. [Prepare for VQA in Ubuntu 14.04 x64 Part 1](https://gajumaru4444.github.io/2015/11/10/Visual-Question-Answering-2.html)
36 | 2. [Prepare for VQA in Ubuntu 14.04 x64 Part 2](https://gajumaru4444.github.io/2015/11/18/Visual-Question-Answering-3.html)
37 |
38 | If you intend to use my pre-trained models, you will also need to replace spaCy's default word vectors with Stanford's GloVe word vectors. You can find more details [here](http://spacy.io/tutorials/load-new-word-vectors/) on how to do this.
39 |
40 | ## Using Pre-trained models
41 | Take a look at `scripts/demo_batch.py`. An LSTM-based pre-trained model has been released. It currently works only on images from the MS COCO dataset (which need to be downloaded separately), since I have pre-computed the VGG features for them. I do intend to add a pipeline for computing features for other images.
42 |
43 | **Caution**: Use the pre-trained model with the 300D Common Crawl GloVe word embeddings. Do not use the default spaCy embeddings (Goldberg and Levy 2014); if you use these pre-trained models with any embeddings other than GloVe, your results will be **garbage**. You can find more details [here](http://spacy.io/tutorials/load-new-word-vectors/) on how to do this.
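
For reference, loading the released model comes down to the following (this mirrors the top of `scripts/demo_batch.py`; the paths assume you run from the `scripts` folder, and `labelencoder.pkl` is the answer label encoder produced by the training step):

    from keras.models import model_from_json
    from sklearn.externals import joblib

    model = model_from_json(open('../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json').read())
    model.load_weights('../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5')
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    # maps predicted class indices back to answer strings
    labelencoder = joblib.load('../models/labelencoder.pkl')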
44 |
45 | ## Using your own images
46 |
47 | Now you can use your own images with the `scripts/own_image.py` script. Use it like:
48 |
49 | python own_image.py --caffe /path/to/caffe
50 |
51 | For now, a Caffe installation is required. However, I'm working on a Keras-based VGG Net, which should be up soon. Download the VGG Caffe model weights from [here](http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_16_layers.caffemodel) and place the file in the `scripts` folder.
52 |
53 | ## The Numbers
54 | Performance on the **validation set** and the **test-dev set** of the [VQA Challenge](http://visualqa.org/challenge.html):
55 |
56 | | Model | val | test-dev |
57 | | ---------------------|:-------------:|:-------------:|
58 | | BOW+CNN | 48.46% | TODO |
59 | | LSTM-Language only | 44.17% | TODO |
60 | | LSTM+CNN | 51.63% | 53.34% |
61 |
62 | Note: The validation-set numbers are for a model trained on the training set only, while the test-dev numbers are for a model trained on both the training and validation sets.
63 |
64 | There is a **lot** of scope for hyperparameter tuning here. Experiments were done for 100 epochs.
65 |
66 | Training Time on various hardware:
67 |
68 | | Model | GTX 760 | Intel Core i7 |
69 | | ---------------------|:-------------------:|:-------------------:|
70 | | BOW+CNN | 140 seconds/epoch | 900 seconds/epoch |
71 | | LSTM+CNN | 200 seconds/epoch | 1900 seconds/epoch |
72 |
73 | The above numbers are for a batch size of `128`, training on about 215K examples per epoch.
74 |
75 | ## Get Started
76 | Have a look at the `get_started.sh` script in the `scripts` folder. Also, have a look at the README present in each folder.
77 |
78 | ## Feedback
79 | All kinds of feedback (code style, bugs, comments etc.) are welcome. Please open an issue on this repo instead of mailing me, since that helps me keep track of things better.
80 |
81 | ## License
82 | MIT
83 |
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | Download and unzip the VQA dataset from here:
2 | http://www.visualqa.org/
3 |
4 | or use the included `download.sh` script to do the same.
--------------------------------------------------------------------------------
/data/download.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Downloads the training and validation sets from visualqa.org.
3 |
4 | wget http://visualqa.org/data/mscoco/vqa/Questions_Train_mscoco.zip
5 | wget http://visualqa.org/data/mscoco/vqa/Questions_Val_mscoco.zip
6 | wget http://visualqa.org/data/mscoco/vqa/Annotations_Train_mscoco.zip
7 | wget http://visualqa.org/data/mscoco/vqa/Annotations_Val_mscoco.zip
8 |
9 | unzip \*.zip
--------------------------------------------------------------------------------
/data/preprocessed/README.md:
--------------------------------------------------------------------------------
1 | This is where all the text files are dumped by the dumpText.py script.
--------------------------------------------------------------------------------
/experiments/generate_submission_test.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 | import argparse
4 | from progressbar import Bar, ETA, Percentage, ProgressBar
5 | from keras.models import model_from_json
6 |
7 | from spacy.en import English
8 | import numpy as np
9 | import scipy.io
10 | from sklearn.externals import joblib
11 |
12 | sys.path.insert(0, '../scripts/')
13 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix
14 | from utils import grouper
15 |
16 | def main():
17 |
18 | parser = argparse.ArgumentParser()
19 | parser.add_argument('-model', type=str, required=True)
20 | parser.add_argument('-weights', type=str, required=True)
21 | parser.add_argument('-results', type=str, required=True)
22 | args = parser.parse_args()
23 |
24 | model = model_from_json(open(args.model).read())
25 | model.load_weights(args.weights)
26 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
27 |
28 | questions_test = open('../data/preprocessed/questions_test-dev2015.txt',
29 | 'r').read().decode('utf8').splitlines()
30 | questions_lengths_test = open('../data/preprocessed/questions_lengths_test-dev2015.txt',
31 | 'r').read().decode('utf8').splitlines()
32 | questions_id_test = open('../data/preprocessed/questions_id_test-dev2015.txt',
33 | 'r').read().decode('utf8').splitlines()
34 | images_test = open('../data/preprocessed/images_test-dev2015.txt',
35 | 'r').read().decode('utf8').splitlines()
36 | vgg_model_path = '../features/coco/vgg_feats_test.mat'
37 |
38 | questions_lengths_test, questions_test, images_test, questions_id_test = (list(t) for t in zip(*sorted(zip(questions_lengths_test, questions_test, images_test, questions_id_test))))
39 |
40 | print 'Model compiled, weights loaded'
41 | labelencoder = joblib.load('../models/labelencoder_trainval.pkl')
42 |
43 | features_struct = scipy.io.loadmat(vgg_model_path)
44 | VGGfeatures = features_struct['feats']
45 | print 'Loaded vgg features'
46 | image_ids = open('../features/coco_vgg_IDMap_test.txt').read().splitlines()
47 | img_map = {}
48 | for ids in image_ids:
49 | id_split = ids.split()
50 | img_map[id_split[0]] = int(id_split[1])
51 |
52 | nlp = English()
53 | print 'Loaded word2vec features'
54 |
55 | nb_classes = 1000
56 | y_predict_text = []
57 | batchSize = 128
58 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'),
59 | ' ', ETA()]
60 | pbar = ProgressBar(widgets=widgets)
61 |
62 | for qu_batch,im_batch in pbar(zip(grouper(questions_test, batchSize, fillvalue=questions_test[-1]),
63 | grouper(images_test, batchSize, fillvalue=images_test[-1]))):
64 | timesteps = len(nlp(qu_batch[-1])) # questions are sorted by length, so the last one in the batch sets the number of timesteps
65 | X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps)
66 | if 'language_only' in args.model:
67 | X_batch = X_q_batch
68 | else:
69 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures)
70 | X_batch = [X_q_batch, X_i_batch]
71 | y_predict = model.predict_classes(X_batch, verbose=0)
72 | y_predict_text.extend(labelencoder.inverse_transform(y_predict))
73 |
74 | results = []
75 |
76 | f1 = open(args.results, 'w')
77 | for prediction, question, question_id, image in zip(y_predict_text, questions_test, questions_id_test, images_test):
78 | answer = {}
79 | answer['question_id'] = int(question_id)
80 | answer['answer'] = prediction
81 | results.append(answer)
82 |
83 | f1.write(question.encode('utf-8'))
84 | f1.write('\n')
85 | f1.write(image.encode('utf-8'))
86 | f1.write('\n')
87 | f1.write(prediction)
88 | f1.write('\n')
89 | f1.write(question_id.encode('utf-8'))
90 | f1.write('\n')
91 | f1.write('\n')
92 |
93 | f1.close()
94 |
95 | f2 = open('../results/submission_test-dev2015.json', 'w')
96 | f2.write(json.dumps(results))
97 | f2.close()
98 | print 'Results saved to', args.results
99 |
100 | if __name__ == "__main__":
101 | main()
--------------------------------------------------------------------------------
/experiments/train_lstm_1_vqa_test.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import scipy.io
3 | import sys
4 | import argparse
5 |
6 | from keras.models import Sequential
7 | from keras.layers.core import Dense, Activation, Merge, Dropout, Reshape
8 | from keras.layers.recurrent import LSTM
9 | from keras.utils import np_utils, generic_utils
10 | from keras.callbacks import ModelCheckpoint, RemoteMonitor
11 |
12 | from sklearn.externals import joblib
13 | from sklearn import preprocessing
14 |
15 | from spacy.en import English
16 |
17 | sys.path.insert(0, '../scripts/')
18 | from utils import grouper, selectFrequentAnswers
19 | from features import get_images_matrix, get_answers_matrix, get_questions_tensor_timeseries
20 |
21 |
22 | def main():
23 | parser = argparse.ArgumentParser()
24 | parser.add_argument('-num_hidden_units_mlp', type=int, default=1024)
25 | parser.add_argument('-num_hidden_units_lstm', type=int, default=512)
26 | parser.add_argument('-num_hidden_layers_mlp', type=int, default=3)
27 | parser.add_argument('-num_hidden_layers_lstm', type=int, default=1)
28 | parser.add_argument('-dropout', type=float, default=0.5)
29 | parser.add_argument('-activation_mlp', type=str, default='tanh')
30 | parser.add_argument('-num_epochs', type=int, default=100)
31 | parser.add_argument('-model_save_interval', type=int, default=5)
32 | parser.add_argument('-batch_size', type=int, default=128)
33 | #TODO Feature parser.add_argument('-resume_training', type=str)
34 | #TODO Feature parser.add_argument('-language_only', type=bool, default= False)
35 | args = parser.parse_args()
36 |
37 | word_vec_dim= 300
38 | img_dim = 4096
39 | max_len = 30
40 | nb_classes = 1000
41 |
42 | #get the data
43 | questions_train = open('../data/preprocessed/questions_train2014.txt', 'r').read().decode('utf8').splitlines()
44 | questions_lengths_train = open('../data/preprocessed/questions_lengths_train2014.txt', 'r').read().decode('utf8').splitlines()
45 | answers_train = open('../data/preprocessed/answers_train2014_modal.txt', 'r').read().decode('utf8').splitlines()
46 | images_train = open('../data/preprocessed/images_train2014.txt', 'r').read().decode('utf8').splitlines()
47 |
48 | questions_val = open('../data/preprocessed/questions_val2014.txt', 'r').read().decode('utf8').splitlines()
49 | questions_lengths_val = open('../data/preprocessed/questions_lengths_val2014.txt', 'r').read().decode('utf8').splitlines()
50 | answers_val = open('../data/preprocessed/answers_val2014_modal.txt', 'r').read().decode('utf8').splitlines()
51 | images_val = open('../data/preprocessed/images_val2014.txt', 'r').read().decode('utf8').splitlines()
52 |
53 | questions_train = questions_train + questions_val
54 | questions_lengths_train = questions_lengths_train + questions_lengths_val
55 | answers_train = answers_train + answers_val
56 | images_train = images_train + images_val
57 |
58 | vgg_model_path = '../features/coco/vgg_feats.mat'
59 |
60 | max_answers = nb_classes
61 | questions_train, answers_train, images_train = selectFrequentAnswers(questions_train,answers_train,images_train, max_answers)
62 | questions_lengths_train, questions_train, answers_train, images_train = (list(t) for t in zip(*sorted(zip(questions_lengths_train, questions_train, answers_train, images_train))))
63 |
64 | #encode the remaining answers
65 | labelencoder = preprocessing.LabelEncoder()
66 | labelencoder.fit(answers_train)
67 | nb_classes = len(list(labelencoder.classes_))
68 | joblib.dump(labelencoder,'../models/labelencoder_trainval.pkl')
69 |
70 | image_model = Sequential()
71 | image_model.add(Reshape(input_shape = (img_dim,), dims=(img_dim,)))
72 |
73 | language_model = Sequential()
74 | if args.num_hidden_layers_lstm == 1:
75 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=False, input_shape=(max_len, word_vec_dim)))
76 | else:
77 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=True, input_shape=(max_len, word_vec_dim)))
78 | for i in xrange(args.num_hidden_layers_lstm-2):
79 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=True))
80 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=False))
81 |
82 | model = Sequential()
83 | model.add(Merge([language_model, image_model], mode='concat', concat_axis=1))
84 | for i in xrange(args.num_hidden_layers_mlp):
85 | model.add(Dense(args.num_hidden_units_mlp, init='uniform'))
86 | model.add(Activation(args.activation_mlp))
87 | model.add(Dropout(args.dropout))
88 | model.add(Dense(nb_classes))
89 | model.add(Activation('softmax'))
90 |
91 | json_string = model.to_json()
92 | model_file_name = '../models/FULL_lstm_1_num_hidden_units_lstm_' + str(args.num_hidden_units_lstm) + \
93 | '_num_hidden_units_mlp_' + str(args.num_hidden_units_mlp) + '_num_hidden_layers_mlp_' + \
94 | str(args.num_hidden_layers_mlp) + '_num_hidden_layers_lstm_' + str(args.num_hidden_layers_lstm)
95 | open(model_file_name + '.json', 'w').write(json_string)
96 |
97 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
98 | print 'Compilation done'
99 |
100 | features_struct = scipy.io.loadmat(vgg_model_path)
101 | VGGfeatures = features_struct['feats']
102 | print 'loaded vgg features'
103 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines()
104 | img_map = {}
105 | for ids in image_ids:
106 | id_split = ids.split()
107 | img_map[id_split[0]] = int(id_split[1])
108 |
109 | nlp = English()
110 | print 'loaded word2vec features...'
111 | ## training
112 | print 'Training started...'
113 | for k in xrange(args.num_epochs):
114 |
115 | progbar = generic_utils.Progbar(len(questions_train))
116 |
117 | for qu_batch,an_batch,im_batch in zip(grouper(questions_train, args.batch_size, fillvalue=questions_train[-1]),
118 | grouper(answers_train, args.batch_size, fillvalue=answers_train[-1]),
119 | grouper(images_train, args.batch_size, fillvalue=images_train[-1])):
120 | timesteps = len(nlp(qu_batch[-1])) # questions are sorted by length, so the last one in the batch sets the number of timesteps
121 | X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps)
122 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures)
123 | Y_batch = get_answers_matrix(an_batch, labelencoder)
124 | loss = model.train_on_batch([X_q_batch, X_i_batch], Y_batch)
125 | progbar.add(args.batch_size, values=[("train loss", loss)])
126 |
127 |
128 | if k%args.model_save_interval == 0:
129 | model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k))
130 |
131 | model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k))
132 |
133 | if __name__ == "__main__":
134 | main()
--------------------------------------------------------------------------------
/features/README.md:
--------------------------------------------------------------------------------
1 | Download and unzip the features from here:
2 | http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip
--------------------------------------------------------------------------------
/features/download.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Downloads and unzips the VGG features computed on the COCO dataset.
3 |
4 | wget http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip
5 | unzip coco.zip -d .
--------------------------------------------------------------------------------
/models/README.md:
--------------------------------------------------------------------------------
1 | This folder will contain all the model configurations (JSON files) and all the model weights (HDF5 files).
2 |
--------------------------------------------------------------------------------
/models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json:
--------------------------------------------------------------------------------
1 | {"layers": [{"layers": [{"layers": [{"truncate_gradient": -1, "name": "LSTM", "inner_activation": "hard_sigmoid", "activation": "tanh", "input_shape": [30, 300], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": null, "return_sequences": false, "output_dim": 512, "forget_bias_init": "one", "input_length": null}], "name": "Sequential"}, {"layers": [{"dims": [4096], "name": "Reshape", "input_shape": [4096]}], "name": "Sequential"}], "mode": "concat", "name": "Merge", "concat_axis": 1}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1000}, {"beta": 0.1, "activation": "softmax", "name": "Activation", "target": 0}], "name": "Sequential"}
--------------------------------------------------------------------------------
/models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avisingh599/visual-qa/99be95d61bf9302495e741fa53cf63b7e9a91a35/models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5
--------------------------------------------------------------------------------
/results/README.md:
--------------------------------------------------------------------------------
1 | This folder contains the predictions made by the different models. The ```overall_results.txt``` file contains the performance of all the individual models.
--------------------------------------------------------------------------------
/scripts/README.md:
--------------------------------------------------------------------------------
1 | Here is the utility of the various files:
2 |
3 | 0. `demo_batch.py`: Runs the demo. You need access to the pre-trained models (included in the repo) to run this example.
4 |
5 | 1. `get_started.sh`: Downloads data, VQAtools, pre-computed features, and trains a model. Run this script when you are done with the dependencies.
6 |
7 | 2. `dumpText.py`: Dumps the questions and answers from the VQA json files to some text files for later ease of use. Run `python dumpText.py -h` for more info.
8 |
9 | 3. `trainMLP.py`: Trains Multi-Layer perceptrons. Run `python trainMLP.py -h` for more info.
10 |
11 | 4. `trainLSTM_1.py`: Trains LSTM-based model. Run `python trainLSTM_1.py -h` for more info.
12 |
13 | 5. `trainLSTM_language.py`: Trains the LSTM-based language-only model. Run `python trainLSTM_language.py -h` for more info.
14 |
15 | 6. `evaluateMLP.py`: Evaluates models trained by `trainMLP.py`. Needs the model json file, the hdf5 weights file, and an output txt file destination to run.
16 |
17 | 7. `evaluateLSTM.py`: Evaluates models trained by `trainLSTM_1.py` and `trainLSTM_language.py`. Needs the model json file, the hdf5 weights file, and an output txt file destination to run.
18 |
19 | 8. `features.py`: Contains the functions used to convert images and words to vectors (or sequences of vectors). A short usage sketch is given at the end of this readme.
20 |
21 | 9. `utils.py`: Exactly what you think.
22 |
23 | 10. `own_image.py`: Use your own image. A Caffe installation is required.
24 |
25 | 11. `extract_features.py`: Extracts 4096-D VGG features from an image using a VGG Caffe model.
26 |
27 | 12. `vgg_features.prototxt`: The VGG Caffe model definition.
28 |
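29 | As a rough usage sketch of `features.py` (the question text, image id, and the id-to-column mapping below are made up; the calls match how `trainLSTM_1.py` and the evaluation scripts use these helpers):
30 |
31 |     from spacy.en import English
32 |     import scipy.io
33 |     from features import get_questions_tensor_timeseries, get_images_matrix
34 |
35 |     nlp = English()
36 |     VGGfeatures = scipy.io.loadmat('../features/coco/vgg_feats.mat')['feats']
37 |     img_map = {'391895': 0}   # hypothetical: COCO image id -> column index in VGGfeatures
38 |     qu_batch = [u'What color is the bus?']
39 |     im_batch = ['391895']
40 |     timesteps = len(nlp(qu_batch[-1]))
41 |     X_q = get_questions_tensor_timeseries(qu_batch, nlp, timesteps)  # (batch, timesteps, word_vec_dim)
42 |     X_i = get_images_matrix(im_batch, img_map, VGGfeatures)          # (batch, 4096)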
--------------------------------------------------------------------------------
/scripts/demo_batch.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import random
3 | from PIL import Image
4 | import subprocess
5 | from os import listdir
6 | from os.path import isfile, join
7 |
8 | from keras.models import model_from_json
9 |
10 | from spacy.en import English
11 | import numpy as np
12 | import scipy.io
13 | from sklearn.externals import joblib
14 |
15 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix
16 |
17 | def main():
18 | '''
19 | Before running this demo, ensure that you have some images from the MS COCO validation set
20 | saved somewhere, and update the image_dir variable accordingly.
21 | Also, this demo is designed to run with the models released with the visual-qa repo; if you
22 | would like to use it with some other model (say an MLP-based model or a language-only model),
23 | you will have to make some changes.
24 | '''
25 | image_dir = '../../vqa_images/'
26 | local_images = [ f for f in listdir(image_dir) if isfile(join(image_dir,f)) ]
27 |
28 | parser = argparse.ArgumentParser()
29 | parser.add_argument('-model', type=str, default='../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json')
30 | parser.add_argument('-weights', type=str, default='../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5')
31 | parser.add_argument('-sample_size', type=int, default=25)
32 | args = parser.parse_args()
33 |
34 | model = model_from_json(open(args.model).read())
35 | model.load_weights(args.weights)
36 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
37 | print 'Model loaded and compiled'
38 | images_val = open('../data/preprocessed/images_val2014.txt',
39 | 'r').read().decode('utf8').splitlines()
40 |
41 | nlp = English()
42 | print 'Loaded word2vec features'
43 | labelencoder = joblib.load('../models/labelencoder.pkl')
44 |
45 | vgg_model_path = '../features/coco/vgg_feats.mat'
46 | features_struct = scipy.io.loadmat(vgg_model_path)
47 | VGGfeatures = features_struct['feats']
48 | print 'Loaded vgg features'
49 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines()
50 | img_map = {}
51 | for ids in image_ids:
52 | id_split = ids.split()
53 | img_map[id_split[0]] = int(id_split[1])
54 |
55 | image_sample = random.sample(local_images, args.sample_size)
56 |
57 | for image in image_sample:
58 | p = subprocess.Popen(["display", image_dir + image])
59 | q = unicode(raw_input("Ask a question about the image:"))
60 | coco_id = str(int(image[-16:-4]))
61 | timesteps = len(nlp(q)) # the number of tokens in the question sets the number of timesteps
62 | X_q = get_questions_tensor_timeseries([q], nlp, timesteps)
63 | X_i = get_images_matrix([coco_id], img_map, VGGfeatures)
64 | X = [X_q, X_i]
65 | y_predict = model.predict_classes(X, verbose=0)
66 | print labelencoder.inverse_transform(y_predict)
67 | raw_input('Press enter to continue...')
68 | p.kill()
69 |
70 | if __name__ == "__main__":
71 | main()
72 |
--------------------------------------------------------------------------------
/scripts/dumpText.py:
--------------------------------------------------------------------------------
1 | import operator
2 | import argparse
3 | import progressbar
4 | import json
5 | from spacy.en import English
6 |
7 | def getModalAnswer(answers):
8 | candidates = {}
9 | for i in xrange(10):
10 | candidates[answers[i]['answer']] = 1
11 |
12 | for i in xrange(10):
13 | candidates[answers[i]['answer']] += 1
14 |
15 | return max(candidates.iteritems(), key=operator.itemgetter(1))[0]
16 |
17 | def getAllAnswer(answers):
18 | answer_list = []
19 | for i in xrange(10):
20 | answer_list.append(answers[i]['answer'])
21 |
22 | return ';'.join(answer_list)
23 |
24 | def main():
25 | parser = argparse.ArgumentParser()
26 | parser.add_argument('-split', type=str, default='train',
27 | help='Specify which part of the dataset you want to dump to text. Your options are: train, val, test, test-dev')
28 | parser.add_argument('-answers', type=str, default='modal',
29 | help='Specify if you want to dump just the most frequent answer for each questions (modal), or all the answers (all)')
30 | args = parser.parse_args()
31 |
32 | nlp = English() #used for counting the number of tokens
33 |
34 | if args.split == 'train':
35 | annFile = '../data/mscoco_train2014_annotations.json'
36 | quesFile = '../data/OpenEnded_mscoco_train2014_questions.json'
37 | questions_file = open('../data/preprocessed/questions_train2014.txt', 'w')
38 | questions_id_file = open('../data/preprocessed/questions_id_train2014.txt', 'w')
39 | questions_lengths_file = open('../data/preprocessed/questions_lengths_train2014.txt', 'w')
40 | if args.answers == 'modal':
41 | answers_file = open('../data/preprocessed/answers_train2014_modal.txt', 'w')
42 | elif args.answers == 'all':
43 | answers_file = open('../data/preprocessed/answers_train2014_all.txt', 'w')
44 | coco_image_id = open('../data/preprocessed/images_train2014.txt', 'w')
45 | data_split = 'training data'
46 | elif args.split == 'val':
47 | annFile = '../data/mscoco_val2014_annotations.json'
48 | quesFile = '../data/OpenEnded_mscoco_val2014_questions.json'
49 | questions_file = open('../data/preprocessed/questions_val2014.txt', 'w')
50 | questions_id_file = open('../data/preprocessed/questions_id_val2014.txt', 'w')
51 | questions_lengths_file = open('../data/preprocessed/questions_lengths_val2014.txt', 'w')
52 | if args.answers == 'modal':
53 | answers_file = open('../data/preprocessed/answers_val2014_modal.txt', 'w')
54 | elif args.answers == 'all':
55 | answers_file = open('../data/preprocessed/answers_val2014_all.txt', 'w')
56 | coco_image_id = open('../data/preprocessed/images_val2014.txt', 'w')
57 | data_split = 'validation data'
58 | elif args.split == 'test-dev':
59 | quesFile = '../data/OpenEnded_mscoco_test-dev2015_questions.json'
60 | questions_file = open('../data/preprocessed/questions_test-dev2015.txt', 'w')
61 | questions_id_file = open('../data/preprocessed/questions_id_test-dev2015.txt', 'w')
62 | questions_lengths_file = open('../data/preprocessed/questions_lengths_test-dev2015.txt', 'w')
63 | coco_image_id = open('../data/preprocessed/images_test-dev2015.txt', 'w')
64 | data_split = 'test-dev data'
65 | elif args.split == 'test':
66 | quesFile = '../data/OpenEnded_mscoco_test2015_questions.json'
67 | questions_file = open('../data/preprocessed/questions_test2015.txt', 'w')
68 | questions_id_file = open('../data/preprocessed/questions_id_test2015.txt', 'w')
69 | questions_lengths_file = open('../data/preprocessed/questions_lengths_test2015.txt', 'w')
70 | coco_image_id = open('../data/preprocessed/images_test2015.txt', 'w')
71 | data_split = 'test data'
72 | else:
73 | raise RuntimeError('Incorrect split. Your choices are:\ntrain\nval\ntest-dev\ntest')
74 |
75 | #initialize VQA api for QA annotations
76 | #vqa=VQA(annFile, quesFile)
77 | questions = json.load(open(quesFile, 'r'))
78 | ques = questions['questions']
79 | if args.split == 'train' or args.split == 'val':
80 | qa = json.load(open(annFile, 'r'))
81 | qa = qa['annotations']
82 |
83 | pbar = progressbar.ProgressBar()
84 | print 'Dumping questions, answers, questionIDs, imageIDs, and questions lengths to text files...'
85 | for i, q in pbar(zip(xrange(len(ques)),ques)):
86 | questions_file.write((q['question'] + '\n').encode('utf8'))
87 | questions_lengths_file.write((str(len(nlp(q['question'])))+ '\n').encode('utf8'))
88 | questions_id_file.write((str(q['question_id']) + '\n').encode('utf8'))
89 | coco_image_id.write((str(q['image_id']) + '\n').encode('utf8'))
90 | if args.split == 'train' or args.split == 'val':
91 | if args.answers == 'modal':
92 | answers_file.write(getModalAnswer(qa[i]['answers']).encode('utf8'))
93 | elif args.answers == 'all':
94 | answers_file.write(getAllAnswer(qa[i]['answers']).encode('utf8'))
95 | answers_file.write('\n'.encode('utf8'))
96 |
97 | print 'completed dumping', data_split
98 |
99 | if __name__ == "__main__":
100 | main()
--------------------------------------------------------------------------------
/scripts/evaluateLSTM.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | from progressbar import Bar, ETA, Percentage, ProgressBar
3 | from keras.models import model_from_json
4 |
5 | from spacy.en import English
6 | import numpy as np
7 | import scipy.io
8 | from sklearn.externals import joblib
9 |
10 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix
11 | from utils import grouper
12 |
13 | def main():
14 |
15 | parser = argparse.ArgumentParser()
16 | parser.add_argument('-model', type=str, required=True)
17 | parser.add_argument('-weights', type=str, required=True)
18 | parser.add_argument('-results', type=str, required=True)
19 | args = parser.parse_args()
20 |
21 | model = model_from_json(open(args.model).read())
22 | model.load_weights(args.weights)
23 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
24 |
25 | questions_val = open('../data/preprocessed/questions_val2014.txt',
26 | 'r').read().decode('utf8').splitlines()
27 | questions_lengths_val = open('../data/preprocessed/questions_lengths_val2014.txt',
28 | 'r').read().decode('utf8').splitlines()
29 | answers_val = open('../data/preprocessed/answers_val2014_all.txt',
30 | 'r').read().decode('utf8').splitlines()
31 | images_val = open('../data/preprocessed/images_val2014.txt',
32 | 'r').read().decode('utf8').splitlines()
33 | vgg_model_path = '../features/coco/vgg_feats.mat'
34 |
35 | questions_lengths_val, questions_val, answers_val, images_val = (list(t) for t in zip(*sorted(zip(questions_lengths_val, questions_val, answers_val, images_val))))
36 |
37 | print 'Model compiled, weights loaded'
38 | labelencoder = joblib.load('../models/labelencoder.pkl')
39 |
40 | features_struct = scipy.io.loadmat(vgg_model_path)
41 | VGGfeatures = features_struct['feats']
42 | print 'Loaded vgg features'
43 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines()
44 | img_map = {}
45 | for ids in image_ids:
46 | id_split = ids.split()
47 | img_map[id_split[0]] = int(id_split[1])
48 |
49 | nlp = English()
50 | print 'Loaded word2vec features'
51 |
52 | nb_classes = 1000
53 | y_predict_text = []
54 | batchSize = 128
55 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'),
56 | ' ', ETA()]
57 | pbar = ProgressBar(widgets=widgets)
58 |
59 | for qu_batch,an_batch,im_batch in pbar(zip(grouper(questions_val, batchSize, fillvalue=questions_val[0]),
60 | grouper(answers_val, batchSize, fillvalue=answers_val[0]),
61 | grouper(images_val, batchSize, fillvalue=images_val[0]))):
62 | timesteps = len(nlp(qu_batch[-1])) # questions are sorted by length, so the last one in the batch sets the number of timesteps
63 | X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps)
64 | if 'language_only' in args.model:
65 | X_batch = X_q_batch
66 | else:
67 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures)
68 | X_batch = [X_q_batch, X_i_batch]
69 | y_predict = model.predict_classes(X_batch, verbose=0)
70 | y_predict_text.extend(labelencoder.inverse_transform(y_predict))
71 |
72 | total = 0
73 | correct_val=0.0
74 | f1 = open(args.results, 'w')
75 | for prediction, truth, question, image in zip(y_predict_text, answers_val, questions_val, images_val):
76 | temp_count=0
77 | for _truth in truth.split(';'):
78 | if prediction == _truth:
79 | temp_count+=1
80 |
81 | if temp_count>2:
82 | correct_val+=1
83 | else:
84 | correct_val+=float(temp_count)/3
85 |
86 | total+=1
87 |
88 | f1.write(question.encode('utf-8'))
89 | f1.write('\n')
90 | f1.write(image.encode('utf-8'))
91 | f1.write('\n')
92 | f1.write(prediction)
93 | f1.write('\n')
94 | f1.write(truth.encode('utf-8'))
95 | f1.write('\n')
96 | f1.write('\n')
97 |
98 | f1.write('Final Accuracy is ' + str(correct_val/total))
99 | f1.close()
100 | f1 = open('../results/overall_results.txt', 'a')
101 | f1.write(args.weights + '\n')
102 | f1.write(str(correct_val/total) + '\n\n')
103 | f1.close()
104 | print 'Final Accuracy on the validation set is', correct_val/total
105 |
106 | if __name__ == "__main__":
107 | main()
--------------------------------------------------------------------------------
/scripts/evaluateMLP.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import argparse
3 | from progressbar import Bar, ETA, Percentage, ProgressBar
4 | from keras.models import model_from_json
5 |
6 | from spacy.en import English
7 | import numpy as np
8 | import scipy.io
9 | from sklearn.externals import joblib
10 |
11 | from features import get_questions_matrix_sum, get_images_matrix, get_answers_matrix
12 | from utils import grouper
13 |
14 | def main():
15 | parser = argparse.ArgumentParser()
16 | parser.add_argument('-model', type=str, required=True)
17 | parser.add_argument('-weights', type=str, required=True)
18 | parser.add_argument('-results', type=str, required=True)
19 | args = parser.parse_args()
20 |
21 | model = model_from_json(open(args.model).read())
22 | model.load_weights(args.weights)
23 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
24 |
25 | questions_val = open('../data/preprocessed/questions_val2014.txt',
26 | 'r').read().decode('utf8').splitlines()
27 | answers_val = open('../data/preprocessed/answers_val2014_all.txt',
28 | 'r').read().decode('utf8').splitlines()
29 | images_val = open('../data/preprocessed/images_val2014.txt',
30 | 'r').read().decode('utf8').splitlines()
31 | vgg_model_path = '../features/coco/vgg_feats.mat'
32 |
33 | print 'Model compiled, weights loaded...'
34 | labelencoder = joblib.load('../models/labelencoder.pkl')
35 |
36 | features_struct = scipy.io.loadmat(vgg_model_path)
37 | VGGfeatures = features_struct['feats']
38 | print 'loaded vgg features'
39 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines()
40 | img_map = {}
41 | for ids in image_ids:
42 | id_split = ids.split()
43 | img_map[id_split[0]] = int(id_split[1])
44 |
45 | nlp = English()
46 | print 'loaded word2vec features'
47 |
48 | nb_classes = 1000
49 | y_predict_text = []
50 | batchSize = 128
51 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'),
52 | ' ', ETA()]
53 | pbar = ProgressBar(widgets=widgets)
54 |
55 | for qu_batch,an_batch,im_batch in pbar(zip(grouper(questions_val, batchSize, fillvalue=questions_val[0]),
56 | grouper(answers_val, batchSize, fillvalue=answers_val[0]),
57 | grouper(images_val, batchSize, fillvalue=images_val[0]))):
58 | X_q_batch = get_questions_matrix_sum(qu_batch, nlp)
59 | if 'language_only' in args.model:
60 | X_batch = X_q_batch
61 | else:
62 | X_i_batch = get_images_matrix(im_batch, img_map , VGGfeatures)
63 | X_batch = np.hstack((X_q_batch, X_i_batch))
64 | y_predict = model.predict_classes(X_batch, verbose=0)
65 | y_predict_text.extend(labelencoder.inverse_transform(y_predict))
66 |
67 | correct_val=0.0
68 | total=0
69 | f1 = open(args.results, 'w')
70 |
71 | for prediction, truth, question, image in zip(y_predict_text, answers_val, questions_val, images_val):
72 | temp_count=0
73 | for _truth in truth.split(';'):
74 | if prediction == _truth:
75 | temp_count+=1
76 |
77 | if temp_count>2:
78 | correct_val+=1
79 | else:
80 | correct_val+= float(temp_count)/3
81 |
82 | total+=1
83 | f1.write(question.encode('utf-8'))
84 | f1.write('\n')
85 | f1.write(image.encode('utf-8'))
86 | f1.write('\n')
87 | f1.write(prediction)
88 | f1.write('\n')
89 | f1.write(truth.encode('utf-8'))
90 | f1.write('\n')
91 | f1.write('\n')
92 |
93 | f1.write('Final Accuracy is ' + str(correct_val/total))
94 | f1.close()
95 | f1 = open('../results/overall_results.txt', 'a')
96 | f1.write(args.weights + '\n')
97 | f1.write(str(correct_val/total) + '\n')
98 | f1.close()
99 | print 'Final Accuracy on the validation set is', correct_val/total
100 |
101 | if __name__ == "__main__":
102 | main()
103 |
--------------------------------------------------------------------------------
/scripts/extract_features.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os.path
3 | import argparse
4 |
5 | import numpy as np
6 | from scipy.misc import imread, imresize
7 | import scipy.io
8 |
9 | parser = argparse.ArgumentParser()
10 | parser.add_argument('--caffe', help='path to caffe installation')
11 | parser.add_argument('--model_def', help='path to model definition prototxt')
12 | parser.add_argument('--model', help='path to model parameters')
13 | parser.add_argument('--gpu', action='store_true', help='whether to use gpu')
14 | parser.add_argument('--image', help='path to image')
15 |
16 | args = parser.parse_args()
17 |
18 | if args.caffe:
19 | caffepath = args.caffe + '/python'
20 | sys.path.append(caffepath)
21 |
22 | import caffe
23 |
24 | def predict(in_data, net):
25 |
26 | out = net.forward(**{net.inputs[0]: in_data})
27 | features = out[net.outputs[0]]
28 | return features
29 |
30 |
31 | def batch_predict(filenames, net):
32 |
33 | N, C, H, W = net.blobs[net.inputs[0]].data.shape
34 | F = net.blobs[net.outputs[0]].data.shape[1]
35 | Nf = len(filenames)
36 | Hi, Wi, _ = imread(filenames[0]).shape
37 | allftrs = np.zeros((Nf, F))
38 | for i in range(0, Nf, N):
39 | in_data = np.zeros((N, C, H, W), dtype=np.float32)
40 |
41 | batch_range = range(i, min(i+N, Nf))
42 | batch_filenames = [filenames[j] for j in batch_range]
43 | Nb = len(batch_range)
44 |
45 | batch_images = np.zeros((Nb, 3, H, W))
46 | for j,fname in enumerate(batch_filenames):
47 | im = imread(fname)
48 | if len(im.shape) == 2:
49 | im = np.tile(im[:,:,np.newaxis], (1,1,3))
50 | # RGB -> BGR
51 | im = im[:,:,(2,1,0)]
52 | # mean subtraction
53 | im = im - np.array([103.939, 116.779, 123.68])
54 | # resize
55 | im = imresize(im, (H, W), 'bicubic')
56 | # get channel in correct dimension
57 | im = np.transpose(im, (2, 0, 1))
58 | batch_images[j,:,:,:] = im
59 |
60 | # insert into correct place
61 | in_data[0:len(batch_range), :, :, :] = batch_images
62 |
63 | # predict features
64 | ftrs = predict(in_data, net)
65 |
66 | for j in range(len(batch_range)):
67 | allftrs[i+j,:] = ftrs[j,:]
68 |
69 | print 'Done %d/%d files' % (i+len(batch_range), len(filenames))
70 |
71 | return allftrs
72 |
73 |
74 | if args.gpu:
75 | caffe.set_mode_gpu()
76 | else:
77 | caffe.set_mode_cpu()
78 |
79 | net = caffe.Net(args.model_def, args.model, caffe.TEST)
80 |
81 | base_dir = os.path.dirname(args.image)
82 |
83 | allftrs = batch_predict([args.image], net)
84 |
85 | scipy.io.savemat(os.path.join(base_dir, 'vgg_feats.mat'), mdict = {'feats': np.transpose(allftrs)})
86 |
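87 | # Note: the saved vgg_feats.mat stores the features transposed, i.e. a (4096, num_images)
88 | # array under the key 'feats', which is the layout the training and evaluation scripts
89 | # read back via scipy.io.loadmat(...)['feats'].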
--------------------------------------------------------------------------------
/scripts/features.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from keras.utils import np_utils
3 |
4 |
5 | def get_questions_tensor_timeseries(questions, nlp, timesteps):
6 | '''
7 | Returns a time series of word vectors for tokens in the question
8 |
9 | Input:
10 | questions: list of unicode objects
11 | nlp: an instance of the class English() from spacy.en
12 | timesteps: the number of
13 |
14 | Output:
15 | A numpy ndarray of shape: (nb_samples, timesteps, word_vec_dim)
16 | '''
17 | assert not isinstance(questions, basestring)
18 | nb_samples = len(questions)
19 | word_vec_dim = nlp(questions[0])[0].vector.shape[0]
20 | questions_tensor = np.zeros((nb_samples, timesteps, word_vec_dim))
21 | for i in xrange(len(questions)):
22 | tokens = nlp(questions[i])
23 | for j in xrange(len(tokens)):
24 | if j<timesteps:
25 | questions_tensor[i,j,:] = tokens[j].vector
26 |
27 | return questions_tensor
--------------------------------------------------------------------------------
/scripts/trainMLP.py:
--------------------------------------------------------------------------------
67 | if args.dropout>0:
68 | model.add(Dropout(args.dropout))
69 | for i in xrange(args.num_hidden_layers-1):
70 | model.add(Dense(args.num_hidden_units, init='uniform'))
71 | model.add(Activation(args.activation))
72 | if args.dropout>0:
73 | model.add(Dropout(args.dropout))
74 | model.add(Dense(nb_classes, init='uniform'))
75 | model.add(Activation('softmax'))
76 |
77 | json_string = model.to_json()
78 | if args.language_only:
79 | model_file_name = '../models/mlp_language_only_num_hidden_units_' + str(args.num_hidden_units) + '_num_hidden_layers_' + str(args.num_hidden_layers)
80 | else:
81 | model_file_name = '../models/mlp_num_hidden_units_' + str(args.num_hidden_units) + '_num_hidden_layers_' + str(args.num_hidden_layers)
82 | open(model_file_name + '.json', 'w').write(json_string)
83 |
84 | print 'Compiling model...'
85 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
86 | print 'Compilation done...'
87 |
88 | print 'Training started...'
89 | for k in xrange(args.num_epochs):
90 | #shuffle the data points before going through them
91 | index_shuf = range(len(questions_train))
92 | shuffle(index_shuf)
93 | questions_train = [questions_train[i] for i in index_shuf]
94 | answers_train = [answers_train[i] for i in index_shuf]
95 | images_train = [images_train[i] for i in index_shuf]
96 | progbar = generic_utils.Progbar(len(questions_train))
97 | for qu_batch,an_batch,im_batch in zip(grouper(questions_train, args.batch_size, fillvalue=questions_train[-1]),
98 | grouper(answers_train, args.batch_size, fillvalue=answers_train[-1]),
99 | grouper(images_train, args.batch_size, fillvalue=images_train[-1])):
100 | X_q_batch = get_questions_matrix_sum(qu_batch, nlp)
101 | if args.language_only:
102 | X_batch = X_q_batch
103 | else:
104 | X_i_batch = get_images_matrix(im_batch, id_map, VGGfeatures)
105 | X_batch = np.hstack((X_q_batch, X_i_batch))
106 | Y_batch = get_answers_matrix(an_batch, labelencoder)
107 | loss = model.train_on_batch(X_batch, Y_batch)
108 | progbar.add(args.batch_size, values=[("train loss", loss)])
109 | #print type(loss)
110 | if k%args.model_save_interval == 0:
111 | model.save_weights(model_file_name + '_epoch_{:02d}.hdf5'.format(k))
112 |
113 | model.save_weights(model_file_name + '_epoch_{:02d}.hdf5'.format(k))
114 |
115 | if __name__ == "__main__":
116 | main()
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
1 | import operator
2 | from itertools import izip_longest
3 | from collections import defaultdict
4 |
5 | def selectFrequentAnswers(questions_train, answers_train, images_train, maxAnswers):
6 | answer_fq= defaultdict(int)
7 | #build a dictionary of answers
8 | for answer in answers_train:
9 | answer_fq[answer] += 1
10 |
11 | sorted_fq = sorted(answer_fq.items(), key=operator.itemgetter(1), reverse=True)[0:maxAnswers]
12 | top_answers, top_fq = zip(*sorted_fq)
13 | new_answers_train=[]
14 | new_questions_train=[]
15 | new_images_train=[]
16 | #only those answers which appear in the top 1K are used for training
17 | for answer,question,image in zip(answers_train, questions_train, images_train):
18 | if answer in top_answers:
19 | new_answers_train.append(answer)
20 | new_questions_train.append(question)
21 | new_images_train.append(image)
22 |
23 | return (new_questions_train,new_answers_train,new_images_train)
24 |
25 | def grouper(iterable, n, fillvalue=None):
26 | args = [iter(iterable)] * n
27 | return izip_longest(*args, fillvalue=fillvalue)
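28 |
29 | # Usage sketch (made-up values): grouper([1, 2, 3, 4, 5], 2, fillvalue=5) yields
30 | # (1, 2), (3, 4), (5, 5). The training scripts use this to cut the question/answer/image
31 | # lists into fixed-size batches, padding the last batch with the final example.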
--------------------------------------------------------------------------------
/scripts/vgg_features.prototxt:
--------------------------------------------------------------------------------
1 | name: "VGG_ILSVRC_16_layers"
2 | input: "data"
3 | input_dim: 10
4 | input_dim: 3
5 | input_dim: 224
6 | input_dim: 224
7 | layers {
8 | bottom: "data"
9 | top: "conv1_1"
10 | name: "conv1_1"
11 | type: CONVOLUTION
12 | convolution_param {
13 | num_output: 64
14 | pad: 1
15 | kernel_size: 3
16 | }
17 | }
18 | layers {
19 | bottom: "conv1_1"
20 | top: "conv1_1"
21 | name: "relu1_1"
22 | type: RELU
23 | }
24 | layers {
25 | bottom: "conv1_1"
26 | top: "conv1_2"
27 | name: "conv1_2"
28 | type: CONVOLUTION
29 | convolution_param {
30 | num_output: 64
31 | pad: 1
32 | kernel_size: 3
33 | }
34 | }
35 | layers {
36 | bottom: "conv1_2"
37 | top: "conv1_2"
38 | name: "relu1_2"
39 | type: RELU
40 | }
41 | layers {
42 | bottom: "conv1_2"
43 | top: "pool1"
44 | name: "pool1"
45 | type: POOLING
46 | pooling_param {
47 | pool: MAX
48 | kernel_size: 2
49 | stride: 2
50 | }
51 | }
52 | layers {
53 | bottom: "pool1"
54 | top: "conv2_1"
55 | name: "conv2_1"
56 | type: CONVOLUTION
57 | convolution_param {
58 | num_output: 128
59 | pad: 1
60 | kernel_size: 3
61 | }
62 | }
63 | layers {
64 | bottom: "conv2_1"
65 | top: "conv2_1"
66 | name: "relu2_1"
67 | type: RELU
68 | }
69 | layers {
70 | bottom: "conv2_1"
71 | top: "conv2_2"
72 | name: "conv2_2"
73 | type: CONVOLUTION
74 | convolution_param {
75 | num_output: 128
76 | pad: 1
77 | kernel_size: 3
78 | }
79 | }
80 | layers {
81 | bottom: "conv2_2"
82 | top: "conv2_2"
83 | name: "relu2_2"
84 | type: RELU
85 | }
86 | layers {
87 | bottom: "conv2_2"
88 | top: "pool2"
89 | name: "pool2"
90 | type: POOLING
91 | pooling_param {
92 | pool: MAX
93 | kernel_size: 2
94 | stride: 2
95 | }
96 | }
97 | layers {
98 | bottom: "pool2"
99 | top: "conv3_1"
100 | name: "conv3_1"
101 | type: CONVOLUTION
102 | convolution_param {
103 | num_output: 256
104 | pad: 1
105 | kernel_size: 3
106 | }
107 | }
108 | layers {
109 | bottom: "conv3_1"
110 | top: "conv3_1"
111 | name: "relu3_1"
112 | type: RELU
113 | }
114 | layers {
115 | bottom: "conv3_1"
116 | top: "conv3_2"
117 | name: "conv3_2"
118 | type: CONVOLUTION
119 | convolution_param {
120 | num_output: 256
121 | pad: 1
122 | kernel_size: 3
123 | }
124 | }
125 | layers {
126 | bottom: "conv3_2"
127 | top: "conv3_2"
128 | name: "relu3_2"
129 | type: RELU
130 | }
131 | layers {
132 | bottom: "conv3_2"
133 | top: "conv3_3"
134 | name: "conv3_3"
135 | type: CONVOLUTION
136 | convolution_param {
137 | num_output: 256
138 | pad: 1
139 | kernel_size: 3
140 | }
141 | }
142 | layers {
143 | bottom: "conv3_3"
144 | top: "conv3_3"
145 | name: "relu3_3"
146 | type: RELU
147 | }
148 | layers {
149 | bottom: "conv3_3"
150 | top: "pool3"
151 | name: "pool3"
152 | type: POOLING
153 | pooling_param {
154 | pool: MAX
155 | kernel_size: 2
156 | stride: 2
157 | }
158 | }
159 | layers {
160 | bottom: "pool3"
161 | top: "conv4_1"
162 | name: "conv4_1"
163 | type: CONVOLUTION
164 | convolution_param {
165 | num_output: 512
166 | pad: 1
167 | kernel_size: 3
168 | }
169 | }
170 | layers {
171 | bottom: "conv4_1"
172 | top: "conv4_1"
173 | name: "relu4_1"
174 | type: RELU
175 | }
176 | layers {
177 | bottom: "conv4_1"
178 | top: "conv4_2"
179 | name: "conv4_2"
180 | type: CONVOLUTION
181 | convolution_param {
182 | num_output: 512
183 | pad: 1
184 | kernel_size: 3
185 | }
186 | }
187 | layers {
188 | bottom: "conv4_2"
189 | top: "conv4_2"
190 | name: "relu4_2"
191 | type: RELU
192 | }
193 | layers {
194 | bottom: "conv4_2"
195 | top: "conv4_3"
196 | name: "conv4_3"
197 | type: CONVOLUTION
198 | convolution_param {
199 | num_output: 512
200 | pad: 1
201 | kernel_size: 3
202 | }
203 | }
204 | layers {
205 | bottom: "conv4_3"
206 | top: "conv4_3"
207 | name: "relu4_3"
208 | type: RELU
209 | }
210 | layers {
211 | bottom: "conv4_3"
212 | top: "pool4"
213 | name: "pool4"
214 | type: POOLING
215 | pooling_param {
216 | pool: MAX
217 | kernel_size: 2
218 | stride: 2
219 | }
220 | }
221 | layers {
222 | bottom: "pool4"
223 | top: "conv5_1"
224 | name: "conv5_1"
225 | type: CONVOLUTION
226 | convolution_param {
227 | num_output: 512
228 | pad: 1
229 | kernel_size: 3
230 | }
231 | }
232 | layers {
233 | bottom: "conv5_1"
234 | top: "conv5_1"
235 | name: "relu5_1"
236 | type: RELU
237 | }
238 | layers {
239 | bottom: "conv5_1"
240 | top: "conv5_2"
241 | name: "conv5_2"
242 | type: CONVOLUTION
243 | convolution_param {
244 | num_output: 512
245 | pad: 1
246 | kernel_size: 3
247 | }
248 | }
249 | layers {
250 | bottom: "conv5_2"
251 | top: "conv5_2"
252 | name: "relu5_2"
253 | type: RELU
254 | }
255 | layers {
256 | bottom: "conv5_2"
257 | top: "conv5_3"
258 | name: "conv5_3"
259 | type: CONVOLUTION
260 | convolution_param {
261 | num_output: 512
262 | pad: 1
263 | kernel_size: 3
264 | }
265 | }
266 | layers {
267 | bottom: "conv5_3"
268 | top: "conv5_3"
269 | name: "relu5_3"
270 | type: RELU
271 | }
272 | layers {
273 | bottom: "conv5_3"
274 | top: "pool5"
275 | name: "pool5"
276 | type: POOLING
277 | pooling_param {
278 | pool: MAX
279 | kernel_size: 2
280 | stride: 2
281 | }
282 | }
283 | layers {
284 | bottom: "pool5"
285 | top: "fc6"
286 | name: "fc6"
287 | type: INNER_PRODUCT
288 | inner_product_param {
289 | num_output: 4096
290 | }
291 | }
292 | layers {
293 | bottom: "fc6"
294 | top: "fc6"
295 | name: "relu6"
296 | type: RELU
297 | }
298 | layers {
299 | bottom: "fc6"
300 | top: "fc6"
301 | name: "drop6"
302 | type: DROPOUT
303 | dropout_param {
304 | dropout_ratio: 0.5
305 | }
306 | }
307 | layers {
308 | bottom: "fc6"
309 | top: "fc7"
310 | name: "fc7"
311 | type: INNER_PRODUCT
312 | inner_product_param {
313 | num_output: 4096
314 | }
315 | }
316 | layers {
317 | bottom: "fc7"
318 | top: "fc7"
319 | name: "relu7"
320 | type: RELU
321 | }
322 |
--------------------------------------------------------------------------------