├── .gitignore ├── LICENSE.txt ├── README.md ├── data ├── README.md ├── download.sh └── preprocessed │ └── README.md ├── experiments ├── generate_submission_test.py └── train_lstm_1_vqa_test.py ├── features ├── README.md ├── coco_vgg_IDMap.txt └── download.sh ├── models ├── README.md ├── lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json └── lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5 ├── results └── README.md └── scripts ├── README.md ├── demo_batch.py ├── dumpText.py ├── evaluateLSTM.py ├── evaluateMLP.py ├── extract_features.py ├── features.py ├── get_started.sh ├── own_image.py ├── trainLSTM_1.py ├── trainLSTM_language.py ├── trainMLP.py ├── utils.py └── vgg_features.prototxt /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.pyo -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Avi Singh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning for Visual Question Answering 2 | 3 | [Click here](https://avisingh599.github.io/deeplearning/visual-qa/) to go to the accompanying blog post. 4 | 5 | This project uses Keras to train a variety of **Feedforward** and **Recurrent Neural Networks** for the task of Visual Question Answering. It is designed to work with the [VQA](http://visualqa.org) dataset. 6 | 7 | Models Implemented: 8 | 9 | |BOW+CNN Model | LSTM + CNN Model | 10 | |--------------------------------------|-------------------------| 11 | | alt text | alt text | 12 | 13 | 14 | ## Requirements 15 | 1. [Keras 0.20](http://keras.io/) 16 | 2. [spaCy 0.94](http://spacy.io/) 17 | 3. [scikit-learn 0.16](http://scikit-learn.org/) 18 | 4. [progressbar](https://pypi.python.org/pypi/progressbar) 19 | 5. Nvidia CUDA 7.5 (optional, for GPU acceleration) 20 | 6. Caffe (Optional) 21 | 22 | Tested with Python 2.7 on Ubuntu 14.04 and Centos 7.1. 23 | 24 | **Notes**: 25 | 26 | 1. Keras needs the latest Theano, which in turn needs Numpy/Scipy. 27 | 2. 
spaCy is currently used only for converting questions to a vector (or a sequence of vectors); this dependency can easily be removed if you want to. 28 | 3. spaCy uses Goldberg and Levy's word vectors by default, but I found the performance to be much better with Stanford's [GloVe word vectors](http://nlp.stanford.edu/projects/glove/). 29 | 4. VQA Tools is **not** needed. 30 | 5. Caffe (Optional) - for using the VQA models with your own images. 31 | 32 | ## Installation Guide 33 | This project has a large number of dependencies, and I have yet to write a comprehensive installation guide. In the meantime, you can use the following guide made by @gajumaru4444: 34 | 35 | 1. [Prepare for VQA in Ubuntu 14.04 x64 Part 1](https://gajumaru4444.github.io/2015/11/10/Visual-Question-Answering-2.html) 36 | 2. [Prepare for VQA in Ubuntu 14.04 x64 Part 2](https://gajumaru4444.github.io/2015/11/18/Visual-Question-Answering-3.html) 37 | 38 | If you intend to use my pre-trained models, you will also need to replace spaCy's default word vectors with the GloVe word vectors from Stanford. You can find more details [here](http://spacy.io/tutorials/load-new-word-vectors/) on how to do this. 39 | 40 | ## Using Pre-trained models 41 | Take a look at `scripts/demo_batch.py`. An LSTM-based pre-trained model has been released. It currently works only on images from the MS COCO dataset (which need to be downloaded separately), since I have pre-computed the VGG features for them. I do intend to add a pipeline for computing features for other images. A minimal loading sketch is shown under the Get Started section below. 42 | 43 | **Caution**: Use the pre-trained model with the 300D Common Crawl GloVe word embeddings. Do not use the default spaCy embeddings (Goldberg and Levy 2014). If you use these pre-trained models with any embeddings except GloVe, your results will be **garbage**. You can find more details [here](http://spacy.io/tutorials/load-new-word-vectors/) on how to do this. 44 | 45 | ## Using your own images 46 | 47 | Now you can use your own images with the `scripts/own_image.py` script. Use it like: 48 | 49 | python own_image.py --caffe /path/to/caffe 50 | 51 | For now, a Caffe installation is required. However, I'm working on a Keras-based VGG Net, which should be up soon. Download the VGG Caffe model weights from [here](http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_16_layers.caffemodel) and place the file in the `scripts` folder. 52 | 53 | ## The Numbers 54 | Performance on the **validation set** and the **test-dev set** of the [VQA Challenge](http://visualqa.org/challenge.html): 55 | 56 | | Model | val | test-dev | 57 | | ---------------------|:-------------:|:-------------:| 58 | | BOW+CNN | 48.46% | TODO | 59 | | LSTM-Language only | 44.17% | TODO | 60 | | LSTM+CNN | 51.63% | 53.34% | 61 | 62 | Note: For the validation-set results, the model was trained on the training set only; for the test-dev results, it was trained on both the training and validation sets. 63 | 64 | There is a **lot** of scope for hyperparameter tuning here. Experiments were run for 100 epochs. 65 | 66 | Training time on various hardware: 67 | 68 | | Model | GTX 760 | Intel Core i7 | 69 | | ---------------------|:-------------------:|:-------------------:| 70 | | BOW+CNN | 140 seconds/epoch | 900 seconds/epoch | 71 | | LSTM+CNN | 200 seconds/epoch | 1900 seconds/epoch | 72 | 73 | The above numbers are for a batch size of `128`, training on 215K examples in every epoch. 74 | 75 | ## Get Started 76 | Have a look at the `get_started.sh` script in the `scripts` folder.
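As a quick illustration of how the released model files fit together, here is a minimal loading sketch (an editor's addition rather than one of the repo scripts). It mirrors what `scripts/demo_batch.py` does, and assumes you have downloaded the pre-computed VGG features and swapped in the GloVe vectors as described above; the input tensors are built with the helpers in `scripts/features.py`.

```python
# Minimal sketch: loading the released LSTM+CNN model (mirrors scripts/demo_batch.py).
from keras.models import model_from_json

model_file   = '../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json'
weights_file = '../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5'

model = model_from_json(open(model_file).read())  # rebuild the architecture from the JSON config
model.load_weights(weights_file)                  # restore the epoch-70 weights
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# The model expects [question_tensor, image_features]:
#   question_tensor: (nb_samples, timesteps, 300) GloVe word vectors, built with
#                    features.get_questions_tensor_timeseries(questions, nlp, timesteps)
#   image_features:  (nb_samples, 4096) pre-computed VGG features, built with
#                    features.get_images_matrix(coco_ids, img_map, VGGfeatures)
# y_classes = model.predict_classes([question_tensor, image_features], verbose=0)
```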
Also, have a look at the readme present in each of the folders. 77 | 78 | ## Feedback 79 | All kind of feedback (code style, bugs, comments etc.) is welcome. Please open an issue on this repo instead of mailing me, since it helps me keep track of things better. 80 | 81 | ## License 82 | MIT 83 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | Download and unzip the VQA dataset from here: 2 | http://www.visualqa.org/ 3 | 4 | or you can use the download script for the same. -------------------------------------------------------------------------------- /data/download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Downloads the training and validation sets from visualqa.org. 3 | 4 | wget http://visualqa.org/data/mscoco/vqa/Questions_Train_mscoco.zip 5 | wget http://visualqa.org/data/mscoco/vqa/Questions_Val_mscoco.zip 6 | wget http://visualqa.org/data/mscoco/vqa/Annotations_Train_mscoco.zip 7 | wget http://visualqa.org/data/mscoco/vqa/Annotations_Val_mscoco.zip 8 | 9 | unzip \*.zip -------------------------------------------------------------------------------- /data/preprocessed/README.md: -------------------------------------------------------------------------------- 1 | This is where all the text files are dumped by the dumpText.py script. -------------------------------------------------------------------------------- /experiments/generate_submission_test.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import argparse 4 | from progressbar import Bar, ETA, Percentage, ProgressBar 5 | from keras.models import model_from_json 6 | 7 | from spacy.en import English 8 | import numpy as np 9 | import scipy.io 10 | from sklearn.externals import joblib 11 | 12 | sys.path.insert(0, '../scripts/') 13 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix 14 | from utils import grouper 15 | 16 | def main(): 17 | 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument('-model', type=str, required=True) 20 | parser.add_argument('-weights', type=str, required=True) 21 | parser.add_argument('-results', type=str, required=True) 22 | args = parser.parse_args() 23 | 24 | model = model_from_json(open(args.model).read()) 25 | model.load_weights(args.weights) 26 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 27 | 28 | questions_test = open('../data/preprocessed/questions_test-dev2015.txt', 29 | 'r').read().decode('utf8').splitlines() 30 | questions_lengths_test = open('../data/preprocessed/questions_lengths_test-dev2015.txt', 31 | 'r').read().decode('utf8').splitlines() 32 | questions_id_test = open('../data/preprocessed/questions_id_test-dev2015.txt', 33 | 'r').read().decode('utf8').splitlines() 34 | images_test = open('../data/preprocessed/images_test-dev2015.txt', 35 | 'r').read().decode('utf8').splitlines() 36 | vgg_model_path = '../features/coco/vgg_feats_test.mat' 37 | 38 | questions_lengths_test, questions_test, images_test, questions_id_test = (list(t) for t in zip(*sorted(zip(questions_lengths_test, questions_test, images_test, questions_id_test)))) 39 | 40 | print 'Model compiled, weights loaded' 41 | labelencoder = joblib.load('../models/labelencoder_trainval.pkl') 42 | 43 | features_struct = scipy.io.loadmat(vgg_model_path) 44 | VGGfeatures = 
features_struct['feats'] 45 | print 'Loaded vgg features' 46 | image_ids = open('../features/coco_vgg_IDMap_test.txt').read().splitlines() 47 | img_map = {} 48 | for ids in image_ids: 49 | id_split = ids.split() 50 | img_map[id_split[0]] = int(id_split[1]) 51 | 52 | nlp = English() 53 | print 'Loaded word2vec features' 54 | 55 | nb_classes = 1000 56 | y_predict_text = [] 57 | batchSize = 128 58 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'), 59 | ' ', ETA()] 60 | pbar = ProgressBar(widgets=widgets) 61 | 62 | for qu_batch,im_batch in pbar(zip(grouper(questions_test, batchSize, fillvalue=questions_test[-1]), 63 | grouper(images_test, batchSize, fillvalue=images_test[-1]))): 64 | timesteps = len(nlp(qu_batch[-1])) #questions sorted in descending order of length 65 | X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps) 66 | if 'language_only' in args.model: 67 | X_batch = X_q_batch 68 | else: 69 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures) 70 | X_batch = [X_q_batch, X_i_batch] 71 | y_predict = model.predict_classes(X_batch, verbose=0) 72 | y_predict_text.extend(labelencoder.inverse_transform(y_predict)) 73 | 74 | results = [] 75 | 76 | f1 = open(args.results, 'w') 77 | for prediction, question, question_id, image in zip(y_predict_text, questions_test, questions_id_test, images_test): 78 | answer = {} 79 | answer['question_id'] = int(question_id) 80 | answer['answer'] = prediction 81 | results.append(answer) 82 | 83 | f1.write(question.encode('utf-8')) 84 | f1.write('\n') 85 | f1.write(image.encode('utf-8')) 86 | f1.write('\n') 87 | f1.write(prediction) 88 | f1.write('\n') 89 | f1.write(question_id.encode('utf-8')) 90 | f1.write('\n') 91 | f1.write('\n') 92 | 93 | f1.close() 94 | 95 | f2 = open('../results/submission_test-dev2015.json', 'w') 96 | f2.write(json.dumps(results)) 97 | f2.close() 98 | print 'Results saved to', args.results 99 | 100 | if __name__ == "__main__": 101 | main() -------------------------------------------------------------------------------- /experiments/train_lstm_1_vqa_test.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.io 3 | import sys 4 | import argparse 5 | 6 | from keras.models import Sequential 7 | from keras.layers.core import Dense, Activation, Merge, Dropout, Reshape 8 | from keras.layers.recurrent import LSTM 9 | from keras.utils import np_utils, generic_utils 10 | from keras.callbacks import ModelCheckpoint, RemoteMonitor 11 | 12 | from sklearn.externals import joblib 13 | from sklearn import preprocessing 14 | 15 | from spacy.en import English 16 | 17 | sys.path.insert(0, '../scripts/') 18 | from utils import grouper, selectFrequentAnswers 19 | from features import get_images_matrix, get_answers_matrix, get_questions_tensor_timeseries 20 | 21 | 22 | def main(): 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('-num_hidden_units_mlp', type=int, default=1024) 25 | parser.add_argument('-num_hidden_units_lstm', type=int, default=512) 26 | parser.add_argument('-num_hidden_layers_mlp', type=int, default=3) 27 | parser.add_argument('-num_hidden_layers_lstm', type=int, default=1) 28 | parser.add_argument('-dropout', type=float, default=0.5) 29 | parser.add_argument('-activation_mlp', type=str, default='tanh') 30 | parser.add_argument('-num_epochs', type=int, default=100) 31 | parser.add_argument('-model_save_interval', type=int, default=5) 32 | parser.add_argument('-batch_size', type=int, default=128) 33 
| #TODO Feature parser.add_argument('-resume_training', type=str) 34 | #TODO Feature parser.add_argument('-language_only', type=bool, default= False) 35 | args = parser.parse_args() 36 | 37 | word_vec_dim= 300 38 | img_dim = 4096 39 | max_len = 30 40 | nb_classes = 1000 41 | 42 | #get the data 43 | questions_train = open('../data/preprocessed/questions_train2014.txt', 'r').read().decode('utf8').splitlines() 44 | questions_lengths_train = open('../data/preprocessed/questions_lengths_train2014.txt', 'r').read().decode('utf8').splitlines() 45 | answers_train = open('../data/preprocessed/answers_train2014_modal.txt', 'r').read().decode('utf8').splitlines() 46 | images_train = open('../data/preprocessed/images_train2014.txt', 'r').read().decode('utf8').splitlines() 47 | 48 | questions_val = open('../data/preprocessed/questions_val2014.txt', 'r').read().decode('utf8').splitlines() 49 | questions_lengths_val = open('../data/preprocessed/questions_lengths_val2014.txt', 'r').read().decode('utf8').splitlines() 50 | answers_val = open('../data/preprocessed/answers_val2014_modal.txt', 'r').read().decode('utf8').splitlines() 51 | images_val = open('../data/preprocessed/images_val2014.txt', 'r').read().decode('utf8').splitlines() 52 | 53 | questions_train = questions_train + questions_val 54 | questions_lengths_train = questions_lengths_train + questions_lengths_val 55 | answers_train = answers_train + answers_val 56 | images_train = images_train + images_val 57 | 58 | vgg_model_path = '../features/coco/vgg_feats.mat' 59 | 60 | max_answers = nb_classes 61 | questions_train, answers_train, images_train = selectFrequentAnswers(questions_train,answers_train,images_train, max_answers) 62 | questions_lengths_train, questions_train, answers_train, images_train = (list(t) for t in zip(*sorted(zip(questions_lengths_train, questions_train, answers_train, images_train)))) 63 | 64 | #encode the remaining answers 65 | labelencoder = preprocessing.LabelEncoder() 66 | labelencoder.fit(answers_train) 67 | nb_classes = len(list(labelencoder.classes_)) 68 | joblib.dump(labelencoder,'../models/labelencoder_trainval.pkl') 69 | 70 | image_model = Sequential() 71 | image_model.add(Reshape(input_shape = (img_dim,), dims=(img_dim,))) 72 | 73 | language_model = Sequential() 74 | if args.num_hidden_layers_lstm == 1: 75 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=False, input_shape=(max_len, word_vec_dim))) 76 | else: 77 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=True, input_shape=(max_len, word_vec_dim))) 78 | for i in xrange(args.num_hidden_layers_lstm-2): 79 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=True)) 80 | language_model.add(LSTM(output_dim = args.num_hidden_units_lstm, return_sequences=False)) 81 | 82 | model = Sequential() 83 | model.add(Merge([language_model, image_model], mode='concat', concat_axis=1)) 84 | for i in xrange(args.num_hidden_layers_mlp): 85 | model.add(Dense(args.num_hidden_units_mlp, init='uniform')) 86 | model.add(Activation(args.activation_mlp)) 87 | model.add(Dropout(args.dropout)) 88 | model.add(Dense(nb_classes)) 89 | model.add(Activation('softmax')) 90 | 91 | json_string = model.to_json() 92 | model_file_name = '../models/FULL_lstm_1_num_hidden_units_lstm_' + str(args.num_hidden_units_lstm) + \ 93 | '_num_hidden_units_mlp_' + str(args.num_hidden_units_mlp) + '_num_hidden_layers_mlp_' + \ 94 | str(args.num_hidden_layers_mlp) + '_num_hidden_layers_lstm_' + 
str(args.num_hidden_layers_lstm) 95 | open(model_file_name + '.json', 'w').write(json_string) 96 | 97 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 98 | print 'Compilation done' 99 | 100 | features_struct = scipy.io.loadmat(vgg_model_path) 101 | VGGfeatures = features_struct['feats'] 102 | print 'loaded vgg features' 103 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines() 104 | img_map = {} 105 | for ids in image_ids: 106 | id_split = ids.split() 107 | img_map[id_split[0]] = int(id_split[1]) 108 | 109 | nlp = English() 110 | print 'loaded word2vec features...' 111 | ## training 112 | print 'Training started...' 113 | for k in xrange(args.num_epochs): 114 | 115 | progbar = generic_utils.Progbar(len(questions_train)) 116 | 117 | for qu_batch,an_batch,im_batch in zip(grouper(questions_train, args.batch_size, fillvalue=questions_train[-1]), 118 | grouper(answers_train, args.batch_size, fillvalue=answers_train[-1]), 119 | grouper(images_train, args.batch_size, fillvalue=images_train[-1])): 120 | timesteps = len(nlp(qu_batch[-1])) #questions sorted in descending order of length 121 | X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps) 122 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures) 123 | Y_batch = get_answers_matrix(an_batch, labelencoder) 124 | loss = model.train_on_batch([X_q_batch, X_i_batch], Y_batch) 125 | progbar.add(args.batch_size, values=[("train loss", loss)]) 126 | 127 | 128 | if k%args.model_save_interval == 0: 129 | model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k)) 130 | 131 | model.save_weights(model_file_name + '_epoch_{:03d}.hdf5'.format(k)) 132 | 133 | if __name__ == "__main__": 134 | main() -------------------------------------------------------------------------------- /features/README.md: -------------------------------------------------------------------------------- 1 | Download and unzip the features from here: 2 | http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip -------------------------------------------------------------------------------- /features/download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Downloads and unzips the VGG features computed on the COCO dataset. 3 | 4 | wget http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip 5 | unzip coco.zip -d . -------------------------------------------------------------------------------- /models/README.md: -------------------------------------------------------------------------------- 1 | This folder will contain all the model configurations in the json files and all the model weights in the hdf5 or h5 files. 
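A side note not in the original README: the training scripts also dump the answer vocabulary into this folder as a scikit-learn `LabelEncoder` pickle (`labelencoder.pkl`, or `labelencoder_trainval.pkl` when training on train+val). It is not checked into the repo, but the demo and evaluation scripts need it to map predicted class indices back to answer strings. A minimal sketch, assuming a training run has already produced the pickle:

```python
from sklearn.externals import joblib

# Class-index -> answer-string mapping used by demo_batch.py and the evaluate* scripts.
labelencoder = joblib.load('../models/labelencoder.pkl')
print labelencoder.classes_[:10]                 # a few of the ~1000 candidate answers (Python 2 print)
print labelencoder.inverse_transform([0, 1, 2])  # map predicted class ids back to answer strings
```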
2 | -------------------------------------------------------------------------------- /models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json: -------------------------------------------------------------------------------- 1 | {"layers": [{"layers": [{"layers": [{"truncate_gradient": -1, "name": "LSTM", "inner_activation": "hard_sigmoid", "activation": "tanh", "input_shape": [30, 300], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": null, "return_sequences": false, "output_dim": 512, "forget_bias_init": "one", "input_length": null}], "name": "Sequential"}, {"layers": [{"dims": [4096], "name": "Reshape", "input_shape": [4096]}], "name": "Sequential"}], "mode": "concat", "name": "Merge", "concat_axis": 1}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1024}, {"beta": 0.1, "activation": "tanh", "name": "Activation", "target": 0}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "linear", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 1000}, {"beta": 0.1, "activation": "softmax", "name": "Activation", "target": 0}], "name": "Sequential"} -------------------------------------------------------------------------------- /models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avisingh599/visual-qa/99be95d61bf9302495e741fa53cf63b7e9a91a35/models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5 -------------------------------------------------------------------------------- /results/README.md: -------------------------------------------------------------------------------- 1 | This folder contains the predictions made by the different models for the different models. The ```overall_results.txt``` contains the performance of all the individual models. -------------------------------------------------------------------------------- /scripts/README.md: -------------------------------------------------------------------------------- 1 | Here is the utility of the various files: 2 | 3 | 0. `demo_batch.py`: You need access to pretrained models (included in the repo to run this example) 4 | 5 | 1. `get_started.sh`: Downloads data, VQAtools, pre-computed features, and trains a model. Run this script when you are done with the dependencies. 6 | 7 | 2. `dumpText.py`: Dumps the questions and answers from the VQA json files to some text files for later ease of use. 
Run `python dumpText.py -h` for more info. 8 | 9 | 3. `trainMLP.py`: Trains multi-layer perceptrons. Run `python trainMLP.py -h` for more info. 10 | 11 | 4. `trainLSTM_1.py`: Trains the LSTM-based model. Run `python trainLSTM_1.py -h` for more info. 12 | 13 | 5. `trainLSTM_language.py`: Trains the LSTM-based language-only model. Run `python trainLSTM_language.py -h` for more info. 14 | 15 | 6. `evaluateMLP.py`: Evaluates models trained by `trainMLP.py`. Needs the model json file, hdf5 weights file, and output txt file destinations to run. 16 | 17 | 7. `evaluateLSTM.py`: Evaluates models trained by `trainLSTM_1.py` and `trainLSTM_language.py`. Needs the model json file, hdf5 weights file, and output txt file destinations to run. 18 | 19 | 8. `features.py`: Contains functions that are used to convert images and words to vectors (or sequences of vectors). 20 | 21 | 9. `utils.py`: Exactly what you think. 22 | 23 | 10. `own_image.py`: Use your own image. Caffe installation required. 24 | 25 | 11. `extract_features.py`: Extracts 4096D VGG features from a VGG Caffe model. 26 | 27 | 12. `vgg_features.prototxt`: VGG Caffe model definition. 28 | -------------------------------------------------------------------------------- /scripts/demo_batch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import random 3 | from PIL import Image 4 | import subprocess 5 | from os import listdir 6 | from os.path import isfile, join 7 | 8 | from keras.models import model_from_json 9 | 10 | from spacy.en import English 11 | import numpy as np 12 | import scipy.io 13 | from sklearn.externals import joblib 14 | 15 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix 16 | 17 | def main(): 18 | ''' 19 | Before running this demo, ensure that you have some images from the MS COCO validation set 20 | saved somewhere, and update the image_dir variable accordingly. 21 | Also, this demo is designed to run with the models released with the visual-qa repo; if you 22 | would like to use it with some other model (say an MLP-based model or a language-only model), 23 | you will have to make some changes.
24 | ''' 25 | image_dir = '../../vqa_images/' 26 | local_images = [ f for f in listdir(image_dir) if isfile(join(image_dir,f)) ] 27 | 28 | parser = argparse.ArgumentParser() 29 | parser.add_argument('-model', type=str, default='../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3.json') 30 | parser.add_argument('-weights', type=str, default='../models/lstm_1_num_hidden_units_lstm_512_num_hidden_units_mlp_1024_num_hidden_layers_mlp_3_epoch_070.hdf5') 31 | parser.add_argument('-sample_size', type=int, default=25) 32 | args = parser.parse_args() 33 | 34 | model = model_from_json(open(args.model).read()) 35 | model.load_weights(args.weights) 36 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 37 | print 'Model loaded and compiled' 38 | images_val = open('../data/preprocessed/images_val2014.txt', 39 | 'r').read().decode('utf8').splitlines() 40 | 41 | nlp = English() 42 | print 'Loaded word2vec features' 43 | labelencoder = joblib.load('../models/labelencoder.pkl') 44 | 45 | vgg_model_path = '../features/coco/vgg_feats.mat' 46 | features_struct = scipy.io.loadmat(vgg_model_path) 47 | VGGfeatures = features_struct['feats'] 48 | print 'Loaded vgg features' 49 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines() 50 | img_map = {} 51 | for ids in image_ids: 52 | id_split = ids.split() 53 | img_map[id_split[0]] = int(id_split[1]) 54 | 55 | image_sample = random.sample(local_images, args.sample_size) 56 | 57 | for image in image_sample: 58 | p = subprocess.Popen(["display", image_dir + image]) 59 | q = unicode(raw_input("Ask a question about the image:")) 60 | coco_id = str(int(image[-16:-4])) 61 | timesteps = len(nlp(q)) #questions sorted in descending order of length 62 | X_q = get_questions_tensor_timeseries([q], nlp, timesteps) 63 | X_i = get_images_matrix([coco_id], img_map, VGGfeatures) 64 | X = [X_q, X_i] 65 | y_predict = model.predict_classes(X, verbose=0) 66 | print labelencoder.inverse_transform(y_predict) 67 | raw_input('Press enter to continue...') 68 | p.kill() 69 | 70 | if __name__ == "__main__": 71 | main() 72 | -------------------------------------------------------------------------------- /scripts/dumpText.py: -------------------------------------------------------------------------------- 1 | import operator 2 | import argparse 3 | import progressbar 4 | import json 5 | from spacy.en import English 6 | 7 | def getModalAnswer(answers): 8 | candidates = {} 9 | for i in xrange(10): 10 | candidates[answers[i]['answer']] = 1 11 | 12 | for i in xrange(10): 13 | candidates[answers[i]['answer']] += 1 14 | 15 | return max(candidates.iteritems(), key=operator.itemgetter(1))[0] 16 | 17 | def getAllAnswer(answers): 18 | answer_list = [] 19 | for i in xrange(10): 20 | answer_list.append(answers[i]['answer']) 21 | 22 | return ';'.join(answer_list) 23 | 24 | def main(): 25 | parser = argparse.ArgumentParser() 26 | parser.add_argument('-split', type=str, default='train', 27 | help='Specify which part of the dataset you want to dump to text. 
Your options are: train, val, test, test-dev') 28 | parser.add_argument('-answers', type=str, default='modal', 29 | help='Specify if you want to dump just the most frequent answer for each questions (modal), or all the answers (all)') 30 | args = parser.parse_args() 31 | 32 | nlp = English() #used for conting number of tokens 33 | 34 | if args.split == 'train': 35 | annFile = '../data/mscoco_train2014_annotations.json' 36 | quesFile = '../data/OpenEnded_mscoco_train2014_questions.json' 37 | questions_file = open('../data/preprocessed/questions_train2014.txt', 'w') 38 | questions_id_file = open('../data/preprocessed/questions_id_train2014.txt', 'w') 39 | questions_lengths_file = open('../data/preprocessed/questions_lengths_train2014.txt', 'w') 40 | if args.answers == 'modal': 41 | answers_file = open('../data/preprocessed/answers_train2014_modal.txt', 'w') 42 | elif args.answers == 'all': 43 | answers_file = open('../data/preprocessed/answers_train2014_all.txt', 'w') 44 | coco_image_id = open('../data/preprocessed/images_train2014.txt', 'w') 45 | data_split = 'training data' 46 | elif args.split == 'val': 47 | annFile = '../data/mscoco_val2014_annotations.json' 48 | quesFile = '../data/OpenEnded_mscoco_val2014_questions.json' 49 | questions_file = open('../data/preprocessed/questions_val2014.txt', 'w') 50 | questions_id_file = open('../data/preprocessed/questions_id_val2014.txt', 'w') 51 | questions_lengths_file = open('../data/preprocessed/questions_lengths_val2014.txt', 'w') 52 | if args.answers == 'modal': 53 | answers_file = open('../data/preprocessed/answers_val2014_modal.txt', 'w') 54 | elif args.answers == 'all': 55 | answers_file = open('../data/preprocessed/answers_val2014_all.txt', 'w') 56 | coco_image_id = open('../data/preprocessed/images_val2014_all.txt', 'w') 57 | data_split = 'validation data' 58 | elif args.split == 'test-dev': 59 | quesFile = '../data/OpenEnded_mscoco_test-dev2015_questions.json' 60 | questions_file = open('../data/preprocessed/questions_test-dev2015.txt', 'w') 61 | questions_id_file = open('../data/preprocessed/questions_id_test-dev2015.txt', 'w') 62 | questions_lengths_file = open('../data/preprocessed/questions_lengths_test-dev2015.txt', 'w') 63 | coco_image_id = open('../data/preprocessed/images_test-dev2015.txt', 'w') 64 | data_split = 'test-dev data' 65 | elif args.split == 'test': 66 | quesFile = '../data/OpenEnded_mscoco_test2015_questions.json' 67 | questions_file = open('../data/preprocessed/questions_test2015.txt', 'w') 68 | questions_id_file = open('../data/preprocessed/questions_id_test2015.txt', 'w') 69 | questions_lengths_file = open('../data/preprocessed/questions_lengths_test2015.txt', 'w') 70 | coco_image_id = open('../data/preprocessed/images_test2015.txt', 'w') 71 | data_split = 'test data' 72 | else: 73 | raise RuntimeError('Incorrect split. Your choices are:\ntrain\nval\ntest-dev\ntest') 74 | 75 | #initialize VQA api for QA annotations 76 | #vqa=VQA(annFile, quesFile) 77 | questions = json.load(open(quesFile, 'r')) 78 | ques = questions['questions'] 79 | if args.split == 'train' or args.split == 'val': 80 | qa = json.load(open(annFile, 'r')) 81 | qa = qa['annotations'] 82 | 83 | pbar = progressbar.ProgressBar() 84 | print 'Dumping questions, answers, questionIDs, imageIDs, and questions lengths to text files...' 
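# The loop below writes one line per question, aligned across the output files: the question
# text, its spaCy token count, its question_id, and its COCO image_id. For the train/val
# splits it also writes the modal answer (or all ten answers, with "-answers all") per question.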
85 | for i, q in pbar(zip(xrange(len(ques)),ques)): 86 | questions_file.write((q['question'] + '\n').encode('utf8')) 87 | questions_lengths_file.write((str(len(nlp(q['question'])))+ '\n').encode('utf8')) 88 | questions_id_file.write((str(q['question_id']) + '\n').encode('utf8')) 89 | coco_image_id.write((str(q['image_id']) + '\n').encode('utf8')) 90 | if args.split == 'train' or args.split == 'val': 91 | if args.answers == 'modal': 92 | answers_file.write(getModalAnswer(qa[i]['answers']).encode('utf8')) 93 | elif args.answers == 'all': 94 | answers_file.write(getAllAnswer(qa[i]['answers']).encode('utf8')) 95 | answers_file.write('\n'.encode('utf8')) 96 | 97 | print 'completed dumping', data_split 98 | 99 | if __name__ == "__main__": 100 | main() -------------------------------------------------------------------------------- /scripts/evaluateLSTM.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from progressbar import Bar, ETA, Percentage, ProgressBar 3 | from keras.models import model_from_json 4 | 5 | from spacy.en import English 6 | import numpy as np 7 | import scipy.io 8 | from sklearn.externals import joblib 9 | 10 | from features import get_questions_tensor_timeseries, get_images_matrix, get_answers_matrix 11 | from utils import grouper 12 | 13 | def main(): 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument('-model', type=str, required=True) 17 | parser.add_argument('-weights', type=str, required=True) 18 | parser.add_argument('-results', type=str, required=True) 19 | args = parser.parse_args() 20 | 21 | model = model_from_json(open(args.model).read()) 22 | model.load_weights(args.weights) 23 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 24 | 25 | questions_val = open('../data/preprocessed/questions_val2014.txt', 26 | 'r').read().decode('utf8').splitlines() 27 | questions_lengths_val = open('../data/preprocessed/questions_lengths_val2014.txt', 28 | 'r').read().decode('utf8').splitlines() 29 | answers_val = open('../data/preprocessed/answers_val2014_all.txt', 30 | 'r').read().decode('utf8').splitlines() 31 | images_val = open('../data/preprocessed/images_val2014.txt', 32 | 'r').read().decode('utf8').splitlines() 33 | vgg_model_path = '../features/coco/vgg_feats.mat' 34 | 35 | questions_lengths_val, questions_val, answers_val, images_val = (list(t) for t in zip(*sorted(zip(questions_lengths_val, questions_val, answers_val, images_val)))) 36 | 37 | print 'Model compiled, weights loaded' 38 | labelencoder = joblib.load('../models/labelencoder.pkl') 39 | 40 | features_struct = scipy.io.loadmat(vgg_model_path) 41 | VGGfeatures = features_struct['feats'] 42 | print 'Loaded vgg features' 43 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines() 44 | img_map = {} 45 | for ids in image_ids: 46 | id_split = ids.split() 47 | img_map[id_split[0]] = int(id_split[1]) 48 | 49 | nlp = English() 50 | print 'Loaded word2vec features' 51 | 52 | nb_classes = 1000 53 | y_predict_text = [] 54 | batchSize = 128 55 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'), 56 | ' ', ETA()] 57 | pbar = ProgressBar(widgets=widgets) 58 | 59 | for qu_batch,an_batch,im_batch in pbar(zip(grouper(questions_val, batchSize, fillvalue=questions_val[0]), 60 | grouper(answers_val, batchSize, fillvalue=answers_val[0]), 61 | grouper(images_val, batchSize, fillvalue=images_val[0]))): 62 | timesteps = len(nlp(qu_batch[-1])) #questions sorted in descending order of length 63 | 
X_q_batch = get_questions_tensor_timeseries(qu_batch, nlp, timesteps) 64 | if 'language_only' in args.model: 65 | X_batch = X_q_batch 66 | else: 67 | X_i_batch = get_images_matrix(im_batch, img_map, VGGfeatures) 68 | X_batch = [X_q_batch, X_i_batch] 69 | y_predict = model.predict_classes(X_batch, verbose=0) 70 | y_predict_text.extend(labelencoder.inverse_transform(y_predict)) 71 | 72 | total = 0 73 | correct_val=0.0 74 | f1 = open(args.results, 'w') 75 | for prediction, truth, question, image in zip(y_predict_text, answers_val, questions_val, images_val): 76 | temp_count=0 77 | for _truth in truth.split(';'): 78 | if prediction == _truth: 79 | temp_count+=1 80 | 81 | if temp_count>2: 82 | correct_val+=1 83 | else: 84 | correct_val+=float(temp_count)/3 85 | 86 | total+=1 87 | 88 | f1.write(question.encode('utf-8')) 89 | f1.write('\n') 90 | f1.write(image.encode('utf-8')) 91 | f1.write('\n') 92 | f1.write(prediction) 93 | f1.write('\n') 94 | f1.write(truth.encode('utf-8')) 95 | f1.write('\n') 96 | f1.write('\n') 97 | 98 | f1.write('Final Accuracy is ' + str(correct_val/total)) 99 | f1.close() 100 | f1 = open('../results/overall_results.txt', 'a') 101 | f1.write(args.weights + '\n') 102 | f1.write(str(correct_val/total) + '\n\n') 103 | f1.close() 104 | print 'Final Accuracy on the validation set is', correct_val/total 105 | 106 | if __name__ == "__main__": 107 | main() -------------------------------------------------------------------------------- /scripts/evaluateMLP.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import argparse 3 | from progressbar import Bar, ETA, Percentage, ProgressBar 4 | from keras.models import model_from_json 5 | 6 | from spacy.en import English 7 | import numpy as np 8 | import scipy.io 9 | from sklearn.externals import joblib 10 | 11 | from features import get_questions_matrix_sum, get_images_matrix, get_answers_matrix 12 | from utils import grouper 13 | 14 | def main(): 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument('-model', type=str, required=True) 17 | parser.add_argument('-weights', type=str, required=True) 18 | parser.add_argument('-results', type=str, required=True) 19 | args = parser.parse_args() 20 | 21 | model = model_from_json(open(args.model).read()) 22 | model.load_weights(args.weights) 23 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 24 | 25 | questions_val = open('../data/preprocessed/questions_val2014.txt', 26 | 'r').read().decode('utf8').splitlines() 27 | answers_val = open('../data/preprocessed/answers_val2014_all.txt', 28 | 'r').read().decode('utf8').splitlines() 29 | images_val = open('../data/preprocessed/images_val2014.txt', 30 | 'r').read().decode('utf8').splitlines() 31 | vgg_model_path = '../features/coco/vgg_feats.mat' 32 | 33 | print 'Model compiled, weights loaded...' 
34 | labelencoder = joblib.load('../models/labelencoder.pkl') 35 | 36 | features_struct = scipy.io.loadmat(vgg_model_path) 37 | VGGfeatures = features_struct['feats'] 38 | print 'loaded vgg features' 39 | image_ids = open('../features/coco_vgg_IDMap.txt').read().splitlines() 40 | img_map = {} 41 | for ids in image_ids: 42 | id_split = ids.split() 43 | img_map[id_split[0]] = int(id_split[1]) 44 | 45 | nlp = English() 46 | print 'loaded word2vec features' 47 | 48 | nb_classes = 1000 49 | y_predict_text = [] 50 | batchSize = 128 51 | widgets = ['Evaluating ', Percentage(), ' ', Bar(marker='#',left='[',right=']'), 52 | ' ', ETA()] 53 | pbar = ProgressBar(widgets=widgets) 54 | 55 | for qu_batch,an_batch,im_batch in pbar(zip(grouper(questions_val, batchSize, fillvalue=questions_val[0]), 56 | grouper(answers_val, batchSize, fillvalue=answers_val[0]), 57 | grouper(images_val, batchSize, fillvalue=images_val[0]))): 58 | X_q_batch = get_questions_matrix_sum(qu_batch, nlp) 59 | if 'language_only' in args.model: 60 | X_batch = X_q_batch 61 | else: 62 | X_i_batch = get_images_matrix(im_batch, img_map , VGGfeatures) 63 | X_batch = np.hstack((X_q_batch, X_i_batch)) 64 | y_predict = model.predict_classes(X_batch, verbose=0) 65 | y_predict_text.extend(labelencoder.inverse_transform(y_predict)) 66 | 67 | correct_val=0.0 68 | total=0 69 | f1 = open(args.results, 'w') 70 | 71 | for prediction, truth, question, image in zip(y_predict_text, answers_val, questions_val, images_val): 72 | temp_count=0 73 | for _truth in truth.split(';'): 74 | if prediction == _truth: 75 | temp_count+=1 76 | 77 | if temp_count>2: 78 | correct_val+=1 79 | else: 80 | correct_val+= float(temp_count)/3 81 | 82 | total+=1 83 | f1.write(question.encode('utf-8')) 84 | f1.write('\n') 85 | f1.write(image.encode('utf-8')) 86 | f1.write('\n') 87 | f1.write(prediction) 88 | f1.write('\n') 89 | f1.write(truth.encode('utf-8')) 90 | f1.write('\n') 91 | f1.write('\n') 92 | 93 | f1.write('Final Accuracy is ' + str(correct_val/total)) 94 | f1.close() 95 | f1 = open('../results/overall_results.txt', 'a') 96 | f1.write(args.weights + '\n') 97 | f1.write(str(correct_val/total) + '\n') 98 | f1.close() 99 | print 'Final Accuracy on the validation set is', correct_val/total 100 | 101 | if __name__ == "__main__": 102 | main() 103 | -------------------------------------------------------------------------------- /scripts/extract_features.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os.path 3 | import argparse 4 | 5 | import numpy as np 6 | from scipy.misc import imread, imresize 7 | import scipy.io 8 | 9 | parser = argparse.ArgumentParser() 10 | parser.add_argument('--caffe', help='path to caffe installation') 11 | parser.add_argument('--model_def', help='path to model definition prototxt') 12 | parser.add_argument('--model', help='path to model parameters') 13 | parser.add_argument('--gpu', action='store_true', help='whether to use gpu') 14 | parser.add_argument('--image', help='path to image') 15 | 16 | args = parser.parse_args() 17 | 18 | if args.caffe: 19 | caffepath = args.caffe + '/python' 20 | sys.path.append(caffepath) 21 | 22 | import caffe 23 | 24 | def predict(in_data, net): 25 | 26 | out = net.forward(**{net.inputs[0]: in_data}) 27 | features = out[net.outputs[0]] 28 | return features 29 | 30 | 31 | def batch_predict(filenames, net): 32 | 33 | N, C, H, W = net.blobs[net.inputs[0]].data.shape 34 | F = net.blobs[net.outputs[0]].data.shape[1] 35 | Nf = len(filenames) 36 | Hi, Wi, _ = 
imread(filenames[0]).shape 37 | allftrs = np.zeros((Nf, F)) 38 | for i in range(0, Nf, N): 39 | in_data = np.zeros((N, C, H, W), dtype=np.float32) 40 | 41 | batch_range = range(i, min(i+N, Nf)) 42 | batch_filenames = [filenames[j] for j in batch_range] 43 | Nb = len(batch_range) 44 | 45 | batch_images = np.zeros((Nb, 3, H, W)) 46 | for j,fname in enumerate(batch_filenames): 47 | im = imread(fname) 48 | if len(im.shape) == 2: 49 | im = np.tile(im[:,:,np.newaxis], (1,1,3)) 50 | # RGB -> BGR 51 | im = im[:,:,(2,1,0)] 52 | # mean subtraction 53 | im = im - np.array([103.939, 116.779, 123.68]) 54 | # resize 55 | im = imresize(im, (H, W), 'bicubic') 56 | # get channel in correct dimension 57 | im = np.transpose(im, (2, 0, 1)) 58 | batch_images[j,:,:,:] = im 59 | 60 | # insert into correct place 61 | in_data[0:len(batch_range), :, :, :] = batch_images 62 | 63 | # predict features 64 | ftrs = predict(in_data, net) 65 | 66 | for j in range(len(batch_range)): 67 | allftrs[i+j,:] = ftrs[j,:] 68 | 69 | print 'Done %d/%d files' % (i+len(batch_range), len(filenames)) 70 | 71 | return allftrs 72 | 73 | 74 | if args.gpu: 75 | caffe.set_mode_gpu() 76 | else: 77 | caffe.set_mode_cpu() 78 | 79 | net = caffe.Net(args.model_def, args.model, caffe.TEST) 80 | 81 | base_dir = os.path.dirname(args.image) 82 | 83 | allftrs = batch_predict([args.image], net) 84 | 85 | scipy.io.savemat(os.path.join(base_dir, 'vgg_feats.mat'), mdict = {'feats': np.transpose(allftrs)}) 86 | -------------------------------------------------------------------------------- /scripts/features.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from keras.utils import np_utils 3 | 4 | 5 | def get_questions_tensor_timeseries(questions, nlp, timesteps): 6 | ''' 7 | Returns a time series of word vectors for tokens in the question 8 | 9 | Input: 10 | questions: list of unicode objects 11 | nlp: an instance of the class English() from spacy.en 12 | timesteps: the number of 13 | 14 | Output: 15 | A numpy ndarray of shape: (nb_samples, timesteps, word_vec_dim) 16 | ''' 17 | assert not isinstance(questions, basestring) 18 | nb_samples = len(questions) 19 | word_vec_dim = nlp(questions[0])[0].vector.shape[0] 20 | questions_tensor = np.zeros((nb_samples, timesteps, word_vec_dim)) 21 | for i in xrange(len(questions)): 22 | tokens = nlp(questions[i]) 23 | for j in xrange(len(tokens)): 24 | if j0: 68 | model.add(Dropout(args.dropout)) 69 | for i in xrange(args.num_hidden_layers-1): 70 | model.add(Dense(args.num_hidden_units, init='uniform')) 71 | model.add(Activation(args.activation)) 72 | if args.dropout>0: 73 | model.add(Dropout(args.dropout)) 74 | model.add(Dense(nb_classes, init='uniform')) 75 | model.add(Activation('softmax')) 76 | 77 | json_string = model.to_json() 78 | if args.language_only: 79 | model_file_name = '../models/mlp_language_only_num_hidden_units_' + str(args.num_hidden_units) + '_num_hidden_layers_' + str(args.num_hidden_layers) 80 | else: 81 | model_file_name = '../models/mlp_num_hidden_units_' + str(args.num_hidden_units) + '_num_hidden_layers_' + str(args.num_hidden_layers) 82 | open(model_file_name + '.json', 'w').write(json_string) 83 | 84 | print 'Compiling model...' 85 | model.compile(loss='categorical_crossentropy', optimizer='rmsprop') 86 | print 'Compilation done...' 87 | 88 | print 'Training started...' 
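# Training loop: every epoch re-shuffles the (question, answer, image) triples, then walks over
# them in fixed-size batches built by grouper() (the last short batch is padded with the final
# example). Questions are encoded as summed word vectors, images as 4096-d VGG features
# (concatenated with the question features unless language_only is set), and weights are
# checkpointed to ../models/ every model_save_interval epochs.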
89 | for k in xrange(args.num_epochs): 90 | #shuffle the data points before going through them 91 | index_shuf = range(len(questions_train)) 92 | shuffle(index_shuf) 93 | questions_train = [questions_train[i] for i in index_shuf] 94 | answers_train = [answers_train[i] for i in index_shuf] 95 | images_train = [images_train[i] for i in index_shuf] 96 | progbar = generic_utils.Progbar(len(questions_train)) 97 | for qu_batch,an_batch,im_batch in zip(grouper(questions_train, args.batch_size, fillvalue=questions_train[-1]), 98 | grouper(answers_train, args.batch_size, fillvalue=answers_train[-1]), 99 | grouper(images_train, args.batch_size, fillvalue=images_train[-1])): 100 | X_q_batch = get_questions_matrix_sum(qu_batch, nlp) 101 | if args.language_only: 102 | X_batch = X_q_batch 103 | else: 104 | X_i_batch = get_images_matrix(im_batch, id_map, VGGfeatures) 105 | X_batch = np.hstack((X_q_batch, X_i_batch)) 106 | Y_batch = get_answers_matrix(an_batch, labelencoder) 107 | loss = model.train_on_batch(X_batch, Y_batch) 108 | progbar.add(args.batch_size, values=[("train loss", loss)]) 109 | #print type(loss) 110 | if k%args.model_save_interval == 0: 111 | model.save_weights(model_file_name + '_epoch_{:02d}.hdf5'.format(k)) 112 | 113 | model.save_weights(model_file_name + '_epoch_{:02d}.hdf5'.format(k)) 114 | 115 | if __name__ == "__main__": 116 | main() -------------------------------------------------------------------------------- /scripts/utils.py: -------------------------------------------------------------------------------- 1 | import operator 2 | from itertools import izip_longest 3 | from collections import defaultdict 4 | 5 | def selectFrequentAnswers(questions_train, answers_train, images_train, maxAnswers): 6 | answer_fq= defaultdict(int) 7 | #build a dictionary of answers 8 | for answer in answers_train: 9 | answer_fq[answer] += 1 10 | 11 | sorted_fq = sorted(answer_fq.items(), key=operator.itemgetter(1), reverse=True)[0:maxAnswers] 12 | top_answers, top_fq = zip(*sorted_fq) 13 | new_answers_train=[] 14 | new_questions_train=[] 15 | new_images_train=[] 16 | #only those answer which appear int he top 1K are used for training 17 | for answer,question,image in zip(answers_train, questions_train, images_train): 18 | if answer in top_answers: 19 | new_answers_train.append(answer) 20 | new_questions_train.append(question) 21 | new_images_train.append(image) 22 | 23 | return (new_questions_train,new_answers_train,new_images_train) 24 | 25 | def grouper(iterable, n, fillvalue=None): 26 | args = [iter(iterable)] * n 27 | return izip_longest(*args, fillvalue=fillvalue) -------------------------------------------------------------------------------- /scripts/vgg_features.prototxt: -------------------------------------------------------------------------------- 1 | name: "VGG_ILSVRC_16_layers" 2 | input: "data" 3 | input_dim: 10 4 | input_dim: 3 5 | input_dim: 224 6 | input_dim: 224 7 | layers { 8 | bottom: "data" 9 | top: "conv1_1" 10 | name: "conv1_1" 11 | type: CONVOLUTION 12 | convolution_param { 13 | num_output: 64 14 | pad: 1 15 | kernel_size: 3 16 | } 17 | } 18 | layers { 19 | bottom: "conv1_1" 20 | top: "conv1_1" 21 | name: "relu1_1" 22 | type: RELU 23 | } 24 | layers { 25 | bottom: "conv1_1" 26 | top: "conv1_2" 27 | name: "conv1_2" 28 | type: CONVOLUTION 29 | convolution_param { 30 | num_output: 64 31 | pad: 1 32 | kernel_size: 3 33 | } 34 | } 35 | layers { 36 | bottom: "conv1_2" 37 | top: "conv1_2" 38 | name: "relu1_2" 39 | type: RELU 40 | } 41 | layers { 42 | bottom: "conv1_2" 43 | top: 
"pool1" 44 | name: "pool1" 45 | type: POOLING 46 | pooling_param { 47 | pool: MAX 48 | kernel_size: 2 49 | stride: 2 50 | } 51 | } 52 | layers { 53 | bottom: "pool1" 54 | top: "conv2_1" 55 | name: "conv2_1" 56 | type: CONVOLUTION 57 | convolution_param { 58 | num_output: 128 59 | pad: 1 60 | kernel_size: 3 61 | } 62 | } 63 | layers { 64 | bottom: "conv2_1" 65 | top: "conv2_1" 66 | name: "relu2_1" 67 | type: RELU 68 | } 69 | layers { 70 | bottom: "conv2_1" 71 | top: "conv2_2" 72 | name: "conv2_2" 73 | type: CONVOLUTION 74 | convolution_param { 75 | num_output: 128 76 | pad: 1 77 | kernel_size: 3 78 | } 79 | } 80 | layers { 81 | bottom: "conv2_2" 82 | top: "conv2_2" 83 | name: "relu2_2" 84 | type: RELU 85 | } 86 | layers { 87 | bottom: "conv2_2" 88 | top: "pool2" 89 | name: "pool2" 90 | type: POOLING 91 | pooling_param { 92 | pool: MAX 93 | kernel_size: 2 94 | stride: 2 95 | } 96 | } 97 | layers { 98 | bottom: "pool2" 99 | top: "conv3_1" 100 | name: "conv3_1" 101 | type: CONVOLUTION 102 | convolution_param { 103 | num_output: 256 104 | pad: 1 105 | kernel_size: 3 106 | } 107 | } 108 | layers { 109 | bottom: "conv3_1" 110 | top: "conv3_1" 111 | name: "relu3_1" 112 | type: RELU 113 | } 114 | layers { 115 | bottom: "conv3_1" 116 | top: "conv3_2" 117 | name: "conv3_2" 118 | type: CONVOLUTION 119 | convolution_param { 120 | num_output: 256 121 | pad: 1 122 | kernel_size: 3 123 | } 124 | } 125 | layers { 126 | bottom: "conv3_2" 127 | top: "conv3_2" 128 | name: "relu3_2" 129 | type: RELU 130 | } 131 | layers { 132 | bottom: "conv3_2" 133 | top: "conv3_3" 134 | name: "conv3_3" 135 | type: CONVOLUTION 136 | convolution_param { 137 | num_output: 256 138 | pad: 1 139 | kernel_size: 3 140 | } 141 | } 142 | layers { 143 | bottom: "conv3_3" 144 | top: "conv3_3" 145 | name: "relu3_3" 146 | type: RELU 147 | } 148 | layers { 149 | bottom: "conv3_3" 150 | top: "pool3" 151 | name: "pool3" 152 | type: POOLING 153 | pooling_param { 154 | pool: MAX 155 | kernel_size: 2 156 | stride: 2 157 | } 158 | } 159 | layers { 160 | bottom: "pool3" 161 | top: "conv4_1" 162 | name: "conv4_1" 163 | type: CONVOLUTION 164 | convolution_param { 165 | num_output: 512 166 | pad: 1 167 | kernel_size: 3 168 | } 169 | } 170 | layers { 171 | bottom: "conv4_1" 172 | top: "conv4_1" 173 | name: "relu4_1" 174 | type: RELU 175 | } 176 | layers { 177 | bottom: "conv4_1" 178 | top: "conv4_2" 179 | name: "conv4_2" 180 | type: CONVOLUTION 181 | convolution_param { 182 | num_output: 512 183 | pad: 1 184 | kernel_size: 3 185 | } 186 | } 187 | layers { 188 | bottom: "conv4_2" 189 | top: "conv4_2" 190 | name: "relu4_2" 191 | type: RELU 192 | } 193 | layers { 194 | bottom: "conv4_2" 195 | top: "conv4_3" 196 | name: "conv4_3" 197 | type: CONVOLUTION 198 | convolution_param { 199 | num_output: 512 200 | pad: 1 201 | kernel_size: 3 202 | } 203 | } 204 | layers { 205 | bottom: "conv4_3" 206 | top: "conv4_3" 207 | name: "relu4_3" 208 | type: RELU 209 | } 210 | layers { 211 | bottom: "conv4_3" 212 | top: "pool4" 213 | name: "pool4" 214 | type: POOLING 215 | pooling_param { 216 | pool: MAX 217 | kernel_size: 2 218 | stride: 2 219 | } 220 | } 221 | layers { 222 | bottom: "pool4" 223 | top: "conv5_1" 224 | name: "conv5_1" 225 | type: CONVOLUTION 226 | convolution_param { 227 | num_output: 512 228 | pad: 1 229 | kernel_size: 3 230 | } 231 | } 232 | layers { 233 | bottom: "conv5_1" 234 | top: "conv5_1" 235 | name: "relu5_1" 236 | type: RELU 237 | } 238 | layers { 239 | bottom: "conv5_1" 240 | top: "conv5_2" 241 | name: "conv5_2" 242 | type: CONVOLUTION 243 | 
convolution_param { 244 | num_output: 512 245 | pad: 1 246 | kernel_size: 3 247 | } 248 | } 249 | layers { 250 | bottom: "conv5_2" 251 | top: "conv5_2" 252 | name: "relu5_2" 253 | type: RELU 254 | } 255 | layers { 256 | bottom: "conv5_2" 257 | top: "conv5_3" 258 | name: "conv5_3" 259 | type: CONVOLUTION 260 | convolution_param { 261 | num_output: 512 262 | pad: 1 263 | kernel_size: 3 264 | } 265 | } 266 | layers { 267 | bottom: "conv5_3" 268 | top: "conv5_3" 269 | name: "relu5_3" 270 | type: RELU 271 | } 272 | layers { 273 | bottom: "conv5_3" 274 | top: "pool5" 275 | name: "pool5" 276 | type: POOLING 277 | pooling_param { 278 | pool: MAX 279 | kernel_size: 2 280 | stride: 2 281 | } 282 | } 283 | layers { 284 | bottom: "pool5" 285 | top: "fc6" 286 | name: "fc6" 287 | type: INNER_PRODUCT 288 | inner_product_param { 289 | num_output: 4096 290 | } 291 | } 292 | layers { 293 | bottom: "fc6" 294 | top: "fc6" 295 | name: "relu6" 296 | type: RELU 297 | } 298 | layers { 299 | bottom: "fc6" 300 | top: "fc6" 301 | name: "drop6" 302 | type: DROPOUT 303 | dropout_param { 304 | dropout_ratio: 0.5 305 | } 306 | } 307 | layers { 308 | bottom: "fc6" 309 | top: "fc7" 310 | name: "fc7" 311 | type: INNER_PRODUCT 312 | inner_product_param { 313 | num_output: 4096 314 | } 315 | } 316 | layers { 317 | bottom: "fc7" 318 | top: "fc7" 319 | name: "relu7" 320 | type: RELU 321 | } 322 | --------------------------------------------------------------------------------
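A closing usage note (an addition for convenience): `vgg_features.prototxt` above is the network definition that `scripts/extract_features.py` loads through Caffe. Going by the argparse flags defined in that script, a typical invocation looks roughly like the following (the caffemodel and image paths are placeholders; drop `--gpu` to run on the CPU). The script writes a `vgg_feats.mat` file next to the input image.

```
python extract_features.py --caffe /path/to/caffe \
    --model_def vgg_features.prototxt \
    --model VGG_ILSVRC_16_layers.caffemodel \
    --image /path/to/image.jpg \
    --gpu
```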