├── .gitignore ├── README.md ├── coco-caption └── myeval.py ├── coco └── coco_preprocess.ipynb ├── convert_checkpoint_gpu_to_cpu.lua ├── cv ├── README.md ├── driver.py ├── inspect_cv.ipynb ├── killall.sh ├── runworker.sh └── spawn.sh ├── eval.lua ├── misc ├── DataLoader.lua ├── DataLoaderRaw.lua ├── LSTM.lua ├── LanguageModel.lua ├── call_python_caption_eval.sh ├── gradcheck.lua ├── net_utils.lua ├── optim_updates.lua └── utils.lua ├── prepro.py ├── test_language_model.lua ├── train.lua ├── videocaptioning.lua └── vis ├── imgs └── dummy ├── index.html ├── jquery-1.8.3.min.js └── teaser.jpeg /.gitignore: -------------------------------------------------------------------------------- 1 | coco/ 2 | coco-caption/ 3 | model/ 4 | .ipynb_checkpoints/ 5 | vis/imgs/ 6 | vis/vis.json 7 | testimages/ 8 | checkpoints/ 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # NeuralTalk2 3 | 4 | **Update (September 22, 2016)**: The Google Brain team has [released the image captioning model](https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html) of Vinyals et al. (2015). The core model is very similar to NeuralTalk2 (a CNN followed by RNN), but the Google release should work significantly better as a result of better CNN, some tricks, and more careful engineering. Find it under [im2txt](https://github.com/tensorflow/models/tree/master/im2txt/im2txt) repo in tensorflow. I'll leave this code base up for educational purposes and as a Torch implementation. 5 | 6 | Recurrent Neural Network captions your images. Now much faster and better than the original [NeuralTalk](https://github.com/karpathy/neuraltalk). Compared to the original NeuralTalk this implementation is **batched, uses Torch, runs on a GPU, and supports CNN finetuning**. All of these together result in quite a large increase in training speed for the Language Model (~100x), but overall not as much because we also have to forward a VGGNet. However, overall very good models can be trained in 2-3 days, and they show a much better performance. 7 | 8 | This is an early code release that works great but is slightly hastily released and probably requires some code reading of inline comments (which I tried to be quite good with in general). I will be improving it over time but wanted to push the code out there because I promised it to too many people. 9 | 10 | This current code (and the pretrained model) gets ~0.9 CIDEr, which would place it around spot #8 on the [codalab leaderboard](https://competitions.codalab.org/competitions/3221#results). I will submit the actual result soon. 11 | 12 |  13 | 14 | You can find a few more example results on the [demo page](http://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html). These results will improve a bit more once the last few bells and whistles are in place (e.g. beam search, ensembling, reranking). 15 | 16 | There's also a [fun video](https://vimeo.com/146492001) by [@kcimc](https://twitter.com/kcimc), where he runs a neuraltalk2 pretrained model in real time on his laptop during a walk in Amsterdam. 17 | 18 | ### Requirements 19 | 20 | 21 | #### For evaluation only 22 | 23 | This code is written in Lua and requires [Torch](http://torch.ch/). 
If you're on Ubuntu, installing Torch in your home directory may look something like: 24 | 25 | ```bash 26 | $ curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash 27 | $ git clone https://github.com/torch/distro.git ~/torch --recursive 28 | $ cd ~/torch; 29 | $ ./install.sh # and enter "yes" at the end to modify your bashrc 30 | $ source ~/.bashrc 31 | ``` 32 | 33 | See the Torch installation documentation for more details. After Torch is installed we need to get a few more packages using [LuaRocks](https://luarocks.org/) (which already came with the Torch install). In particular: 34 | 35 | ```bash 36 | $ luarocks install nn 37 | $ luarocks install nngraph 38 | $ luarocks install image 39 | ``` 40 | 41 | We're also going to need the [cjson](http://www.kyne.com.au/~mark/software/lua-cjson-manual.html) library so that we can load/save json files. Follow their [download link](http://www.kyne.com.au/~mark/software/lua-cjson.php) and then look under their section 2.4 for easy luarocks install. 42 | 43 | If you'd like to run on an NVIDIA GPU using CUDA (which you really, really want to if you're training a model, since we're using a VGGNet), you'll of course need a GPU, and you will have to install the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit). Then get the `cutorch` and `cunn` packages: 44 | 45 | ```bash 46 | $ luarocks install cutorch 47 | $ luarocks install cunn 48 | ``` 49 | 50 | If you'd like to use the cudnn backend (the pretrained checkpoint does), you also have to install [cudnn](https://github.com/soumith/cudnn.torch). First follow the link to [NVIDIA website](https://developer.nvidia.com/cuDNN), register with them and download the cudnn library. Then make sure you adjust your `LD_LIBRARY_PATH` to point to the `lib64` folder that contains the library (e.g. `libcudnn.so.7.0.64`). Then git clone the `cudnn.torch` repo, `cd` inside and do `luarocks make cudnn-scm-1.rockspec` to build the Torch bindings. 51 | 52 | #### For training 53 | 54 | If you'd like to train your models you will need [loadcaffe](https://github.com/szagoruyko/loadcaffe), since we are using the VGGNet. First, make sure you follow their instructions to install `protobuf` and everything else (e.g. `sudo apt-get install libprotobuf-dev protobuf-compiler`), and then install via luarocks: 55 | 56 | ```bash 57 | luarocks install loadcaffe 58 | ``` 59 | 60 | Finally, you will also need to install [torch-hdf5](https://github.com/deepmind/torch-hdf5), and [h5py](http://www.h5py.org/), since we will be using hdf5 files to store the preprocessed data. 61 | 62 | Phew! Quite a few dependencies, sorry no easy way around it :\ 63 | 64 | ### I just want to caption images 65 | 66 | In this case you want to run the evaluation script on a pretrained model checkpoint. 67 | I trained a decent one on the [MS COCO dataset](http://mscoco.org/) that you can run on your images. 68 | The pretrained checkpoint can be downloaded here: [pretrained checkpoint link](http://cs.stanford.edu/people/karpathy/neuraltalk2/checkpoint_v1.zip) (600MB). It's large because it contains the weights of a finetuned VGGNet. Now place all your images of interest into a folder, e.g. `blah`, and run 69 | the eval script: 70 | 71 | ```bash 72 | $ th eval.lua -model /path/to/model -image_folder /path/to/image/directory -num_images 10 73 | ``` 74 | 75 | This tells the `eval` script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing `batch_size` (default = 1). 
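For example, a larger-batch run over a whole folder might look like this (a sketch; the model path, folder path and batch size are placeholders to adapt to your setup):

```bash
$ th eval.lua -model /path/to/model -image_folder /path/to/image/directory -num_images -1 -batch_size 16
```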
Use `-num_images -1` to process all images. The eval script will create a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface: 76 | 77 | ```bash 78 | $ cd vis 79 | $ python -m SimpleHTTPServer 80 | ``` 81 | 82 | Now visit `localhost:8000` in your browser and you should see your predicted captions. 83 | 84 | You can see an [example visualization demo page here](http://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html). 85 | 86 | **Running in Docker**. If you'd like to avoid dependency nightmares, running the codebase from Docker might be a good option. There is one (third-party) [docker repo here](https://github.com/beeva-enriqueotero/docker-neuraltalk2). 87 | 88 | **"I only have CPU"**. Okay, in that case download the [cpu model checkpoint](http://cs.stanford.edu/people/karpathy/neuraltalk2/checkpoint_v1_cpu.zip). Make sure you run the eval script with `-gpuid -1` to tell the script to run on CPU. On my machine it takes a bit less than 1 second per image to caption in CPU mode. 89 | 90 | **Beam Search**. Beam search is enabled by default because it improves the quality of the argmax decoding search. However, it is also a little more expensive, so if you'd like to evaluate images faster at a small cost in performance, use `-beam_size 1`. For example, in one of my experiments beam size 2 gives CIDEr 0.922, and beam size 1 gives CIDEr 0.886. 91 | 92 | **Running on MSCOCO images**. If you train on MSCOCO (see how below), you will have generated preprocessed MSCOCO images, which you can use directly in the eval script. In this case simply leave out the `image_folder` option to the eval script and instead pass in the `input_h5` and `input_json` paths to your preprocessed files. This will make more sense once you read the section below :) 93 | 94 | **Running a live demo**. With OpenCV 3 installed you can caption a video stream from a camera in real time. Follow the instructions in [torch-opencv](https://github.com/VisionLabs/torch-opencv/wiki/installation) to install it, then run `videocaptioning.lua` the same way as `eval.lua`. Note that only the central crop of each frame will be captioned. 95 | 96 | ### I'd like to train my own network on MS COCO 97 | 98 | Great, first we need to do some preprocessing. Head over to the `coco/` folder and run the IPython notebook to download the dataset and do some very simple preprocessing. The notebook will combine the train/val data together and create a very simple and small json file that contains a large list of image paths and raw captions for each image, of the form: 99 | 100 | ``` 101 | [{ "file_path": "path/img.jpg", "captions": ["a caption", "a second caption", ...] }, ...] 102 | ``` 103 | 104 | Once we have this, we're ready to invoke the `prepro.py` script, which will read all of this in and create a dataset (an hdf5 file and a json file) ready for consumption in the Lua code. For example, for MS COCO we can run the prepro script as follows: 105 | 106 | ```bash 107 | $ python prepro.py --input_json coco/coco_raw.json --num_val 5000 --num_test 5000 --images_root coco/images --word_count_threshold 5 --output_json coco/cocotalk.json --output_h5 coco/cocotalk.h5 108 | ``` 109 | 110 | This tells the script to read in all the data (the images and the captions), allocate 5000 images each for the val and test splits, and map all words that occur <= 5 times to a special `UNK` token. The resulting `json` and `h5` files are about 30GB and contain everything we want to know about the dataset.
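If you want to sanity-check the preprocessed files before training, a quick look from Python might go something like this (a minimal sketch using h5py; the dataset and field names are the ones `misc/DataLoader.lua` reads, and the paths assume the command above):

```python
import json
import h5py

# the json file carries the vocabulary and per-image metadata (id, file_path, split)
info = json.load(open('coco/cocotalk.json'))
print('vocab size:', len(info['ix_to_word']))
print('first image entry:', info['images'][0])

# the h5 file carries the resized images and the encoded caption labels
with h5py.File('coco/cocotalk.h5', 'r') as f:
    print('images:', f['images'].shape)                 # (num_images, 3, 256, 256) uint8
    print('labels:', f['labels'].shape)                 # (num_captions, seq_length) word indices
    print('label_start_ix:', f['label_start_ix'][:5])   # per-image caption ranges (1-indexed, for Lua)
    print('label_end_ix:', f['label_end_ix'][:5])
```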
111 | 112 | **Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix; it involves manually replacing one image in the dataset. 113 | 114 | The last thing we need is the [VGG-16 Caffe checkpoint](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) (under the Models section, the "16-layer model" bullet point). Put the two files (the prototxt configuration file and the proto binary of weights) somewhere (e.g. a `model` directory), and we're ready to train! 115 | 116 | ```bash 117 | $ th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json 118 | ``` 119 | 120 | The train script will take over and start dumping checkpoints into the folder specified by `checkpoint_path` (default = current folder). You also have to point the train script to the VGGNet protos (see the options inside `train.lua`). 121 | 122 | If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross entropy loss, use the `-language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into the `coco-caption` directory. 123 | 124 | **A few notes on training.** To give you an idea, with the default settings one epoch of MS COCO images is about 7500 iterations. One epoch of training (with no finetuning, which is the default) takes about 1 hour and results in a validation loss of ~2.7 and a CIDEr score of ~0.4. By iteration 70,000 CIDEr climbs up to about 0.6 (validation loss at about 2.5) and then tops out at a bit below 0.7 CIDEr. After that, additional improvements are only possible by turning on CNN finetuning. I like to do the training in stages, where I first train with no finetuning, and then restart the train script with `-finetune_cnn_after 0` to start finetuning right away, using the `-start_from` flag to continue from the previous model checkpoint. You'll see your score rise up to about 0.9 CIDEr over ~2 days or so (on MS COCO). 125 | 126 | ### I'd like to train on my own data 127 | 128 | No problem, create a json file in the exact same form as before, describing your JPG files: 129 | 130 | ``` 131 | [{ "file_path": "path/img.jpg", "captions": ["a caption", "a similar caption", ...] }, ...] 132 | ``` 133 | 134 | and invoke the `prepro.py` script to preprocess all the images and data into an hdf5 file and a json file. Then invoke `train.lua` (see the detailed options inside the code). 135 | 136 | ### I'd like to distribute my GPU trained checkpoints for CPU 137 | 138 | Use the script `convert_checkpoint_gpu_to_cpu.lua` to convert your GPU checkpoints to be usable on CPU. See the inline documentation for why this separate script is needed. For example: 139 | 140 | ```bash 141 | th convert_checkpoint_gpu_to_cpu.lua gpu_checkpoint.t7 142 | ``` 143 | 144 | This will write the file `gpu_checkpoint.t7_cpu.t7`, which you can now run with `-gpuid -1` in the eval script. 145 | 146 | ### License 147 | 148 | BSD License. 149 | 150 | ### Acknowledgements 151 | 152 | Parts of this code were written in collaboration with my labmate [Justin Johnson](http://cs.stanford.edu/people/jcjohns/). 153 | 154 | I'm very grateful for [NVIDIA](https://developer.nvidia.com/deep-learning)'s support in providing GPUs that made this work possible. 155 | 156 | I'm also very grateful to the maintainers of Torch for maintaining a wonderful deep learning library.
157 | -------------------------------------------------------------------------------- /coco-caption/myeval.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script should be run from root directory of this codebase: 3 | https://github.com/tylin/coco-caption 4 | """ 5 | 6 | from pycocotools.coco import COCO 7 | from pycocoevalcap.eval import COCOEvalCap 8 | import json 9 | from json import encoder 10 | encoder.FLOAT_REPR = lambda o: format(o, '.3f') 11 | import sys 12 | 13 | input_json = sys.argv[1] 14 | 15 | 16 | annFile = 'annotations/captions_val2014.json' 17 | coco = COCO(annFile) 18 | valids = coco.getImgIds() 19 | 20 | checkpoint = json.load(open(input_json, 'r')) 21 | preds = checkpoint['val_predictions'] 22 | 23 | # filter results to only those in MSCOCO validation set (will be about a third) 24 | preds_filt = [p for p in preds if p['image_id'] in valids] 25 | print 'using %d/%d predictions' % (len(preds_filt), len(preds)) 26 | json.dump(preds_filt, open('tmp.json', 'w')) # serialize to temporary json file. Sigh, COCO API... 27 | 28 | resFile = 'tmp.json' 29 | cocoRes = coco.loadRes(resFile) 30 | cocoEval = COCOEvalCap(coco, cocoRes) 31 | cocoEval.params['image_id'] = cocoRes.getImgIds() 32 | cocoEval.evaluate() 33 | 34 | # create output dictionary 35 | out = {} 36 | for metric, score in cocoEval.eval.items(): 37 | out[metric] = score 38 | # serialize to file, to be read from Lua 39 | json.dump(out, open(input_json + '_out.json', 'w')) 40 | 41 | -------------------------------------------------------------------------------- /coco/coco_preprocess.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# COCO data preprocessing\n", 8 | "\n", 9 | "This code will download the caption anotations for coco and preprocess them into an hdf5 file and a json file. \n", 10 | "\n", 11 | "These will then be read by the COCO data loader in Lua and trained on." 
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "0" 25 | ] 26 | }, 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "output_type": "execute_result" 30 | } 31 | ], 32 | "source": [ 33 | "# lets download the annotations from http://mscoco.org/dataset/#download\n", 34 | "import os\n", 35 | "os.system('wget http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip') # ~19MB" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "0" 49 | ] 50 | }, 51 | "execution_count": 3, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "os.system('unzip captions_train-val2014.zip')" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "import json\n", 69 | "val = json.load(open('annotations/captions_val2014.json', 'r'))\n", 70 | "train = json.load(open('annotations/captions_train2014.json', 'r'))" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[u'info', u'images', u'licenses', u'annotations']\n", 85 | "{u'description': u'This is stable 1.0 version of the 2014 MS COCO dataset.', u'url': u'http://mscoco.org', u'version': u'1.0', u'year': 2014, u'contributor': u'Microsoft COCO group', u'date_created': u'2015-01-27 09:11:52.357475'}\n", 86 | "40504\n", 87 | "202654\n", 88 | "{u'license': 3, u'file_name': u'COCO_val2014_000000391895.jpg', u'coco_url': u'http://mscoco.org/images/391895', u'height': 360, u'width': 640, u'date_captured': u'2013-11-14 11:18:45', u'flickr_url': u'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', u'id': 391895}\n", 89 | "{u'image_id': 203564, u'id': 37, u'caption': u'A bicycle replica with a clock as the front wheel.'}\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "print val.keys()\n", 95 | "print val['info']\n", 96 | "print len(val['images'])\n", 97 | "print len(val['annotations'])\n", 98 | "print val['images'][0]\n", 99 | "print val['annotations'][0]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "import json\n", 111 | "import os\n", 112 | "\n", 113 | "# combine all images and annotations together\n", 114 | "imgs = val['images'] + train['images']\n", 115 | "annots = val['annotations'] + train['annotations']\n", 116 | "\n", 117 | "# for efficiency lets group annotations by image\n", 118 | "itoa = {}\n", 119 | "for a in annots:\n", 120 | " imgid = a['image_id']\n", 121 | " if not imgid in itoa: itoa[imgid] = []\n", 122 | " itoa[imgid].append(a)\n", 123 | "\n", 124 | "# create the json blob\n", 125 | "out = []\n", 126 | "for i,img in enumerate(imgs):\n", 127 | " imgid = img['id']\n", 128 | " \n", 129 | " # coco specific here, they store train/val images separately\n", 130 | " loc = 'train2014' if 'train' in img['file_name'] else 'val2014'\n", 131 | " \n", 132 | " jimg = {}\n", 133 | " jimg['file_path'] = os.path.join(loc, img['file_name'])\n", 134 | " jimg['id'] = imgid\n", 135 | " \n", 136 | " sents = 
[]\n", 137 | " annotsi = itoa[imgid]\n", 138 | " for a in annotsi:\n", 139 | " sents.append(a['caption'])\n", 140 | " jimg['captions'] = sents\n", 141 | " out.append(jimg)\n", 142 | " \n", 143 | "json.dump(out, open('coco_raw.json', 'w'))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 7, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "{'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. ', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895}\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# lets see what they look like\n", 163 | "print out[0]" 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 2", 170 | "language": "python", 171 | "name": "python2" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 2 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython2", 183 | "version": "2.7.6" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 0 188 | } 189 | -------------------------------------------------------------------------------- /convert_checkpoint_gpu_to_cpu.lua: -------------------------------------------------------------------------------- 1 | --[[ 2 | A quick script for converting GPU checkpoints to CPU checkpoints. 3 | CPU checkpoints are not saved by the training script automatically 4 | because of Torch cloning limitations. In particular, it is not 5 | possible to clone a GPU model on CPU, something like :clone():float() 6 | with a single call, without needing extra memory on the GPU. If this 7 | existed then it would be possible to do this inside the training 8 | script without worrying about blowing up the memory. 9 | ]]-- 10 | 11 | require 'torch' 12 | require 'nn' 13 | require 'nngraph' 14 | require 'cutorch' 15 | require 'cunn' 16 | require 'cudnn' -- only needed if the loaded model used cudnn as backend. otherwise can be commented out 17 | -- local imports 18 | require 'misc.LanguageModel' 19 | 20 | cmd = torch.CmdLine() 21 | cmd:text() 22 | cmd:text('Convert a GPU checkpoint to CPU checkpoint.') 23 | cmd:text() 24 | cmd:text('Options') 25 | cmd:argument('-model','GPU model checkpoint to convert') 26 | cmd:option('-gpuid',0,'which gpu to use. 
-1 = use CPU') 27 | cmd:text() 28 | 29 | -- parse input params 30 | local opt = cmd:parse(arg) 31 | torch.manualSeed(123) 32 | torch.setdefaulttensortype('torch.FloatTensor') -- for CPU 33 | cutorch.setDevice(opt.gpuid + 1) -- note +1 because lua is 1-indexed 34 | 35 | local checkpoint = torch.load(opt.model) 36 | local protos = checkpoint.protos 37 | 38 | ------------------------------------------------------------------------------- 39 | -- these functions are adapted from Michael Partheil 40 | -- https://groups.google.com/forum/#!topic/torch7/i8sJYlgQPeA 41 | -- the problem is that you can't call :float() on cudnn module, it won't convert 42 | function replaceModules(net, orig_class_name, replacer) 43 | local nodes, container_nodes = net:findModules(orig_class_name) 44 | for i = 1, #nodes do 45 | for j = 1, #(container_nodes[i].modules) do 46 | if container_nodes[i].modules[j] == nodes[i] then 47 | local orig_mod = container_nodes[i].modules[j] 48 | print('replacing a cudnn module with nn equivalent...') 49 | print(orig_mod) 50 | container_nodes[i].modules[j] = replacer(orig_mod) 51 | end 52 | end 53 | end 54 | end 55 | function cudnnNetToCpu(net) 56 | local net_cpu = net:clone():float() 57 | replaceModules(net_cpu, 'cudnn.SpatialConvolution', 58 | function(orig_mod) 59 | local cpu_mod = nn.SpatialConvolution(orig_mod.nInputPlane, orig_mod.nOutputPlane, 60 | orig_mod.kW, orig_mod.kH, orig_mod.dW, orig_mod.dH, orig_mod.padW, orig_mod.padH) 61 | cpu_mod.weight:copy(orig_mod.weight) 62 | cpu_mod.bias:copy(orig_mod.bias) 63 | cpu_mod.gradWeight = nil -- sanitize for thinner checkpoint 64 | cpu_mod.gradBias = nil -- sanitize for thinner checkpoint 65 | return cpu_mod 66 | end) 67 | replaceModules(net_cpu, 'cudnn.SpatialMaxPooling', 68 | function(orig_mod) 69 | local cpu_mod = nn.SpatialMaxPooling(orig_mod.kW, orig_mod.kH, orig_mod.dW, orig_mod.dH, 70 | orig_mod.padW, orig_mod.padH) 71 | return cpu_mod 72 | end) 73 | replaceModules(net_cpu, 'cudnn.ReLU', function() return nn.ReLU() end) 74 | return net_cpu 75 | end 76 | ------------------------------------------------------------------------------- 77 | 78 | -- convert the networks to be CPU models 79 | for k,v in pairs(protos) do 80 | print('converting ' .. k .. ' to CPU') 81 | if k == 'cnn' then 82 | -- the cnn is a troublemaker 83 | local cpu_cnn = cudnnNetToCpu(v) 84 | protos[k] = cpu_cnn 85 | elseif k == 'lm' then 86 | local debugger = require('fb.debugger'); debugger:enter() 87 | v.clones = nil -- sanitize the clones inside the language model (if present just in case. but they shouldnt be) 88 | v.lookup_tables = nil 89 | protos[k]:float() -- ship to CPU 90 | else 91 | error('error: strange module in protos: ' .. k) 92 | end 93 | end 94 | 95 | local savefile = opt.model .. '_cpu.t7' -- append "cpu.t7" to filename 96 | torch.save(savefile, checkpoint) 97 | print('saved ' .. savefile) 98 | 99 | -------------------------------------------------------------------------------- /cv/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Cross-validation utilities 3 | 4 | ### Starting workers on different GPUs 5 | 6 | I thought I should do a small code dump of my cross-validation utilities. My workflow is to run on a single machine with multiple GPUs. Each worker runs on one GPU, and I spawn workers with the `spawn.sh` script, e.g.: 7 | 8 | ```bash 9 | $ ./spawn 0 1 2 3 4 5 6 10 | ``` 11 | 12 | spawns 7 workers using GPUs 0-6 (inclusive), all running in screen sessions `ak0`...`ak6`. E.g. 
to attach to one of these it would be `screen -r ak0`. And `CTRL+a, d` to detach from a screen session and `CTRL+a, k, y` to kill a worker. Also `./killall.sh` to kill all workers. 13 | 14 | You can see that `spawn.sh` calls `runworker.sh` in a screen session. The runworker script can modify the paths (since LD_LIBRARY_PATH does not trasfer to inside screen sessions), and calls `driver.py`. 15 | 16 | Finally, `driver.py` runs an infinite loop of actually calling the training script, and this is where I set up all the cross-validation ranges. Also note, very importantly, how the `train.lua` script is called, with 17 | 18 | ```python 19 | cmd = 'CUDA_VISIBLE_DEVICES=%d th train.lua ' % (gpuid, ) 20 | ``` 21 | 22 | this is because otherwise Torch allocates a lot of memory on all GPUs on a single machine because it wants to support multigpu setups, but if you're only training on a single GPU you really want to use this flag to *hide* the other GPUs from each worker. 23 | 24 | Also note that I'm using the field `opt.id` to assign a unique identifier to each worker, based on the GPU it's running on and some random number, and current time, to distinguish each run. 25 | 26 | Have a look through my `driver.py` to get a sense of what it's doing. In my workflow I keep modifying this script and killing workers whenever I want to tune some of the cross-validation ranges. 27 | 28 | ### Playing with checkpoints that get written to files 29 | 30 | Finally, the IPython Notebook `inspect_cv.ipynb` gives you an idea about how I analyze the checkpoints that get written out by the workers. The notebook is *super-hacky* and not intended for plug and play use; I'm only putting it up in case this is useful to anyone to build on, and to get a sense for the kinds of analysis you might want to play with. 31 | 32 | ### Conclusion 33 | 34 | Overall, this system works quite well for me. My cluster machines run workers in screen sessions, these write checkpoints to a shared file system, and then I use notebooks to look at what hyperparameter ranges work well. Whatever works well I encode into `driver.py`, and then I restart the workers and iterate until things work well :) Hope some of this is useful & Good luck! -------------------------------------------------------------------------------- /cv/driver.py: -------------------------------------------------------------------------------- 1 | import os 2 | from random import uniform, randrange, choice 3 | import math 4 | import time 5 | import sys 6 | import json 7 | 8 | def encodev(v): 9 | if isinstance(v, float): 10 | return '%.3g' % v 11 | else: 12 | return str(v) 13 | 14 | assert len(sys.argv) > 1, 'specify gpu/rnn_size/num_layers!' 
15 | gpuid = int(sys.argv[1]) 16 | 17 | cmd = 'CUDA_VISIBLE_DEVICES=%d th train.lua ' % (gpuid, ) 18 | while True: 19 | time.sleep(1.1+uniform(0,1)) 20 | 21 | opt = {} 22 | opt['id'] = '%d-%0d-%d' % (gpuid, randrange(1000), int(time.time())) 23 | opt['gpuid'] = 0 24 | opt['seed'] = 123 25 | opt['val_images_use'] = 3200 26 | opt['save_checkpoint_every'] = 2500 27 | 28 | opt['max_iters'] = -1 # run forever 29 | opt['batch_size'] = 16 30 | 31 | #opt['checkpoint_path'] = 'checkpoints' 32 | 33 | opt['language_eval'] = 1 # do eval 34 | 35 | opt['optim'] = 'adam' 36 | opt['optim_alpha'] = 0.8 37 | opt['optim_beta'] = choice([0.995, 0.999]) 38 | opt['optim_epsilon'] = 1e-8 39 | opt['learning_rate'] = 10**uniform(-5.5,-4.5) 40 | 41 | opt['finetune_cnn_after'] = -1 # dont finetune 42 | opt['cnn_optim'] = 'adam' 43 | opt['cnn_optim_alpha'] = 0.8 44 | opt['cnn_optim_beta'] = 0.995 45 | opt['cnn_learning_rate'] = 10**uniform(-5.5,-4.25) 46 | 47 | opt['drop_prob_lm'] = 0.5 48 | 49 | opt['rnn_size'] = 512 50 | opt['input_encoding_size'] = 512 51 | 52 | opt['learning_rate_decay_start'] = -1 # dont decay 53 | opt['learning_rate_decay_every'] = 50000 54 | 55 | opt['input_json'] = '/scr/r6/karpathy/cocotalk.json' 56 | opt['input_h5'] = '/scr/r6/karpathy/cocotalk.h5' 57 | 58 | #opt['start_from'] = '/scr/r6/karpathy/neuraltalk2_checkpoints/good6/model_id0-565-1447975213.t7' 59 | 60 | optscmd = ''.join([' -' + k + ' ' + encodev(v) for k,v in opt.iteritems()]) 61 | exe = cmd + optscmd + ' | tee /scr/r6/karpathy/neuraltalk2_checkpoints/out' + opt['id'] + '.txt' 62 | print exe 63 | os.system(exe) 64 | 65 | -------------------------------------------------------------------------------- /cv/killall.sh: -------------------------------------------------------------------------------- 1 | screen -ls | grep ak | cut -d. -f1 | awk '{print $1}' | xargs kill 2 | 3 | -------------------------------------------------------------------------------- /cv/runworker.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | echo "worker $1 is starting. Exporting LD_LIBRARY_PATH then running driver.py" 4 | export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TORCHPATH/lib:/usr/local/lib:/usr/local/cuda/lib64:/home/stanford/cudnn_r3:/home/stanford/cudnn_r3/lib64 5 | python driver.py $1 6 | -------------------------------------------------------------------------------- /cv/spawn.sh: -------------------------------------------------------------------------------- 1 | # will spawn workers on the given GPU ids, in screen sessions prefixed with "ak" 2 | for i in "$@" 3 | do 4 | 5 | echo "spawning worker on GPU $i..." 
6 | screen -S ak$i -d -m ./runworker.sh $i 7 | 8 | sleep 2 9 | done 10 | 11 | 12 | -------------------------------------------------------------------------------- /eval.lua: -------------------------------------------------------------------------------- 1 | require 'torch' 2 | require 'nn' 3 | require 'nngraph' 4 | -- exotics 5 | require 'loadcaffe' 6 | -- local imports 7 | local utils = require 'misc.utils' 8 | require 'misc.DataLoader' 9 | require 'misc.DataLoaderRaw' 10 | require 'misc.LanguageModel' 11 | local net_utils = require 'misc.net_utils' 12 | 13 | ------------------------------------------------------------------------------- 14 | -- Input arguments and options 15 | ------------------------------------------------------------------------------- 16 | cmd = torch.CmdLine() 17 | cmd:text() 18 | cmd:text('Train an Image Captioning model') 19 | cmd:text() 20 | cmd:text('Options') 21 | 22 | -- Input paths 23 | cmd:option('-model','','path to model to evaluate') 24 | -- Basic options 25 | cmd:option('-batch_size', 1, 'if > 0 then overrule, otherwise load from checkpoint.') 26 | cmd:option('-num_images', 100, 'how many images to use when periodically evaluating the loss? (-1 = all)') 27 | cmd:option('-language_eval', 0, 'Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 28 | cmd:option('-dump_images', 1, 'Dump images into vis/imgs folder for vis? (1=yes,0=no)') 29 | cmd:option('-dump_json', 1, 'Dump json with predictions into vis folder? (1=yes,0=no)') 30 | cmd:option('-dump_path', 0, 'Write image paths along with predictions into vis json? (1=yes,0=no)') 31 | -- Sampling options 32 | cmd:option('-sample_max', 1, '1 = sample argmax words. 0 = sample from distributions.') 33 | cmd:option('-beam_size', 2, 'used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 34 | cmd:option('-temperature', 1.0, 'temperature when sampling from distributions (i.e. when sample_max = 0). Lower = "safer" predictions.') 35 | -- For evaluation on a folder of images: 36 | cmd:option('-image_folder', '', 'If this is nonempty then will predict on the images in this folder path') 37 | cmd:option('-image_root', '', 'In case the image paths have to be preprended with a root path to an image folder') 38 | -- For evaluation on MSCOCO images from some split: 39 | cmd:option('-input_h5','','path to the h5file containing the preprocessed dataset. empty = fetch from model checkpoint.') 40 | cmd:option('-input_json','','path to the json file containing additional info and vocab. empty = fetch from model checkpoint.') 41 | cmd:option('-split', 'test', 'if running on MSCOCO images, which split to use: val|test|train') 42 | cmd:option('-coco_json', '', 'if nonempty then use this file in DataLoaderRaw (see docs there). Used only in MSCOCO test evaluation, where we have a specific json file of only test set images.') 43 | -- misc 44 | cmd:option('-backend', 'cudnn', 'nn|cudnn') 45 | cmd:option('-id', 'evalscript', 'an id identifying this run/job. used only if language_eval = 1 for appending to intermediate files') 46 | cmd:option('-seed', 123, 'random number generator seed to use') 47 | cmd:option('-gpuid', 0, 'which gpu to use. 
-1 = use CPU') 48 | cmd:text() 49 | 50 | ------------------------------------------------------------------------------- 51 | -- Basic Torch initializations 52 | ------------------------------------------------------------------------------- 53 | local opt = cmd:parse(arg) 54 | torch.manualSeed(opt.seed) 55 | torch.setdefaulttensortype('torch.FloatTensor') -- for CPU 56 | 57 | if opt.gpuid >= 0 then 58 | require 'cutorch' 59 | require 'cunn' 60 | if opt.backend == 'cudnn' then require 'cudnn' end 61 | cutorch.manualSeed(opt.seed) 62 | cutorch.setDevice(opt.gpuid + 1) -- note +1 because lua is 1-indexed 63 | end 64 | 65 | ------------------------------------------------------------------------------- 66 | -- Load the model checkpoint to evaluate 67 | ------------------------------------------------------------------------------- 68 | assert(string.len(opt.model) > 0, 'must provide a model') 69 | local checkpoint = torch.load(opt.model) 70 | -- override and collect parameters 71 | if string.len(opt.input_h5) == 0 then opt.input_h5 = checkpoint.opt.input_h5 end 72 | if string.len(opt.input_json) == 0 then opt.input_json = checkpoint.opt.input_json end 73 | if opt.batch_size == 0 then opt.batch_size = checkpoint.opt.batch_size end 74 | local fetch = {'rnn_size', 'input_encoding_size', 'drop_prob_lm', 'cnn_proto', 'cnn_model', 'seq_per_img'} 75 | for k,v in pairs(fetch) do 76 | opt[v] = checkpoint.opt[v] -- copy over options from model 77 | end 78 | local vocab = checkpoint.vocab -- ix -> word mapping 79 | 80 | ------------------------------------------------------------------------------- 81 | -- Create the Data Loader instance 82 | ------------------------------------------------------------------------------- 83 | local loader 84 | if string.len(opt.image_folder) == 0 then 85 | loader = DataLoader{h5_file = opt.input_h5, json_file = opt.input_json} 86 | else 87 | loader = DataLoaderRaw{folder_path = opt.image_folder, coco_json = opt.coco_json} 88 | end 89 | 90 | ------------------------------------------------------------------------------- 91 | -- Load the networks from model checkpoint 92 | ------------------------------------------------------------------------------- 93 | local protos = checkpoint.protos 94 | protos.expander = nn.FeatExpander(opt.seq_per_img) 95 | protos.crit = nn.LanguageModelCriterion() 96 | protos.lm:createClones() -- reconstruct clones inside the language model 97 | if opt.gpuid >= 0 then for k,v in pairs(protos) do v:cuda() end end 98 | 99 | ------------------------------------------------------------------------------- 100 | -- Evaluation fun(ction) 101 | ------------------------------------------------------------------------------- 102 | local function eval_split(split, evalopt) 103 | local verbose = utils.getopt(evalopt, 'verbose', true) 104 | local num_images = utils.getopt(evalopt, 'num_images', true) 105 | 106 | protos.cnn:evaluate() 107 | protos.lm:evaluate() 108 | loader:resetIterator(split) -- rewind iteator back to first datapoint in the split 109 | local n = 0 110 | local loss_sum = 0 111 | local loss_evals = 0 112 | local predictions = {} 113 | while true do 114 | 115 | -- fetch a batch of data 116 | local data = loader:getBatch{batch_size = opt.batch_size, split = split, seq_per_img = opt.seq_per_img} 117 | data.images = net_utils.prepro(data.images, false, opt.gpuid >= 0) -- preprocess in place, and don't augment 118 | n = n + data.images:size(1) 119 | 120 | -- forward the model to get loss 121 | local feats = protos.cnn:forward(data.images) 122 | 123 
| -- evaluate loss if we have the labels 124 | local loss = 0 125 | if data.labels then 126 | local expanded_feats = protos.expander:forward(feats) 127 | local logprobs = protos.lm:forward{expanded_feats, data.labels} 128 | loss = protos.crit:forward(logprobs, data.labels) 129 | loss_sum = loss_sum + loss 130 | loss_evals = loss_evals + 1 131 | end 132 | 133 | -- forward the model to also get generated samples for each image 134 | local sample_opts = { sample_max = opt.sample_max, beam_size = opt.beam_size, temperature = opt.temperature } 135 | local seq = protos.lm:sample(feats, sample_opts) 136 | local sents = net_utils.decode_sequence(vocab, seq) 137 | for k=1,#sents do 138 | local entry = {image_id = data.infos[k].id, caption = sents[k]} 139 | if opt.dump_path == 1 then 140 | entry.file_name = data.infos[k].file_path 141 | end 142 | table.insert(predictions, entry) 143 | if opt.dump_images == 1 then 144 | -- dump the raw image to vis/ folder 145 | local cmd = 'cp "' .. path.join(opt.image_root, data.infos[k].file_path) .. '" vis/imgs/img' .. #predictions .. '.jpg' -- bit gross 146 | print(cmd) 147 | os.execute(cmd) -- dont think there is cleaner way in Lua 148 | end 149 | if verbose then 150 | print(string.format('image %s: %s', entry.image_id, entry.caption)) 151 | end 152 | end 153 | 154 | -- if we wrapped around the split or used up val imgs budget then bail 155 | local ix0 = data.bounds.it_pos_now 156 | local ix1 = math.min(data.bounds.it_max, num_images) 157 | if verbose then 158 | print(string.format('evaluating performance... %d/%d (%f)', ix0-1, ix1, loss)) 159 | end 160 | 161 | if data.bounds.wrapped then break end -- the split ran out of data, lets break out 162 | if num_images >= 0 and n >= num_images then break end -- we've used enough images 163 | end 164 | 165 | local lang_stats 166 | if opt.language_eval == 1 then 167 | lang_stats = net_utils.language_eval(predictions, opt.id) 168 | end 169 | 170 | return loss_sum/loss_evals, predictions, lang_stats 171 | end 172 | 173 | local loss, split_predictions, lang_stats = eval_split(opt.split, {num_images = opt.num_images}) 174 | print('loss: ', loss) 175 | if lang_stats then 176 | print(lang_stats) 177 | end 178 | 179 | if opt.dump_json == 1 then 180 | -- dump the json 181 | utils.write_json('vis/vis.json', split_predictions) 182 | end 183 | -------------------------------------------------------------------------------- /misc/DataLoader.lua: -------------------------------------------------------------------------------- 1 | require 'hdf5' 2 | local utils = require 'misc.utils' 3 | 4 | local DataLoader = torch.class('DataLoader') 5 | 6 | function DataLoader:__init(opt) 7 | 8 | -- load the json file which contains additional information about the dataset 9 | print('DataLoader loading json file: ', opt.json_file) 10 | self.info = utils.read_json(opt.json_file) 11 | self.ix_to_word = self.info.ix_to_word 12 | self.vocab_size = utils.count_keys(self.ix_to_word) 13 | print('vocab size is ' .. 
self.vocab_size) 14 | 15 | -- open the hdf5 file 16 | print('DataLoader loading h5 file: ', opt.h5_file) 17 | self.h5_file = hdf5.open(opt.h5_file, 'r') 18 | 19 | -- extract image size from dataset 20 | local images_size = self.h5_file:read('/images'):dataspaceSize() 21 | assert(#images_size == 4, '/images should be a 4D tensor') 22 | assert(images_size[3] == images_size[4], 'width and height must match') 23 | self.num_images = images_size[1] 24 | self.num_channels = images_size[2] 25 | self.max_image_size = images_size[3] 26 | print(string.format('read %d images of size %dx%dx%d', self.num_images, 27 | self.num_channels, self.max_image_size, self.max_image_size)) 28 | 29 | -- load in the sequence data 30 | local seq_size = self.h5_file:read('/labels'):dataspaceSize() 31 | self.seq_length = seq_size[2] 32 | print('max sequence length in data is ' .. self.seq_length) 33 | -- load the pointers in full to RAM (should be small enough) 34 | self.label_start_ix = self.h5_file:read('/label_start_ix'):all() 35 | self.label_end_ix = self.h5_file:read('/label_end_ix'):all() 36 | 37 | -- separate out indexes for each of the provided splits 38 | self.split_ix = {} 39 | self.iterators = {} 40 | for i,img in pairs(self.info.images) do 41 | local split = img.split 42 | if not self.split_ix[split] then 43 | -- initialize new split 44 | self.split_ix[split] = {} 45 | self.iterators[split] = 1 46 | end 47 | table.insert(self.split_ix[split], i) 48 | end 49 | for k,v in pairs(self.split_ix) do 50 | print(string.format('assigned %d images to split %s', #v, k)) 51 | end 52 | end 53 | 54 | function DataLoader:resetIterator(split) 55 | self.iterators[split] = 1 56 | end 57 | 58 | function DataLoader:getVocabSize() 59 | return self.vocab_size 60 | end 61 | 62 | function DataLoader:getVocab() 63 | return self.ix_to_word 64 | end 65 | 66 | function DataLoader:getSeqLength() 67 | return self.seq_length 68 | end 69 | 70 | --[[ 71 | Split is a string identifier (e.g. train|val|test) 72 | Returns a batch of data: 73 | - X (N,3,H,W) containing the images 74 | - y (L,M) containing the captions as columns (which is better for contiguous memory during training) 75 | - info table of length N, containing additional information 76 | The data is iterated linearly in order. Iterators for any split can be reset manually with resetIterator() 77 | --]] 78 | function DataLoader:getBatch(opt) 79 | local split = utils.getopt(opt, 'split') -- lets require that user passes this in, for safety 80 | local batch_size = utils.getopt(opt, 'batch_size', 5) -- how many images get returned at one time (to go through CNN) 81 | local seq_per_img = utils.getopt(opt, 'seq_per_img', 5) -- number of sequences to return per image 82 | 83 | local split_ix = self.split_ix[split] 84 | assert(split_ix, 'split ' .. split .. ' not found.') 85 | 86 | -- pick an index of the datapoint to load next 87 | local img_batch_raw = torch.ByteTensor(batch_size, 3, 256, 256) 88 | local label_batch = torch.LongTensor(batch_size * seq_per_img, self.seq_length) 89 | local max_index = #split_ix 90 | local wrapped = false 91 | local infos = {} 92 | for i=1,batch_size do 93 | 94 | local ri = self.iterators[split] -- get next index from iterator 95 | local ri_next = ri + 1 -- increment iterator 96 | if ri_next > max_index then ri_next = 1; wrapped = true end -- wrap back around 97 | self.iterators[split] = ri_next 98 | ix = split_ix[ri] 99 | assert(ix ~= nil, 'bug: split ' .. split .. ' was accessed out of bounds with ' .. 
ri) 100 | 101 | -- fetch the image from h5 102 | local img = self.h5_file:read('/images'):partial({ix,ix},{1,self.num_channels}, 103 | {1,self.max_image_size},{1,self.max_image_size}) 104 | img_batch_raw[i] = img 105 | 106 | -- fetch the sequence labels 107 | local ix1 = self.label_start_ix[ix] 108 | local ix2 = self.label_end_ix[ix] 109 | local ncap = ix2 - ix1 + 1 -- number of captions available for this image 110 | assert(ncap > 0, 'an image does not have any label. this can be handled but right now isn\'t') 111 | local seq 112 | if ncap < seq_per_img then 113 | -- we need to subsample (with replacement) 114 | seq = torch.LongTensor(seq_per_img, self.seq_length) 115 | for q=1, seq_per_img do 116 | local ixl = torch.random(ix1,ix2) 117 | seq[{ {q,q} }] = self.h5_file:read('/labels'):partial({ixl, ixl}, {1,self.seq_length}) 118 | end 119 | else 120 | -- there is enough data to read a contiguous chunk, but subsample the chunk position 121 | local ixl = torch.random(ix1, ix2 - seq_per_img + 1) -- generates integer in the range 122 | seq = self.h5_file:read('/labels'):partial({ixl, ixl+seq_per_img-1}, {1,self.seq_length}) 123 | end 124 | local il = (i-1)*seq_per_img+1 125 | label_batch[{ {il,il+seq_per_img-1} }] = seq 126 | 127 | -- and record associated info as well 128 | local info_struct = {} 129 | info_struct.id = self.info.images[ix].id 130 | info_struct.file_path = self.info.images[ix].file_path 131 | table.insert(infos, info_struct) 132 | end 133 | 134 | local data = {} 135 | data.images = img_batch_raw 136 | data.labels = label_batch:transpose(1,2):contiguous() -- note: make label sequences go down as columns 137 | data.bounds = {it_pos_now = self.iterators[split], it_max = #split_ix, wrapped = wrapped} 138 | data.infos = infos 139 | return data 140 | end 141 | 142 | -------------------------------------------------------------------------------- /misc/DataLoaderRaw.lua: -------------------------------------------------------------------------------- 1 | --[[ 2 | Same as DataLoader but only requires a folder of images. 3 | Does not have an h5 dependency. 4 | Only used at test time. 5 | ]]-- 6 | 7 | local utils = require 'misc.utils' 8 | require 'lfs' 9 | require 'image' 10 | 11 | local DataLoaderRaw = torch.class('DataLoaderRaw') 12 | 13 | function DataLoaderRaw:__init(opt) 14 | local coco_json = utils.getopt(opt, 'coco_json', '') 15 | 16 | -- load the json file which contains additional information about the dataset 17 | print('DataLoaderRaw loading images from folder: ', opt.folder_path) 18 | 19 | self.files = {} 20 | self.ids = {} 21 | if string.len(opt.coco_json) > 0 then 22 | print('reading from ' .. opt.coco_json) 23 | -- read in filenames from the coco-style json file 24 | self.coco_annotation = utils.read_json(opt.coco_json) 25 | for k,v in pairs(self.coco_annotation.images) do 26 | local fullpath = path.join(opt.folder_path, v.file_name) 27 | table.insert(self.files, fullpath) 28 | table.insert(self.ids, v.id) 29 | end 30 | else 31 | -- read in all the filenames from the folder 32 | print('listing all images in directory ' .. 
opt.folder_path) 33 | local function isImage(f) 34 | local supportedExt = {'.jpg','.JPG','.jpeg','.JPEG','.png','.PNG','.ppm','.PPM'} 35 | for _,ext in pairs(supportedExt) do 36 | local _, end_idx = f:find(ext) 37 | if end_idx and end_idx == f:len() then 38 | return true 39 | end 40 | end 41 | return false 42 | end 43 | local n = 1 44 | for file in paths.files(opt.folder_path, isImage) do 45 | local fullpath = path.join(opt.folder_path, file) 46 | table.insert(self.files, fullpath) 47 | table.insert(self.ids, tostring(n)) -- just order them sequentially 48 | n=n+1 49 | end 50 | end 51 | 52 | self.N = #self.files 53 | print('DataLoaderRaw found ' .. self.N .. ' images') 54 | 55 | self.iterator = 1 56 | end 57 | 58 | function DataLoaderRaw:resetIterator() 59 | self.iterator = 1 60 | end 61 | 62 | --[[ 63 | Returns a batch of data: 64 | - X (N,3,256,256) containing the images as uint8 ByteTensor 65 | - info table of length N, containing additional information 66 | The data is iterated linearly in order 67 | --]] 68 | function DataLoaderRaw:getBatch(opt) 69 | local batch_size = utils.getopt(opt, 'batch_size', 5) -- how many images get returned at one time (to go through CNN) 70 | -- pick an index of the datapoint to load next 71 | local img_batch_raw = torch.ByteTensor(batch_size, 3, 256, 256) 72 | local max_index = self.N 73 | local wrapped = false 74 | local infos = {} 75 | for i=1,batch_size do 76 | local ri = self.iterator 77 | local ri_next = ri + 1 -- increment iterator 78 | if ri_next > max_index then ri_next = 1; wrapped = true end -- wrap back around 79 | self.iterator = ri_next 80 | 81 | -- load the image 82 | local img = image.load(self.files[ri], 3, 'byte') 83 | img_batch_raw[i] = image.scale(img, 256, 256) 84 | 85 | -- and record associated info as well 86 | local info_struct = {} 87 | info_struct.id = self.ids[ri] 88 | info_struct.file_path = self.files[ri] 89 | table.insert(infos, info_struct) 90 | end 91 | 92 | local data = {} 93 | data.images = img_batch_raw 94 | data.bounds = {it_pos_now = self.iterator, it_max = self.N, wrapped = wrapped} 95 | data.infos = infos 96 | return data 97 | end 98 | -------------------------------------------------------------------------------- /misc/LSTM.lua: -------------------------------------------------------------------------------- 1 | require 'nn' 2 | require 'nngraph' 3 | 4 | local LSTM = {} 5 | function LSTM.lstm(input_size, output_size, rnn_size, n, dropout) 6 | dropout = dropout or 0 7 | 8 | -- there will be 2*n+1 inputs 9 | local inputs = {} 10 | table.insert(inputs, nn.Identity()()) -- indices giving the sequence of symbols 11 | for L = 1,n do 12 | table.insert(inputs, nn.Identity()()) -- prev_c[L] 13 | table.insert(inputs, nn.Identity()()) -- prev_h[L] 14 | end 15 | 16 | local x, input_size_L 17 | local outputs = {} 18 | for L = 1,n do 19 | -- c,h from previos timesteps 20 | local prev_h = inputs[L*2+1] 21 | local prev_c = inputs[L*2] 22 | -- the input to this layer 23 | if L == 1 then 24 | x = inputs[1] 25 | input_size_L = input_size 26 | else 27 | x = outputs[(L-1)*2] 28 | if dropout > 0 then x = nn.Dropout(dropout)(x):annotate{name='drop_' .. 
L} end -- apply dropout, if any 29 | input_size_L = rnn_size 30 | end 31 | -- evaluate the input sums at once for efficiency 32 | local i2h = nn.Linear(input_size_L, 4 * rnn_size)(x):annotate{name='i2h_'..L} 33 | local h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h):annotate{name='h2h_'..L} 34 | local all_input_sums = nn.CAddTable()({i2h, h2h}) 35 | 36 | local reshaped = nn.Reshape(4, rnn_size)(all_input_sums) 37 | local n1, n2, n3, n4 = nn.SplitTable(2)(reshaped):split(4) 38 | -- decode the gates 39 | local in_gate = nn.Sigmoid()(n1) 40 | local forget_gate = nn.Sigmoid()(n2) 41 | local out_gate = nn.Sigmoid()(n3) 42 | -- decode the write inputs 43 | local in_transform = nn.Tanh()(n4) 44 | -- perform the LSTM update 45 | local next_c = nn.CAddTable()({ 46 | nn.CMulTable()({forget_gate, prev_c}), 47 | nn.CMulTable()({in_gate, in_transform}) 48 | }) 49 | -- gated cells form the output 50 | local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)}) 51 | 52 | table.insert(outputs, next_c) 53 | table.insert(outputs, next_h) 54 | end 55 | 56 | -- set up the decoder 57 | local top_h = outputs[#outputs] 58 | if dropout > 0 then top_h = nn.Dropout(dropout)(top_h):annotate{name='drop_final'} end 59 | local proj = nn.Linear(rnn_size, output_size)(top_h):annotate{name='decoder'} 60 | local logsoft = nn.LogSoftMax()(proj) 61 | table.insert(outputs, logsoft) 62 | 63 | return nn.gModule(inputs, outputs) 64 | end 65 | 66 | return LSTM 67 | 68 | -------------------------------------------------------------------------------- /misc/LanguageModel.lua: -------------------------------------------------------------------------------- 1 | require 'nn' 2 | local utils = require 'misc.utils' 3 | local net_utils = require 'misc.net_utils' 4 | local LSTM = require 'misc.LSTM' 5 | 6 | ------------------------------------------------------------------------------- 7 | -- Language Model core 8 | ------------------------------------------------------------------------------- 9 | 10 | local layer, parent = torch.class('nn.LanguageModel', 'nn.Module') 11 | function layer:__init(opt) 12 | parent.__init(self) 13 | 14 | -- options for core network 15 | self.vocab_size = utils.getopt(opt, 'vocab_size') -- required 16 | self.input_encoding_size = utils.getopt(opt, 'input_encoding_size') 17 | self.rnn_size = utils.getopt(opt, 'rnn_size') 18 | self.num_layers = utils.getopt(opt, 'num_layers', 1) 19 | local dropout = utils.getopt(opt, 'dropout', 0) 20 | -- options for Language Model 21 | self.seq_length = utils.getopt(opt, 'seq_length') 22 | -- create the core lstm network. 
note +1 for both the START and END tokens 23 | self.core = LSTM.lstm(self.input_encoding_size, self.vocab_size + 1, self.rnn_size, self.num_layers, dropout) 24 | self.lookup_table = nn.LookupTable(self.vocab_size + 1, self.input_encoding_size) 25 | self:_createInitState(1) -- will be lazily resized later during forward passes 26 | end 27 | 28 | function layer:_createInitState(batch_size) 29 | assert(batch_size ~= nil, 'batch size must be provided') 30 | -- construct the initial state for the LSTM 31 | if not self.init_state then self.init_state = {} end -- lazy init 32 | for h=1,self.num_layers*2 do 33 | -- note, the init state Must be zeros because we are using init_state to init grads in backward call too 34 | if self.init_state[h] then 35 | if self.init_state[h]:size(1) ~= batch_size then 36 | self.init_state[h]:resize(batch_size, self.rnn_size):zero() -- expand the memory 37 | end 38 | else 39 | self.init_state[h] = torch.zeros(batch_size, self.rnn_size) 40 | end 41 | end 42 | self.num_state = #self.init_state 43 | end 44 | 45 | function layer:createClones() 46 | -- construct the net clones 47 | print('constructing clones inside the LanguageModel') 48 | self.clones = {self.core} 49 | self.lookup_tables = {self.lookup_table} 50 | for t=2,self.seq_length+2 do 51 | self.clones[t] = self.core:clone('weight', 'bias', 'gradWeight', 'gradBias') 52 | self.lookup_tables[t] = self.lookup_table:clone('weight', 'gradWeight') 53 | end 54 | end 55 | 56 | function layer:getModulesList() 57 | return {self.core, self.lookup_table} 58 | end 59 | 60 | function layer:parameters() 61 | -- we only have two internal modules, return their params 62 | local p1,g1 = self.core:parameters() 63 | local p2,g2 = self.lookup_table:parameters() 64 | 65 | local params = {} 66 | for k,v in pairs(p1) do table.insert(params, v) end 67 | for k,v in pairs(p2) do table.insert(params, v) end 68 | 69 | local grad_params = {} 70 | for k,v in pairs(g1) do table.insert(grad_params, v) end 71 | for k,v in pairs(g2) do table.insert(grad_params, v) end 72 | 73 | -- todo: invalidate self.clones if params were requested? 74 | -- what if someone outside of us decided to call getParameters() or something? 75 | -- (that would destroy our parameter sharing because clones 2...end would point to old memory) 76 | 77 | return params, grad_params 78 | end 79 | 80 | function layer:training() 81 | if self.clones == nil then self:createClones() end -- create these lazily if needed 82 | for k,v in pairs(self.clones) do v:training() end 83 | for k,v in pairs(self.lookup_tables) do v:training() end 84 | end 85 | 86 | function layer:evaluate() 87 | if self.clones == nil then self:createClones() end -- create these lazily if needed 88 | for k,v in pairs(self.clones) do v:evaluate() end 89 | for k,v in pairs(self.lookup_tables) do v:evaluate() end 90 | end 91 | 92 | --[[ 93 | takes a batch of images and runs the model forward in sampling mode 94 | Careful: make sure model is in :evaluate() mode if you're calling this. 
95 | Returns: a DxN LongTensor with integer elements 1..M, 96 | where D is sequence length and N is batch (so columns are sequences) 97 | --]] 98 | function layer:sample(imgs, opt) 99 | local sample_max = utils.getopt(opt, 'sample_max', 1) 100 | local beam_size = utils.getopt(opt, 'beam_size', 1) 101 | local temperature = utils.getopt(opt, 'temperature', 1.0) 102 | if sample_max == 1 and beam_size > 1 then return self:sample_beam(imgs, opt) end -- indirection for beam search 103 | 104 | local batch_size = imgs:size(1) 105 | self:_createInitState(batch_size) 106 | local state = self.init_state 107 | 108 | -- we will write output predictions into tensor seq 109 | local seq = torch.LongTensor(self.seq_length, batch_size):zero() 110 | local seqLogprobs = torch.FloatTensor(self.seq_length, batch_size) 111 | local logprobs -- logprobs predicted in last time step 112 | for t=1,self.seq_length+2 do 113 | 114 | local xt, it, sampleLogprobs 115 | if t == 1 then 116 | -- feed in the images 117 | xt = imgs 118 | elseif t == 2 then 119 | -- feed in the start tokens 120 | it = torch.LongTensor(batch_size):fill(self.vocab_size+1) 121 | xt = self.lookup_table:forward(it) 122 | else 123 | -- take predictions from previous time step and feed them in 124 | if sample_max == 1 then 125 | -- use argmax "sampling" 126 | sampleLogprobs, it = torch.max(logprobs, 2) 127 | it = it:view(-1):long() 128 | else 129 | -- sample from the distribution of previous predictions 130 | local prob_prev 131 | if temperature == 1.0 then 132 | prob_prev = torch.exp(logprobs) -- fetch prev distribution: shape Nx(M+1) 133 | else 134 | -- scale logprobs by temperature 135 | prob_prev = torch.exp(torch.div(logprobs, temperature)) 136 | end 137 | it = torch.multinomial(prob_prev, 1) 138 | sampleLogprobs = logprobs:gather(2, it) -- gather the logprobs at sampled positions 139 | it = it:view(-1):long() -- and flatten indices for downstream processing 140 | end 141 | xt = self.lookup_table:forward(it) 142 | end 143 | 144 | if t >= 3 then 145 | seq[t-2] = it -- record the samples 146 | seqLogprobs[t-2] = sampleLogprobs:view(-1):float() -- and also their log likelihoods 147 | end 148 | 149 | local inputs = {xt,unpack(state)} 150 | local out = self.core:forward(inputs) 151 | logprobs = out[self.num_state+1] -- last element is the output vector 152 | state = {} 153 | for i=1,self.num_state do table.insert(state, out[i]) end 154 | end 155 | 156 | -- return the samples and their log likelihoods 157 | return seq, seqLogprobs 158 | end 159 | 160 | --[[ 161 | Implements beam search. Really tricky indexing stuff going on inside. 162 | Not 100% sure it's correct, and hard to fully unit test to satisfaction, but 163 | it seems to work, doesn't crash, gives expected looking outputs, and seems to 164 | improve performance, so I am declaring this correct. 165 | ]]-- 166 | function layer:sample_beam(imgs, opt) 167 | local beam_size = utils.getopt(opt, 'beam_size', 10) 168 | local batch_size, feat_dim = imgs:size(1), imgs:size(2) 169 | local function compare(a,b) return a.p > b.p end -- used downstream 170 | 171 | assert(beam_size <= self.vocab_size+1, 'lets assume this for now, otherwise this corner case causes a few headaches down the road. 
can be dealt with in future if needed') 172 | 173 | local seq = torch.LongTensor(self.seq_length, batch_size):zero() 174 | local seqLogprobs = torch.FloatTensor(self.seq_length, batch_size) 175 | -- lets process every image independently for now, for simplicity 176 | for k=1,batch_size do 177 | 178 | -- create initial states for all beams 179 | self:_createInitState(beam_size) 180 | local state = self.init_state 181 | 182 | -- we will write output predictions into tensor seq 183 | local beam_seq = torch.LongTensor(self.seq_length, beam_size):zero() 184 | local beam_seq_logprobs = torch.FloatTensor(self.seq_length, beam_size):zero() 185 | local beam_logprobs_sum = torch.zeros(beam_size) -- running sum of logprobs for each beam 186 | local logprobs -- logprobs predicted in last time step, shape (beam_size, vocab_size+1) 187 | local done_beams = {} 188 | for t=1,self.seq_length+2 do 189 | 190 | local xt, it, sampleLogprobs 191 | local new_state 192 | if t == 1 then 193 | -- feed in the images 194 | local imgk = imgs[{ {k,k} }]:expand(beam_size, feat_dim) -- k'th image feature expanded out 195 | xt = imgk 196 | elseif t == 2 then 197 | -- feed in the start tokens 198 | it = torch.LongTensor(beam_size):fill(self.vocab_size+1) 199 | xt = self.lookup_table:forward(it) 200 | else 201 | --[[ 202 | perform a beam merge. that is, 203 | for every previous beam we now many new possibilities to branch out 204 | we need to resort our beams to maintain the loop invariant of keeping 205 | the top beam_size most likely sequences. 206 | ]]-- 207 | local logprobsf = logprobs:float() -- lets go to CPU for more efficiency in indexing operations 208 | ys,ix = torch.sort(logprobsf,2,true) -- sorted array of logprobs along each previous beam (last true = descending) 209 | local candidates = {} 210 | local cols = math.min(beam_size,ys:size(2)) 211 | local rows = beam_size 212 | if t == 3 then rows = 1 end -- at first time step only the first beam is active 213 | for c=1,cols do -- for each column (word, essentially) 214 | for q=1,rows do -- for each beam expansion 215 | -- compute logprob of expanding beam q with word in (sorted) position c 216 | local local_logprob = ys[{ q,c }] 217 | local candidate_logprob = beam_logprobs_sum[q] + local_logprob 218 | table.insert(candidates, {c=ix[{ q,c }], q=q, p=candidate_logprob, r=local_logprob }) 219 | end 220 | end 221 | table.sort(candidates, compare) -- find the best c,q pairs 222 | 223 | -- construct new beams 224 | new_state = net_utils.clone_list(state) 225 | local beam_seq_prev, beam_seq_logprobs_prev 226 | if t > 3 then 227 | -- well need these as reference when we fork beams around 228 | beam_seq_prev = beam_seq[{ {1,t-3}, {} }]:clone() 229 | beam_seq_logprobs_prev = beam_seq_logprobs[{ {1,t-3}, {} }]:clone() 230 | end 231 | for vix=1,beam_size do 232 | local v = candidates[vix] 233 | -- fork beam index q into index vix 234 | if t > 3 then 235 | beam_seq[{ {1,t-3}, vix }] = beam_seq_prev[{ {}, v.q }] 236 | beam_seq_logprobs[{ {1,t-3}, vix }] = beam_seq_logprobs_prev[{ {}, v.q }] 237 | end 238 | -- rearrange recurrent states 239 | for state_ix = 1,#new_state do 240 | -- copy over state in previous beam q to new beam at vix 241 | new_state[state_ix][vix] = state[state_ix][v.q] 242 | end 243 | -- append new end terminal at the end of this beam 244 | beam_seq[{ t-2, vix }] = v.c -- c'th word is the continuation 245 | beam_seq_logprobs[{ t-2, vix }] = v.r -- the raw logprob here 246 | beam_logprobs_sum[vix] = v.p -- the new (sum) logprob along this beam 247 | 248 | if 
v.c == self.vocab_size+1 or t == self.seq_length+2 then 249 | -- END token special case here, or we reached the end. 250 | -- add the beam to a set of done beams 251 | table.insert(done_beams, {seq = beam_seq[{ {}, vix }]:clone(), 252 | logps = beam_seq_logprobs[{ {}, vix }]:clone(), 253 | p = beam_logprobs_sum[vix] 254 | }) 255 | end 256 | end 257 | 258 | -- encode as vectors 259 | it = beam_seq[t-2] 260 | xt = self.lookup_table:forward(it) 261 | end 262 | 263 | if new_state then state = new_state end -- swap rnn state, if we reassinged beams 264 | 265 | local inputs = {xt,unpack(state)} 266 | local out = self.core:forward(inputs) 267 | logprobs = out[self.num_state+1] -- last element is the output vector 268 | state = {} 269 | for i=1,self.num_state do table.insert(state, out[i]) end 270 | end 271 | 272 | table.sort(done_beams, compare) 273 | seq[{ {}, k }] = done_beams[1].seq -- the first beam has highest cumulative score 274 | seqLogprobs[{ {}, k }] = done_beams[1].logps 275 | end 276 | 277 | -- return the samples and their log likelihoods 278 | return seq, seqLogprobs 279 | end 280 | 281 | --[[ 282 | input is a tuple of: 283 | 1. torch.Tensor of size NxK (K is dim of image code) 284 | 2. torch.LongTensor of size DxN, elements 1..M 285 | where M = opt.vocab_size and D = opt.seq_length 286 | 287 | returns a (D+2)xNx(M+1) Tensor giving (normalized) log probabilities for the 288 | next token at every iteration of the LSTM (+2 because +1 for first dummy 289 | img forward, and another +1 because of START/END tokens shift) 290 | --]] 291 | function layer:updateOutput(input) 292 | local imgs = input[1] 293 | local seq = input[2] 294 | if self.clones == nil then self:createClones() end -- lazily create clones on first forward pass 295 | 296 | assert(seq:size(1) == self.seq_length) 297 | local batch_size = seq:size(2) 298 | self.output:resize(self.seq_length+2, batch_size, self.vocab_size+1) 299 | 300 | self:_createInitState(batch_size) 301 | 302 | self.state = {[0] = self.init_state} 303 | self.inputs = {} 304 | self.lookup_tables_inputs = {} 305 | self.tmax = 0 -- we will keep track of max sequence length encountered in the data for efficiency 306 | for t=1,self.seq_length+2 do 307 | 308 | local can_skip = false 309 | local xt 310 | if t == 1 then 311 | -- feed in the images 312 | xt = imgs -- NxK sized input 313 | elseif t == 2 then 314 | -- feed in the start tokens 315 | local it = torch.LongTensor(batch_size):fill(self.vocab_size+1) 316 | self.lookup_tables_inputs[t] = it 317 | xt = self.lookup_tables[t]:forward(it) -- NxK sized input (token embedding vectors) 318 | else 319 | -- feed in the rest of the sequence... 320 | local it = seq[t-2]:clone() 321 | if torch.sum(it) == 0 then 322 | -- computational shortcut for efficiency. All sequences have already terminated and only 323 | -- contain null tokens from here on. We can skip the rest of the forward pass and save time 324 | can_skip = true 325 | end 326 | --[[ 327 | seq may contain zeros as null tokens, make sure we take them out to any arbitrary token 328 | that won't make lookup_table crash with an error. 329 | token #1 will do, arbitrarily. This will be ignored anyway 330 | because we will carefully set the loss to zero at these places 331 | in the criterion, so computation based on this value will be noop for the optimization. 
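(These zeros come from the preprocessing step, which zero-pads captions shorter than
seq_length, so a column can run out of real tokens before the loop reaches seq_length+2.)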
332 | --]] 333 | it[torch.eq(it,0)] = 1 334 | 335 | if not can_skip then 336 | self.lookup_tables_inputs[t] = it 337 | xt = self.lookup_tables[t]:forward(it) 338 | end 339 | end 340 | 341 | if not can_skip then 342 | -- construct the inputs 343 | self.inputs[t] = {xt,unpack(self.state[t-1])} 344 | -- forward the network 345 | local out = self.clones[t]:forward(self.inputs[t]) 346 | -- process the outputs 347 | self.output[t] = out[self.num_state+1] -- last element is the output vector 348 | self.state[t] = {} -- the rest is state 349 | for i=1,self.num_state do table.insert(self.state[t], out[i]) end 350 | self.tmax = t 351 | end 352 | end 353 | 354 | return self.output 355 | end 356 | 357 | --[[ 358 | gradOutput is an (D+2)xNx(M+1) Tensor. 359 | --]] 360 | function layer:updateGradInput(input, gradOutput) 361 | local dimgs -- grad on input images 362 | 363 | -- go backwards and lets compute gradients 364 | local dstate = {[self.tmax] = self.init_state} -- this works when init_state is all zeros 365 | for t=self.tmax,1,-1 do 366 | -- concat state gradients and output vector gradients at time step t 367 | local dout = {} 368 | for k=1,#dstate[t] do table.insert(dout, dstate[t][k]) end 369 | table.insert(dout, gradOutput[t]) 370 | local dinputs = self.clones[t]:backward(self.inputs[t], dout) 371 | -- split the gradient to xt and to state 372 | local dxt = dinputs[1] -- first element is the input vector 373 | dstate[t-1] = {} -- copy over rest to state grad 374 | for k=2,self.num_state+1 do table.insert(dstate[t-1], dinputs[k]) end 375 | 376 | -- continue backprop of xt 377 | if t == 1 then 378 | dimgs = dxt 379 | else 380 | local it = self.lookup_tables_inputs[t] 381 | self.lookup_tables[t]:backward(it, dxt) -- backprop into lookup table 382 | end 383 | end 384 | 385 | -- we have gradient on image, but for LongTensor gt sequence we only create an empty tensor - can't backprop 386 | self.gradInput = {dimgs, torch.Tensor()} 387 | return self.gradInput 388 | end 389 | 390 | ------------------------------------------------------------------------------- 391 | -- Language Model-aware Criterion 392 | ------------------------------------------------------------------------------- 393 | 394 | local crit, parent = torch.class('nn.LanguageModelCriterion', 'nn.Criterion') 395 | function crit:__init() 396 | parent.__init(self) 397 | end 398 | 399 | --[[ 400 | input is a Tensor of size (D+2)xNx(M+1) 401 | seq is a LongTensor of size DxN. The way we infer the target 402 | in this criterion is as follows: 403 | - at first time step the output is ignored (loss = 0). It's the image tick 404 | - the label sequence "seq" is shifted by one to produce targets 405 | - at last time step the output is always the special END token (last dimension) 406 | The criterion must be able to accomodate variably-sized sequences by making sure 407 | the gradients are properly set to zeros where appropriate. 
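A small worked example of the target construction implemented below: with D = 4 and one
column of seq equal to {3, 7, 0, 0}, the targets are t=1 ignored (image tick), t=2 -> 3,
t=3 -> 7, t=4 -> END (the first null is remapped to the last index M+1), and t=5, t=6
contribute no loss or gradient.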
408 | --]] 409 | function crit:updateOutput(input, seq) 410 | self.gradInput:resizeAs(input):zero() -- reset to zeros 411 | local L,N,Mp1 = input:size(1), input:size(2), input:size(3) 412 | local D = seq:size(1) 413 | assert(D == L-2, 'input Tensor should be 2 larger in time') 414 | 415 | local loss = 0 416 | local n = 0 417 | for b=1,N do -- iterate over batches 418 | local first_time = true 419 | for t=2,L do -- iterate over sequence time (ignore t=1, dummy forward for the image) 420 | 421 | -- fetch the index of the next token in the sequence 422 | local target_index 423 | if t-1 > D then -- we are out of bounds of the index sequence: pad with null tokens 424 | target_index = 0 425 | else 426 | target_index = seq[{t-1,b}] -- t-1 is correct, since at t=2 START token was fed in and we want to predict first word (and 2-1 = 1). 427 | end 428 | -- the first time we see null token as next index, actually want the model to predict the END token 429 | if target_index == 0 and first_time then 430 | target_index = Mp1 431 | first_time = false 432 | end 433 | 434 | -- if there is a non-null next token, enforce loss! 435 | if target_index ~= 0 then 436 | -- accumulate loss 437 | loss = loss - input[{ t,b,target_index }] -- log(p) 438 | self.gradInput[{ t,b,target_index }] = -1 439 | n = n + 1 440 | end 441 | 442 | end 443 | end 444 | self.output = loss / n -- normalize by number of predictions that were made 445 | self.gradInput:div(n) 446 | return self.output 447 | end 448 | 449 | function crit:updateGradInput(input, seq) 450 | return self.gradInput 451 | end 452 | -------------------------------------------------------------------------------- /misc/call_python_caption_eval.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cd coco-caption 4 | python myeval.py $1 5 | cd ../ 6 | -------------------------------------------------------------------------------- /misc/gradcheck.lua: -------------------------------------------------------------------------------- 1 | local gradcheck = {} 2 | 3 | 4 | function gradcheck.relative_error(x, y, h) 5 | h = h or 1e-12 6 | if torch.isTensor(x) and torch.isTensor(y) then 7 | local top = torch.abs(x - y) 8 | local bottom = torch.cmax(torch.abs(x) + torch.abs(y), h) 9 | return torch.max(torch.cdiv(top, bottom)) 10 | else 11 | return math.abs(x - y) / math.max(math.abs(x) + math.abs(y), h) 12 | end 13 | end 14 | 15 | 16 | function gradcheck.numeric_gradient(f, x, df, eps) 17 | df = df or 1.0 18 | eps = eps or 1e-8 19 | local n = x:nElement() 20 | local x_flat = x:view(n) 21 | local dx_num = x.new(#x):zero() 22 | local dx_num_flat = dx_num:view(n) 23 | for i = 1, n do 24 | local orig = x_flat[i] 25 | 26 | x_flat[i] = orig + eps 27 | local pos = f(x) 28 | if torch.isTensor(df) then 29 | pos = pos:clone() 30 | end 31 | 32 | x_flat[i] = orig - eps 33 | local neg = f(x) 34 | if torch.isTensor(df) then 35 | neg = neg:clone() 36 | end 37 | 38 | local d = nil 39 | if torch.isTensor(df) then 40 | d = torch.dot(pos - neg, df) / (2 * eps) 41 | else 42 | d = df * (pos - neg) / (2 * eps) 43 | end 44 | 45 | dx_num_flat[i] = d 46 | x_flat[i] = orig 47 | end 48 | return dx_num 49 | end 50 | 51 | 52 | --[[ 53 | Inputs: 54 | - f is a function that takes a tensor and returns a scalar 55 | - x is the point at which to evalute f 56 | - dx is the analytic gradient of f at x 57 | --]] 58 | function gradcheck.check_random_dims(f, x, dx, eps, num_iterations, verbose) 59 | if verbose == nil then verbose = false end 60 | eps = eps or 
1e-4 61 | 62 | local x_flat = x:view(-1) 63 | local dx_flat = dx:view(-1) 64 | 65 | local relative_errors = torch.Tensor(num_iterations) 66 | 67 | for t = 1, num_iterations do 68 | -- Make sure the index is really random. 69 | -- We have to call this on the inner loop because some functions 70 | -- f may be stochastic, and eliminating their internal randomness for 71 | -- gradient checking by setting a manual seed. If this is the case, 72 | -- then we will always sample the same index unless we reseed on each 73 | -- iteration. 74 | torch.seed() 75 | local i = torch.random(x:nElement()) 76 | 77 | local orig = x_flat[i] 78 | x_flat[i] = orig + eps 79 | local pos = f(x) 80 | 81 | x_flat[i] = orig - eps 82 | local neg = f(x) 83 | local d_numeric = (pos - neg) / (2 * eps) 84 | local d_analytic = dx_flat[i] 85 | 86 | x_flat[i] = orig 87 | 88 | local rel_error = gradcheck.relative_error(d_numeric, d_analytic) 89 | relative_errors[t] = rel_error 90 | if verbose then 91 | print(string.format(' Iteration %d / %d, error = %f', 92 | t, num_iterations, rel_error)) 93 | print(string.format(' %f %f', d_numeric, d_analytic)) 94 | end 95 | end 96 | return relative_errors 97 | end 98 | 99 | 100 | return gradcheck 101 | 102 | -------------------------------------------------------------------------------- /misc/net_utils.lua: -------------------------------------------------------------------------------- 1 | local utils = require 'misc.utils' 2 | local net_utils = {} 3 | 4 | -- take a raw CNN from Caffe and perform surgery. Note: VGG-16 SPECIFIC! 5 | function net_utils.build_cnn(cnn, opt) 6 | local layer_num = utils.getopt(opt, 'layer_num', 38) 7 | local backend = utils.getopt(opt, 'backend', 'cudnn') 8 | local encoding_size = utils.getopt(opt, 'encoding_size', 512) 9 | 10 | if backend == 'cudnn' then 11 | require 'cudnn' 12 | backend = cudnn 13 | elseif backend == 'nn' then 14 | require 'nn' 15 | backend = nn 16 | else 17 | error(string.format('Unrecognized backend "%s"', backend)) 18 | end 19 | 20 | -- copy over the first layer_num layers of the CNN 21 | local cnn_part = nn.Sequential() 22 | for i = 1, layer_num do 23 | local layer = cnn:get(i) 24 | 25 | if i == 1 then 26 | -- convert kernels in first conv layer into RGB format instead of BGR, 27 | -- which is the order in which it was trained in Caffe 28 | local w = layer.weight:clone() 29 | -- swap weights to R and B channels 30 | print('converting first layer conv filters from BGR to RGB...') 31 | layer.weight[{ {}, 1, {}, {} }]:copy(w[{ {}, 3, {}, {} }]) 32 | layer.weight[{ {}, 3, {}, {} }]:copy(w[{ {}, 1, {}, {} }]) 33 | end 34 | 35 | cnn_part:add(layer) 36 | end 37 | 38 | cnn_part:add(nn.Linear(4096,encoding_size)) 39 | cnn_part:add(backend.ReLU(true)) 40 | return cnn_part 41 | end 42 | 43 | -- takes a batch of images and preprocesses them 44 | -- VGG-16 network is hardcoded, as is 224 as size to forward 45 | function net_utils.prepro(imgs, data_augment, on_gpu) 46 | assert(data_augment ~= nil, 'pass this in. careful here.') 47 | assert(on_gpu ~= nil, 'pass this in. careful here.') 48 | 49 | local h,w = imgs:size(3), imgs:size(4) 50 | local cnn_input_size = 224 51 | 52 | -- cropping data augmentation, if needed 53 | if h > cnn_input_size or w > cnn_input_size then 54 | local xoff, yoff 55 | if data_augment then 56 | xoff, yoff = torch.random(w-cnn_input_size), torch.random(h-cnn_input_size) 57 | else 58 | -- sample the center 59 | xoff, yoff = math.ceil((w-cnn_input_size)/2), math.ceil((h-cnn_input_size)/2) 60 | end 61 | -- crop. 
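-- (the indexing below takes an inclusive, Lua-style cnn_input_size x cnn_input_size window at offset (yoff, xoff))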
62 | imgs = imgs[{ {}, {}, {yoff,yoff+cnn_input_size-1}, {xoff,xoff+cnn_input_size-1} }] 63 | end 64 | 65 | -- ship to gpu or convert from byte to float 66 | if on_gpu then imgs = imgs:cuda() else imgs = imgs:float() end 67 | 68 | -- lazily instantiate vgg_mean 69 | if not net_utils.vgg_mean then 70 | net_utils.vgg_mean = torch.FloatTensor{123.68, 116.779, 103.939}:view(1,3,1,1) -- in RGB order 71 | end 72 | net_utils.vgg_mean = net_utils.vgg_mean:typeAs(imgs) -- a noop if the types match 73 | 74 | -- subtract vgg mean 75 | imgs:add(-1, net_utils.vgg_mean:expandAs(imgs)) 76 | 77 | return imgs 78 | end 79 | 80 | -- layer that expands features out so we can forward multiple sentences per image 81 | local layer, parent = torch.class('nn.FeatExpander', 'nn.Module') 82 | function layer:__init(n) 83 | parent.__init(self) 84 | self.n = n 85 | end 86 | function layer:updateOutput(input) 87 | if self.n == 1 then self.output = input; return self.output end -- act as a noop for efficiency 88 | -- simply expands out the features. Performs a copy information 89 | assert(input:nDimension() == 2) 90 | local d = input:size(2) 91 | self.output:resize(input:size(1)*self.n, d) 92 | for k=1,input:size(1) do 93 | local j = (k-1)*self.n+1 94 | self.output[{ {j,j+self.n-1} }] = input[{ {k,k}, {} }]:expand(self.n, d) -- copy over 95 | end 96 | return self.output 97 | end 98 | function layer:updateGradInput(input, gradOutput) 99 | if self.n == 1 then self.gradInput = gradOutput; return self.gradInput end -- act as noop for efficiency 100 | -- add up the gradients for each block of expanded features 101 | self.gradInput:resizeAs(input) 102 | local d = input:size(2) 103 | for k=1,input:size(1) do 104 | local j = (k-1)*self.n+1 105 | self.gradInput[k] = torch.sum(gradOutput[{ {j,j+self.n-1} }], 1) 106 | end 107 | return self.gradInput 108 | end 109 | 110 | function net_utils.list_nngraph_modules(g) 111 | local omg = {} 112 | for i,node in ipairs(g.forwardnodes) do 113 | local m = node.data.module 114 | if m then 115 | table.insert(omg, m) 116 | end 117 | end 118 | return omg 119 | end 120 | function net_utils.listModules(net) 121 | -- torch, our relationship is a complicated love/hate thing. And right here it's the latter 122 | local t = torch.type(net) 123 | local moduleList 124 | if t == 'nn.gModule' then 125 | moduleList = net_utils.list_nngraph_modules(net) 126 | else 127 | moduleList = net:listModules() 128 | end 129 | return moduleList 130 | end 131 | function net_utils.sanitize_gradients(net) 132 | local moduleList = net_utils.listModules(net) 133 | for k,m in ipairs(moduleList) do 134 | if m.weight and m.gradWeight then 135 | --print('sanitizing gradWeight in of size ' .. m.gradWeight:nElement()) 136 | --print(m.weight:size()) 137 | m.gradWeight = nil 138 | end 139 | if m.bias and m.gradBias then 140 | --print('sanitizing gradWeight in of size ' .. m.gradBias:nElement()) 141 | --print(m.bias:size()) 142 | m.gradBias = nil 143 | end 144 | end 145 | end 146 | 147 | function net_utils.unsanitize_gradients(net) 148 | local moduleList = net_utils.listModules(net) 149 | for k,m in ipairs(moduleList) do 150 | if m.weight and (not m.gradWeight) then 151 | m.gradWeight = m.weight:clone():zero() 152 | --print('unsanitized gradWeight in of size ' .. m.gradWeight:nElement()) 153 | --print(m.weight:size()) 154 | end 155 | if m.bias and (not m.gradBias) then 156 | m.gradBias = m.bias:clone():zero() 157 | --print('unsanitized gradWeight in of size ' .. 
m.gradBias:nElement()) 158 | --print(m.bias:size()) 159 | end 160 | end 161 | end 162 | 163 | --[[ 164 | take a LongTensor of size DxN with elements 1..vocab_size+1 165 | (where last dimension is END token), and decode it into table of raw text sentences. 166 | each column is a sequence. ix_to_word gives the mapping to strings, as a table 167 | --]] 168 | function net_utils.decode_sequence(ix_to_word, seq) 169 | local D,N = seq:size(1), seq:size(2) 170 | local out = {} 171 | for i=1,N do 172 | local txt = '' 173 | for j=1,D do 174 | local ix = seq[{j,i}] 175 | local word = ix_to_word[tostring(ix)] 176 | if not word then break end -- END token, likely. Or null token 177 | if j >= 2 then txt = txt .. ' ' end 178 | txt = txt .. word 179 | end 180 | table.insert(out, txt) 181 | end 182 | return out 183 | end 184 | 185 | function net_utils.clone_list(lst) 186 | -- takes list of tensors, clone all 187 | local new = {} 188 | for k,v in pairs(lst) do 189 | new[k] = v:clone() 190 | end 191 | return new 192 | end 193 | 194 | -- hiding this piece of code on the bottom of the file, in hopes that 195 | -- noone will ever find it. Lets just pretend it doesn't exist 196 | function net_utils.language_eval(predictions, id) 197 | -- this is gross, but we have to call coco python code. 198 | -- Not my favorite kind of thing, but here we go 199 | local out_struct = {val_predictions = predictions} 200 | utils.write_json('coco-caption/val' .. id .. '.json', out_struct) -- serialize to json (ew, so gross) 201 | os.execute('./misc/call_python_caption_eval.sh val' .. id .. '.json') -- i'm dying over here 202 | local result_struct = utils.read_json('coco-caption/val' .. id .. '.json_out.json') -- god forgive me 203 | return result_struct 204 | end 205 | 206 | return net_utils 207 | -------------------------------------------------------------------------------- /misc/optim_updates.lua: -------------------------------------------------------------------------------- 1 | 2 | -- optim, simple as it should be, written from scratch. 
That's how I roll 3 | 4 | function sgd(x, dx, lr) 5 | x:add(-lr, dx) 6 | end 7 | 8 | function sgdm(x, dx, lr, alpha, state) 9 | -- sgd with momentum, standard update 10 | if not state.v then 11 | state.v = x.new(#x):zero() 12 | end 13 | state.v:mul(alpha) 14 | state.v:add(lr, dx) 15 | x:add(-1, state.v) 16 | end 17 | 18 | function sgdmom(x, dx, lr, alpha, state) 19 | -- sgd momentum, uses nesterov update (reference: http://cs231n.github.io/neural-networks-3/#sgd) 20 | if not state.m then 21 | state.m = x.new(#x):zero() 22 | state.tmp = x.new(#x) 23 | end 24 | state.tmp:copy(state.m) 25 | state.m:mul(alpha):add(-lr, dx) 26 | x:add(-alpha, state.tmp) 27 | x:add(1+alpha, state.m) 28 | end 29 | 30 | function adagrad(x, dx, lr, epsilon, state) 31 | if not state.m then 32 | state.m = x.new(#x):zero() 33 | state.tmp = x.new(#x) 34 | end 35 | -- calculate new mean squared values 36 | state.m:addcmul(1.0, dx, dx) 37 | -- perform update 38 | state.tmp:sqrt(state.m):add(epsilon) 39 | x:addcdiv(-lr, dx, state.tmp) 40 | end 41 | 42 | -- rmsprop implementation, simple as it should be 43 | function rmsprop(x, dx, lr, alpha, epsilon, state) 44 | if not state.m then 45 | state.m = x.new(#x):zero() 46 | state.tmp = x.new(#x) 47 | end 48 | -- calculate new (leaky) mean squared values 49 | state.m:mul(alpha) 50 | state.m:addcmul(1.0-alpha, dx, dx) 51 | -- perform update 52 | state.tmp:sqrt(state.m):add(epsilon) 53 | x:addcdiv(-lr, dx, state.tmp) 54 | end 55 | 56 | function adam(x, dx, lr, beta1, beta2, epsilon, state) 57 | local beta1 = beta1 or 0.9 58 | local beta2 = beta2 or 0.999 59 | local epsilon = epsilon or 1e-8 60 | 61 | if not state.m then 62 | -- Initialization 63 | state.t = 0 64 | -- Exponential moving average of gradient values 65 | state.m = x.new(#dx):zero() 66 | -- Exponential moving average of squared gradient values 67 | state.v = x.new(#dx):zero() 68 | -- A tmp tensor to hold the sqrt(v) + epsilon 69 | state.tmp = x.new(#dx):zero() 70 | end 71 | 72 | -- Decay the first and second moment running average coefficient 73 | state.m:mul(beta1):add(1-beta1, dx) 74 | state.v:mul(beta2):addcmul(1-beta2, dx, dx) 75 | state.tmp:copy(state.v):sqrt():add(epsilon) 76 | 77 | state.t = state.t + 1 78 | local biasCorrection1 = 1 - beta1^state.t 79 | local biasCorrection2 = 1 - beta2^state.t 80 | local stepSize = lr * math.sqrt(biasCorrection2)/biasCorrection1 81 | 82 | -- perform update 83 | x:addcdiv(-stepSize, state.m, state.tmp) 84 | end 85 | -------------------------------------------------------------------------------- /misc/utils.lua: -------------------------------------------------------------------------------- 1 | local cjson = require 'cjson' 2 | local utils = {} 3 | 4 | -- Assume required if default_value is nil 5 | function utils.getopt(opt, key, default_value) 6 | if default_value == nil and (opt == nil or opt[key] == nil) then 7 | error('error: required key ' .. key .. 
' was not provided in an opt.') 8 | end 9 | if opt == nil then return default_value end 10 | local v = opt[key] 11 | if v == nil then v = default_value end 12 | return v 13 | end 14 | 15 | function utils.read_json(path) 16 | local file = io.open(path, 'r') 17 | local text = file:read() 18 | file:close() 19 | local info = cjson.decode(text) 20 | return info 21 | end 22 | 23 | function utils.write_json(path, j) 24 | -- API reference http://www.kyne.com.au/~mark/software/lua-cjson-manual.html#encode 25 | cjson.encode_sparse_array(true, 2, 10) 26 | local text = cjson.encode(j) 27 | local file = io.open(path, 'w') 28 | file:write(text) 29 | file:close() 30 | end 31 | 32 | -- dicts is a list of tables of k:v pairs, create a single 33 | -- k:v table that has the mean of the v's for each k 34 | -- assumes that all dicts have same keys always 35 | function utils.dict_average(dicts) 36 | local dict = {} 37 | local n = 0 38 | for i,d in pairs(dicts) do 39 | for k,v in pairs(d) do 40 | if dict[k] == nil then dict[k] = 0 end 41 | dict[k] = dict[k] + v 42 | end 43 | n=n+1 44 | end 45 | for k,v in pairs(dict) do 46 | dict[k] = dict[k] / n -- produce the average 47 | end 48 | return dict 49 | end 50 | 51 | -- seriously this is kind of ridiculous 52 | function utils.count_keys(t) 53 | local n = 0 54 | for k,v in pairs(t) do 55 | n = n + 1 56 | end 57 | return n 58 | end 59 | 60 | -- return average of all values in a table... 61 | function utils.average_values(t) 62 | local n = 0 63 | local vsum = 0 64 | for k,v in pairs(t) do 65 | vsum = vsum + v 66 | n = n + 1 67 | end 68 | return vsum / n 69 | end 70 | 71 | return utils 72 | -------------------------------------------------------------------------------- /prepro.py: -------------------------------------------------------------------------------- 1 | """ 2 | Preprocess a raw json dataset into hdf5/json files for use in data_loader.lua 3 | 4 | Input: json file that has the form 5 | [{ file_path: 'path/img.jpg', captions: ['a caption', ...] }, ...] 6 | example element in this list would look like 7 | {'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. ', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895} 8 | 9 | This script reads this json, does some basic preprocessing on the captions 10 | (e.g. lowercase, etc.), creates a special UNK token, and encodes everything to arrays 11 | 12 | Output: a json file and an hdf5 file 13 | The hdf5 file contains several fields: 14 | /images is (N,3,256,256) uint8 array of raw image data in RGB format 15 | /labels is (M,max_length) uint32 array of encoded labels, zero padded 16 | /label_start_ix and /label_end_ix are (N,) uint32 arrays of pointers to the 17 | first and last indices (in range 1..M) of labels for each image 18 | /label_length stores the length of the sequence for each of the M sequences 19 | 20 | The json file has a dict that contains: 21 | - an 'ix_to_word' field storing the vocab in form {ix:'word'}, where ix is 1-indexed 22 | - an 'images' field that is a list holding auxiliary information for each image, 23 | such as in particular the 'split' it was assigned to. 
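An example invocation (the paths here are placeholders, adjust them to your layout):

  $ python prepro.py --input_json coco/coco_raw.json --num_val 5000 --num_test 5000 \
      --images_root coco/images --word_count_threshold 5 \
      --output_json coco/cocotalk.json --output_h5 coco/cocotalk.h5

All of these flags are defined by the argparse section at the bottom of this script.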
24 | """ 25 | 26 | import os 27 | import json 28 | import argparse 29 | from random import shuffle, seed 30 | import string 31 | # non-standard dependencies: 32 | import h5py 33 | import numpy as np 34 | from scipy.misc import imread, imresize 35 | 36 | def prepro_captions(imgs): 37 | 38 | # preprocess all the captions 39 | print 'example processed tokens:' 40 | for i,img in enumerate(imgs): 41 | img['processed_tokens'] = [] 42 | for j,s in enumerate(img['captions']): 43 | txt = str(s).lower().translate(None, string.punctuation).strip().split() 44 | img['processed_tokens'].append(txt) 45 | if i < 10 and j == 0: print txt 46 | 47 | def build_vocab(imgs, params): 48 | count_thr = params['word_count_threshold'] 49 | 50 | # count up the number of words 51 | counts = {} 52 | for img in imgs: 53 | for txt in img['processed_tokens']: 54 | for w in txt: 55 | counts[w] = counts.get(w, 0) + 1 56 | cw = sorted([(count,w) for w,count in counts.iteritems()], reverse=True) 57 | print 'top words and their counts:' 58 | print '\n'.join(map(str,cw[:20])) 59 | 60 | # print some stats 61 | total_words = sum(counts.itervalues()) 62 | print 'total words:', total_words 63 | bad_words = [w for w,n in counts.iteritems() if n <= count_thr] 64 | vocab = [w for w,n in counts.iteritems() if n > count_thr] 65 | bad_count = sum(counts[w] for w in bad_words) 66 | print 'number of bad words: %d/%d = %.2f%%' % (len(bad_words), len(counts), len(bad_words)*100.0/len(counts)) 67 | print 'number of words in vocab would be %d' % (len(vocab), ) 68 | print 'number of UNKs: %d/%d = %.2f%%' % (bad_count, total_words, bad_count*100.0/total_words) 69 | 70 | # lets look at the distribution of lengths as well 71 | sent_lengths = {} 72 | for img in imgs: 73 | for txt in img['processed_tokens']: 74 | nw = len(txt) 75 | sent_lengths[nw] = sent_lengths.get(nw, 0) + 1 76 | max_len = max(sent_lengths.keys()) 77 | print 'max length sentence in raw data: ', max_len 78 | print 'sentence length distribution (count, number of words):' 79 | sum_len = sum(sent_lengths.values()) 80 | for i in xrange(max_len+1): 81 | print '%2d: %10d %f%%' % (i, sent_lengths.get(i,0), sent_lengths.get(i,0)*100.0/sum_len) 82 | 83 | # lets now produce the final annotations 84 | if bad_count > 0: 85 | # additional special UNK token we will use below to map infrequent words to 86 | print 'inserting the special UNK token' 87 | vocab.append('UNK') 88 | 89 | for img in imgs: 90 | img['final_captions'] = [] 91 | for txt in img['processed_tokens']: 92 | caption = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt] 93 | img['final_captions'].append(caption) 94 | 95 | return vocab 96 | 97 | def assign_splits(imgs, params): 98 | num_val = params['num_val'] 99 | num_test = params['num_test'] 100 | 101 | for i,img in enumerate(imgs): 102 | if i < num_val: 103 | img['split'] = 'val' 104 | elif i < num_val + num_test: 105 | img['split'] = 'test' 106 | else: 107 | img['split'] = 'train' 108 | 109 | print 'assigned %d to val, %d to test.' % (num_val, num_test) 110 | 111 | def encode_captions(imgs, params, wtoi): 112 | """ 113 | encode all captions into one large array, which will be 1-indexed. 114 | also produces label_start_ix and label_end_ix which store 1-indexed 115 | and inclusive (Lua-style) pointers to the first and last caption for 116 | each image in the dataset. 
117 | """ 118 | 119 | max_length = params['max_length'] 120 | N = len(imgs) 121 | M = sum(len(img['final_captions']) for img in imgs) # total number of captions 122 | 123 | label_arrays = [] 124 | label_start_ix = np.zeros(N, dtype='uint32') # note: these will be one-indexed 125 | label_end_ix = np.zeros(N, dtype='uint32') 126 | label_length = np.zeros(M, dtype='uint32') 127 | caption_counter = 0 128 | counter = 1 129 | for i,img in enumerate(imgs): 130 | n = len(img['final_captions']) 131 | assert n > 0, 'error: some image has no captions' 132 | 133 | Li = np.zeros((n, max_length), dtype='uint32') 134 | for j,s in enumerate(img['final_captions']): 135 | label_length[caption_counter] = min(max_length, len(s)) # record the length of this sequence 136 | caption_counter += 1 137 | for k,w in enumerate(s): 138 | if k < max_length: 139 | Li[j,k] = wtoi[w] 140 | 141 | # note: word indices are 1-indexed, and captions are padded with zeros 142 | label_arrays.append(Li) 143 | label_start_ix[i] = counter 144 | label_end_ix[i] = counter + n - 1 145 | 146 | counter += n 147 | 148 | L = np.concatenate(label_arrays, axis=0) # put all the labels together 149 | assert L.shape[0] == M, 'lengths don\'t match? that\'s weird' 150 | assert np.all(label_length > 0), 'error: some caption had no words?' 151 | 152 | print 'encoded captions to array of size ', `L.shape` 153 | return L, label_start_ix, label_end_ix, label_length 154 | 155 | def main(params): 156 | 157 | imgs = json.load(open(params['input_json'], 'r')) 158 | seed(123) # make reproducible 159 | shuffle(imgs) # shuffle the order 160 | 161 | # tokenization and preprocessing 162 | prepro_captions(imgs) 163 | 164 | # create the vocab 165 | vocab = build_vocab(imgs, params) 166 | itow = {i+1:w for i,w in enumerate(vocab)} # a 1-indexed vocab translation table 167 | wtoi = {w:i+1 for i,w in enumerate(vocab)} # inverse table 168 | 169 | # assign the splits 170 | assign_splits(imgs, params) 171 | 172 | # encode captions in large arrays, ready to ship to hdf5 file 173 | L, label_start_ix, label_end_ix, label_length = encode_captions(imgs, params, wtoi) 174 | 175 | # create output h5 file 176 | N = len(imgs) 177 | f = h5py.File(params['output_h5'], "w") 178 | f.create_dataset("labels", dtype='uint32', data=L) 179 | f.create_dataset("label_start_ix", dtype='uint32', data=label_start_ix) 180 | f.create_dataset("label_end_ix", dtype='uint32', data=label_end_ix) 181 | f.create_dataset("label_length", dtype='uint32', data=label_length) 182 | dset = f.create_dataset("images", (N,3,256,256), dtype='uint8') # space for resized images 183 | for i,img in enumerate(imgs): 184 | # load the image 185 | I = imread(os.path.join(params['images_root'], img['file_path'])) 186 | try: 187 | Ir = imresize(I, (256,256)) 188 | except: 189 | print 'failed resizing image %s - see http://git.io/vBIE0' % (img['file_path'],) 190 | raise 191 | # handle grayscale input images 192 | if len(Ir.shape) == 2: 193 | Ir = Ir[:,:,np.newaxis] 194 | Ir = np.concatenate((Ir,Ir,Ir), axis=2) 195 | # and swap order of axes from (256,256,3) to (3,256,256) 196 | Ir = Ir.transpose(2,0,1) 197 | # write to h5 198 | dset[i] = Ir 199 | if i % 1000 == 0: 200 | print 'processing %d/%d (%.2f%% done)' % (i, N, i*100.0/N) 201 | f.close() 202 | print 'wrote ', params['output_h5'] 203 | 204 | # create output json file 205 | out = {} 206 | out['ix_to_word'] = itow # encode the (1-indexed) vocab 207 | out['images'] = [] 208 | for i,img in enumerate(imgs): 209 | 210 | jimg = {} 211 | jimg['split'] = img['split'] 212 | 
if 'file_path' in img: jimg['file_path'] = img['file_path'] # copy it over, might need 213 | if 'id' in img: jimg['id'] = img['id'] # copy over & mantain an id, if present (e.g. coco ids, useful) 214 | 215 | out['images'].append(jimg) 216 | 217 | json.dump(out, open(params['output_json'], 'w')) 218 | print 'wrote ', params['output_json'] 219 | 220 | if __name__ == "__main__": 221 | 222 | parser = argparse.ArgumentParser() 223 | 224 | # input json 225 | parser.add_argument('--input_json', required=True, help='input json file to process into hdf5') 226 | parser.add_argument('--num_val', required=True, type=int, help='number of images to assign to validation data (for CV etc)') 227 | parser.add_argument('--output_json', default='data.json', help='output json file') 228 | parser.add_argument('--output_h5', default='data.h5', help='output h5 file') 229 | 230 | # options 231 | parser.add_argument('--max_length', default=16, type=int, help='max length of a caption, in number of words. captions longer than this get clipped.') 232 | parser.add_argument('--images_root', default='', help='root location in which images are stored, to be prepended to file_path in input json') 233 | parser.add_argument('--word_count_threshold', default=5, type=int, help='only words that occur more than this number of times will be put in vocab') 234 | parser.add_argument('--num_test', default=0, type=int, help='number of test images (to withold until very very end)') 235 | 236 | args = parser.parse_args() 237 | params = vars(args) # convert to ordinary dict 238 | print 'parsed input parameters:' 239 | print json.dumps(params, indent = 2) 240 | main(params) 241 | -------------------------------------------------------------------------------- /test_language_model.lua: -------------------------------------------------------------------------------- 1 | --[[ 2 | Unit tests for the LanguageModel implementation, making sure 3 | that nothing crashes, that we can overfit a small dataset 4 | and that everything gradient checks. 
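The suite is typically run from the repository root with `th test_language_model.lua`;
the CUDA forward test additionally requires cutorch and cunn to be installed.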
5 | --]] 6 | 7 | require 'torch' 8 | require 'misc.LanguageModel' 9 | 10 | local gradcheck = require 'misc.gradcheck' 11 | 12 | local tests = {} 13 | local tester = torch.Tester() 14 | 15 | -- validates the size and dimensions of a given 16 | -- tensor a to be size given in table sz 17 | function tester:assertTensorSizeEq(a, sz) 18 | tester:asserteq(a:nDimension(), #sz) 19 | for i=1,#sz do 20 | tester:asserteq(a:size(i), sz[i]) 21 | end 22 | end 23 | 24 | -- Test the API of the Language Model 25 | local function forwardApiTestFactory(dtype) 26 | if dtype == 'torch.CudaTensor' then 27 | require 'cutorch' 28 | require 'cunn' 29 | end 30 | local function f() 31 | -- create LanguageModel instance 32 | local opt = {} 33 | opt.vocab_size = 5 34 | opt.input_encoding_size = 11 35 | opt.rnn_size = 8 36 | opt.num_layers = 2 37 | opt.dropout = 0 38 | opt.seq_length = 7 39 | opt.batch_size = 10 40 | local lm = nn.LanguageModel(opt) 41 | local crit = nn.LanguageModelCriterion() 42 | lm:type(dtype) 43 | crit:type(dtype) 44 | 45 | -- construct some input to feed in 46 | local seq = torch.LongTensor(opt.seq_length, opt.batch_size):random(opt.vocab_size) 47 | -- make sure seq can be padded with zeroes and that things work ok 48 | seq[{ {4, 7}, 1 }] = 0 49 | seq[{ {5, 7}, 6 }] = 0 50 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 51 | local output = lm:forward{imgs, seq} 52 | tester:assertlt(torch.max(output:view(-1)), 0) -- log probs should be <0 53 | 54 | -- the output should be of size (seq_length + 2, batch_size, vocab_size + 1) 55 | -- where the +1 is for the special END token appended at the end. 56 | tester:assertTensorSizeEq(output, {opt.seq_length+2, opt.batch_size, opt.vocab_size+1}) 57 | 58 | local loss = crit:forward(output, seq) 59 | 60 | local gradOutput = crit:backward(output, seq) 61 | tester:assertTensorSizeEq(gradOutput, {opt.seq_length+2, opt.batch_size, opt.vocab_size+1}) 62 | 63 | -- make sure the pattern of zero gradients is as expected 64 | local gradAbs = torch.max(torch.abs(gradOutput), 3):view(opt.seq_length+2, opt.batch_size) 65 | local gradZeroMask = torch.eq(gradAbs,0) 66 | local expectedGradZeroMask = torch.ByteTensor(opt.seq_length+2,opt.batch_size):zero() 67 | expectedGradZeroMask[{ {1}, {} }]:fill(1) -- first time step should be zero grad (img was passed in) 68 | expectedGradZeroMask[{ {6,9}, 1 }]:fill(1) 69 | expectedGradZeroMask[{ {7,9}, 6 }]:fill(1) 70 | tester:assertTensorEq(gradZeroMask:float(), expectedGradZeroMask:float(), 1e-8) 71 | 72 | local gradInput = lm:backward({imgs, seq}, gradOutput) 73 | tester:assertTensorSizeEq(gradInput[1], {opt.batch_size, opt.input_encoding_size}) 74 | tester:asserteq(gradInput[2]:nElement(), 0, 'grad on seq should be empty tensor') 75 | 76 | end 77 | return f 78 | end 79 | 80 | -- test just the language model alone (without the criterion) 81 | local function gradCheckLM() 82 | 83 | local dtype = 'torch.DoubleTensor' 84 | local opt = {} 85 | opt.vocab_size = 5 86 | opt.input_encoding_size = 4 87 | opt.rnn_size = 8 88 | opt.num_layers = 2 89 | opt.dropout = 0 90 | opt.seq_length = 7 91 | opt.batch_size = 6 92 | local lm = nn.LanguageModel(opt) 93 | local crit = nn.LanguageModelCriterion() 94 | lm:type(dtype) 95 | crit:type(dtype) 96 | 97 | local seq = torch.LongTensor(opt.seq_length, opt.batch_size):random(opt.vocab_size) 98 | seq[{ {4, 7}, 1 }] = 0 99 | seq[{ {5, 7}, 4 }] = 0 100 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 101 | 102 | -- evaluate the analytic gradient 
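-- (a random weight tensor w reduces the output to the scalar loss sum(output .* w),
-- so every output element receives some gradient during this check)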
103 | local output = lm:forward{imgs, seq} 104 | local w = torch.randn(output:size(1), output:size(2), output:size(3)) 105 | -- generate random weighted sum criterion 106 | local loss = torch.sum(torch.cmul(output, w)) 107 | local gradOutput = w 108 | local gradInput, dummy = unpack(lm:backward({imgs, seq}, gradOutput)) 109 | 110 | -- create a loss function wrapper 111 | local function f(x) 112 | local output = lm:forward{x, seq} 113 | local loss = torch.sum(torch.cmul(output, w)) 114 | return loss 115 | end 116 | 117 | local gradInput_num = gradcheck.numeric_gradient(f, imgs, 1, 1e-6) 118 | 119 | -- print(gradInput) 120 | -- print(gradInput_num) 121 | -- local g = gradInput:view(-1) 122 | -- local gn = gradInput_num:view(-1) 123 | -- for i=1,g:nElement() do 124 | -- local r = gradcheck.relative_error(g[i],gn[i]) 125 | -- print(i, g[i], gn[i], r) 126 | -- end 127 | 128 | tester:assertTensorEq(gradInput, gradInput_num, 1e-4) 129 | tester:assertlt(gradcheck.relative_error(gradInput, gradInput_num, 1e-8), 1e-4) 130 | end 131 | 132 | local function gradCheck() 133 | local dtype = 'torch.DoubleTensor' 134 | local opt = {} 135 | opt.vocab_size = 5 136 | opt.input_encoding_size = 4 137 | opt.rnn_size = 8 138 | opt.num_layers = 2 139 | opt.dropout = 0 140 | opt.seq_length = 7 141 | opt.batch_size = 6 142 | local lm = nn.LanguageModel(opt) 143 | local crit = nn.LanguageModelCriterion() 144 | lm:type(dtype) 145 | crit:type(dtype) 146 | 147 | local seq = torch.LongTensor(opt.seq_length, opt.batch_size):random(opt.vocab_size) 148 | seq[{ {4, 7}, 1 }] = 0 149 | seq[{ {5, 7}, 4 }] = 0 150 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 151 | 152 | -- evaluate the analytic gradient 153 | local output = lm:forward{imgs, seq} 154 | local loss = crit:forward(output, seq) 155 | local gradOutput = crit:backward(output, seq) 156 | local gradInput, dummy = unpack(lm:backward({imgs, seq}, gradOutput)) 157 | 158 | -- create a loss function wrapper 159 | local function f(x) 160 | local output = lm:forward{x, seq} 161 | local loss = crit:forward(output, seq) 162 | return loss 163 | end 164 | 165 | local gradInput_num = gradcheck.numeric_gradient(f, imgs, 1, 1e-6) 166 | 167 | -- print(gradInput) 168 | -- print(gradInput_num) 169 | -- local g = gradInput:view(-1) 170 | -- local gn = gradInput_num:view(-1) 171 | -- for i=1,g:nElement() do 172 | -- local r = gradcheck.relative_error(g[i],gn[i]) 173 | -- print(i, g[i], gn[i], r) 174 | -- end 175 | 176 | tester:assertTensorEq(gradInput, gradInput_num, 1e-4) 177 | tester:assertlt(gradcheck.relative_error(gradInput, gradInput_num, 1e-8), 5e-4) 178 | end 179 | 180 | local function overfit() 181 | local dtype = 'torch.DoubleTensor' 182 | local opt = {} 183 | opt.vocab_size = 5 184 | opt.input_encoding_size = 7 185 | opt.rnn_size = 24 186 | opt.num_layers = 1 187 | opt.dropout = 0 188 | opt.seq_length = 7 189 | opt.batch_size = 6 190 | local lm = nn.LanguageModel(opt) 191 | local crit = nn.LanguageModelCriterion() 192 | lm:type(dtype) 193 | crit:type(dtype) 194 | 195 | local seq = torch.LongTensor(opt.seq_length, opt.batch_size):random(opt.vocab_size) 196 | seq[{ {4, 7}, 1 }] = 0 197 | seq[{ {5, 7}, 4 }] = 0 198 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 199 | 200 | local params, grad_params = lm:getParameters() 201 | print('number of parameters:', params:nElement(), grad_params:nElement()) 202 | local lstm_params = 4*(opt.input_encoding_size + opt.rnn_size)*opt.rnn_size + opt.rnn_size*4*2 203 | local 
output_params = opt.rnn_size * (opt.vocab_size + 1) + opt.vocab_size+1 204 | local table_params = (opt.vocab_size + 1) * opt.input_encoding_size 205 | local expected_params = lstm_params + output_params + table_params 206 | print('expected:', expected_params) 207 | 208 | local function lossFun() 209 | grad_params:zero() 210 | local output = lm:forward{imgs, seq} 211 | local loss = crit:forward(output, seq) 212 | local gradOutput = crit:backward(output, seq) 213 | lm:backward({imgs, seq}, gradOutput) 214 | return loss 215 | end 216 | 217 | local loss 218 | local grad_cache = grad_params:clone():fill(1e-8) 219 | print('trying to overfit the language model on toy data:') 220 | for t=1,30 do 221 | loss = lossFun() 222 | -- test that initial loss makes sense 223 | if t == 1 then tester:assertlt(math.abs(math.log(opt.vocab_size+1) - loss), 0.1) end 224 | grad_cache:addcmul(1, grad_params, grad_params) 225 | params:addcdiv(-1e-1, grad_params, torch.sqrt(grad_cache)) -- adagrad update 226 | print(string.format('iteration %d/30: loss %f', t, loss)) 227 | end 228 | -- holy crap adagrad destroys the loss function! 229 | 230 | tester:assertlt(loss, 0.2) 231 | end 232 | 233 | -- check that we can call :sample() and that correct-looking things happen 234 | local function sample() 235 | local dtype = 'torch.DoubleTensor' 236 | local opt = {} 237 | opt.vocab_size = 5 238 | opt.input_encoding_size = 4 239 | opt.rnn_size = 8 240 | opt.num_layers = 2 241 | opt.dropout = 0 242 | opt.seq_length = 7 243 | opt.batch_size = 6 244 | local lm = nn.LanguageModel(opt) 245 | 246 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 247 | local seq = lm:sample(imgs) 248 | 249 | tester:assertTensorSizeEq(seq, {opt.seq_length, opt.batch_size}) 250 | tester:asserteq(seq:type(), 'torch.LongTensor') 251 | tester:assertge(torch.min(seq), 1) 252 | tester:assertle(torch.max(seq), opt.vocab_size+1) 253 | print('\nsampled sequence:') 254 | print(seq) 255 | end 256 | 257 | 258 | -- check that we can call :sample_beam() and that correct-looking things happen 259 | -- these are not very exhaustive tests and basic sanity checks 260 | local function sample_beam() 261 | local dtype = 'torch.DoubleTensor' 262 | torch.manualSeed(1) 263 | 264 | local opt = {} 265 | opt.vocab_size = 10 266 | opt.input_encoding_size = 4 267 | opt.rnn_size = 8 268 | opt.num_layers = 1 269 | opt.dropout = 0 270 | opt.seq_length = 7 271 | opt.batch_size = 6 272 | local lm = nn.LanguageModel(opt) 273 | 274 | local imgs = torch.randn(opt.batch_size, opt.input_encoding_size):type(dtype) 275 | 276 | local seq_vanilla, logprobs_vanilla = lm:sample(imgs) 277 | local seq, logprobs = lm:sample(imgs, {beam_size = 1}) 278 | 279 | -- check some basic I/O, types, etc. 
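-- (the minimum bound below is 0 rather than 1, since beam-searched sequences keep
-- zero padding after their END token)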
280 | tester:assertTensorSizeEq(seq, {opt.seq_length, opt.batch_size}) 281 | tester:asserteq(seq:type(), 'torch.LongTensor') 282 | tester:assertge(torch.min(seq), 0) 283 | tester:assertle(torch.max(seq), opt.vocab_size+1) 284 | 285 | -- doing beam search with beam size 1 should return exactly what we had before 286 | print('') 287 | print('vanilla sampling:') 288 | print(seq_vanilla) 289 | print('beam search sampling with beam size 1:') 290 | print(seq) 291 | tester:assertTensorEq(seq_vanilla, seq, 0) -- these are LongTensors, expect exact match 292 | tester:assertTensorEq(logprobs_vanilla, logprobs, 1e-6) -- logprobs too 293 | 294 | -- doing beam search with higher beam size should yield higher likelihood sequences 295 | local seq2, logprobs2 = lm:sample(imgs, {beam_size = 8}) 296 | local logsum = torch.sum(logprobs, 1) 297 | local logsum2 = torch.sum(logprobs2, 1) 298 | print('') 299 | print('beam search sampling with beam size 1:') 300 | print(seq) 301 | print('beam search sampling with beam size 8:') 302 | print(seq2) 303 | print('logprobs:') 304 | print(logsum) 305 | print(logsum2) 306 | 307 | -- the logprobs should always be >=, since beam_search is better argmax inference 308 | tester:assert(torch.all(torch.gt(logsum2, logsum))) 309 | end 310 | 311 | tests.doubleApiForwardTest = forwardApiTestFactory('torch.DoubleTensor') 312 | tests.floatApiForwardTest = forwardApiTestFactory('torch.FloatTensor') 313 | tests.cudaApiForwardTest = forwardApiTestFactory('torch.CudaTensor') 314 | tests.gradCheck = gradCheck 315 | tests.gradCheckLM = gradCheckLM 316 | tests.overfit = overfit 317 | tests.sample = sample 318 | tests.sample_beam = sample_beam 319 | 320 | tester:add(tests) 321 | tester:run() 322 | -------------------------------------------------------------------------------- /train.lua: -------------------------------------------------------------------------------- 1 | 2 | require 'torch' 3 | require 'nn' 4 | require 'nngraph' 5 | -- exotic things 6 | require 'loadcaffe' 7 | -- local imports 8 | local utils = require 'misc.utils' 9 | require 'misc.DataLoader' 10 | require 'misc.LanguageModel' 11 | local net_utils = require 'misc.net_utils' 12 | require 'misc.optim_updates' 13 | 14 | ------------------------------------------------------------------------------- 15 | -- Input arguments and options 16 | ------------------------------------------------------------------------------- 17 | cmd = torch.CmdLine() 18 | cmd:text() 19 | cmd:text('Train an Image Captioning model') 20 | cmd:text() 21 | cmd:text('Options') 22 | 23 | -- Data input settings 24 | cmd:option('-input_h5','coco/data.h5','path to the h5file containing the preprocessed dataset') 25 | cmd:option('-input_json','coco/data.json','path to the json file containing additional info and vocab') 26 | cmd:option('-cnn_proto','model/VGG_ILSVRC_16_layers_deploy.prototxt','path to CNN prototxt file in Caffe format. Note this MUST be a VGGNet-16 right now.') 27 | cmd:option('-cnn_model','model/VGG_ILSVRC_16_layers.caffemodel','path to CNN model file containing the weights, Caffe format. Note this MUST be a VGGNet-16 right now.') 28 | cmd:option('-start_from', '', 'path to a model checkpoint to initialize model weights from. 
Empty = don\'t') 29 | 30 | -- Model settings 31 | cmd:option('-rnn_size',512,'size of the rnn in number of hidden nodes in each layer') 32 | cmd:option('-input_encoding_size',512,'the encoding size of each token in the vocabulary, and the image.') 33 | 34 | -- Optimization: General 35 | cmd:option('-max_iters', -1, 'max number of iterations to run for (-1 = run forever)') 36 | cmd:option('-batch_size',16,'what is the batch size in number of images per batch? (there will be x seq_per_img sentences)') 37 | cmd:option('-grad_clip',0.1,'clip gradients at this value (note should be lower than usual 5 because we normalize grads by both batch and seq_length)') 38 | cmd:option('-drop_prob_lm', 0.5, 'strength of dropout in the Language Model RNN') 39 | cmd:option('-finetune_cnn_after', -1, 'After what iteration do we start finetuning the CNN? (-1 = disable; never finetune, 0 = finetune from start)') 40 | cmd:option('-seq_per_img',5,'number of captions to sample for each image during training. Done for efficiency since CNN forward pass is expensive. E.g. coco has 5 sents/image') 41 | -- Optimization: for the Language Model 42 | cmd:option('-optim','adam','what update to use? rmsprop|sgd|sgdmom|adagrad|adam') 43 | cmd:option('-learning_rate',4e-4,'learning rate') 44 | cmd:option('-learning_rate_decay_start', -1, 'at what iteration to start decaying learning rate? (-1 = dont)') 45 | cmd:option('-learning_rate_decay_every', 50000, 'every how many iterations thereafter to drop LR by half?') 46 | cmd:option('-optim_alpha',0.8,'alpha for adagrad/rmsprop/momentum/adam') 47 | cmd:option('-optim_beta',0.999,'beta used for adam') 48 | cmd:option('-optim_epsilon',1e-8,'epsilon that goes into denominator for smoothing') 49 | -- Optimization: for the CNN 50 | cmd:option('-cnn_optim','adam','optimization to use for CNN') 51 | cmd:option('-cnn_optim_alpha',0.8,'alpha for momentum of CNN') 52 | cmd:option('-cnn_optim_beta',0.999,'alpha for momentum of CNN') 53 | cmd:option('-cnn_learning_rate',1e-5,'learning rate for the CNN') 54 | cmd:option('-cnn_weight_decay', 0, 'L2 weight decay just for the CNN') 55 | 56 | -- Evaluation/Checkpointing 57 | cmd:option('-val_images_use', 3200, 'how many images to use when periodically evaluating the validation loss? (-1 = all)') 58 | cmd:option('-save_checkpoint_every', 2500, 'how often to save a model checkpoint?') 59 | cmd:option('-checkpoint_path', '', 'folder to save checkpoints into (empty = this folder)') 60 | cmd:option('-language_eval', 0, 'Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 61 | cmd:option('-losses_log_every', 25, 'How often do we snapshot losses, for inclusion in the progress dump? (0 = disable)') 62 | 63 | -- misc 64 | cmd:option('-backend', 'cudnn', 'nn|cudnn') 65 | cmd:option('-id', '', 'an id identifying this run/job. used in cross-val and appended when writing progress files') 66 | cmd:option('-seed', 123, 'random number generator seed to use') 67 | cmd:option('-gpuid', 0, 'which gpu to use. 
-1 = use CPU') 68 | 69 | cmd:text() 70 | 71 | ------------------------------------------------------------------------------- 72 | -- Basic Torch initializations 73 | ------------------------------------------------------------------------------- 74 | local opt = cmd:parse(arg) 75 | torch.manualSeed(opt.seed) 76 | torch.setdefaulttensortype('torch.FloatTensor') -- for CPU 77 | 78 | if opt.gpuid >= 0 then 79 | require 'cutorch' 80 | require 'cunn' 81 | if opt.backend == 'cudnn' then require 'cudnn' end 82 | cutorch.manualSeed(opt.seed) 83 | cutorch.setDevice(opt.gpuid + 1) -- note +1 because lua is 1-indexed 84 | end 85 | 86 | ------------------------------------------------------------------------------- 87 | -- Create the Data Loader instance 88 | ------------------------------------------------------------------------------- 89 | local loader = DataLoader{h5_file = opt.input_h5, json_file = opt.input_json} 90 | 91 | ------------------------------------------------------------------------------- 92 | -- Initialize the networks 93 | ------------------------------------------------------------------------------- 94 | local protos = {} 95 | 96 | if string.len(opt.start_from) > 0 then 97 | -- load protos from file 98 | print('initializing weights from ' .. opt.start_from) 99 | local loaded_checkpoint = torch.load(opt.start_from) 100 | protos = loaded_checkpoint.protos 101 | net_utils.unsanitize_gradients(protos.cnn) 102 | local lm_modules = protos.lm:getModulesList() 103 | for k,v in pairs(lm_modules) do net_utils.unsanitize_gradients(v) end 104 | protos.crit = nn.LanguageModelCriterion() -- not in checkpoints, create manually 105 | protos.expander = nn.FeatExpander(opt.seq_per_img) -- not in checkpoints, create manually 106 | else 107 | -- create protos from scratch 108 | -- intialize language model 109 | local lmOpt = {} 110 | lmOpt.vocab_size = loader:getVocabSize() 111 | lmOpt.input_encoding_size = opt.input_encoding_size 112 | lmOpt.rnn_size = opt.rnn_size 113 | lmOpt.num_layers = 1 114 | lmOpt.dropout = opt.drop_prob_lm 115 | lmOpt.seq_length = loader:getSeqLength() 116 | lmOpt.batch_size = opt.batch_size * opt.seq_per_img 117 | protos.lm = nn.LanguageModel(lmOpt) 118 | -- initialize the ConvNet 119 | local cnn_backend = opt.backend 120 | if opt.gpuid == -1 then cnn_backend = 'nn' end -- override to nn if gpu is disabled 121 | local cnn_raw = loadcaffe.load(opt.cnn_proto, opt.cnn_model, cnn_backend) 122 | protos.cnn = net_utils.build_cnn(cnn_raw, {encoding_size = opt.input_encoding_size, backend = cnn_backend}) 123 | -- initialize a special FeatExpander module that "corrects" for the batch number discrepancy 124 | -- where we have multiple captions per one image in a batch. This is done for efficiency 125 | -- because doing a CNN forward pass is expensive. We expand out the CNN features for each sentence 126 | protos.expander = nn.FeatExpander(opt.seq_per_img) 127 | -- criterion for the language model 128 | protos.crit = nn.LanguageModelCriterion() 129 | end 130 | 131 | -- ship everything to GPU, maybe 132 | if opt.gpuid >= 0 then 133 | for k,v in pairs(protos) do v:cuda() end 134 | end 135 | 136 | -- flatten and prepare all model parameters to a single vector. 
137 | -- Keep CNN params separate in case we want to try to get fancy with different optims on LM/CNN 138 | local params, grad_params = protos.lm:getParameters() 139 | local cnn_params, cnn_grad_params = protos.cnn:getParameters() 140 | print('total number of parameters in LM: ', params:nElement()) 141 | print('total number of parameters in CNN: ', cnn_params:nElement()) 142 | assert(params:nElement() == grad_params:nElement()) 143 | assert(cnn_params:nElement() == cnn_grad_params:nElement()) 144 | 145 | -- construct thin module clones that share parameters with the actual 146 | -- modules. These thin module will have no intermediates and will be used 147 | -- for checkpointing to write significantly smaller checkpoint files 148 | local thin_lm = protos.lm:clone() 149 | thin_lm.core:share(protos.lm.core, 'weight', 'bias') -- TODO: we are assuming that LM has specific members! figure out clean way to get rid of, not modular. 150 | thin_lm.lookup_table:share(protos.lm.lookup_table, 'weight', 'bias') 151 | local thin_cnn = protos.cnn:clone('weight', 'bias') 152 | -- sanitize all modules of gradient storage so that we dont save big checkpoints 153 | net_utils.sanitize_gradients(thin_cnn) 154 | local lm_modules = thin_lm:getModulesList() 155 | for k,v in pairs(lm_modules) do net_utils.sanitize_gradients(v) end 156 | 157 | -- create clones and ensure parameter sharing. we have to do this 158 | -- all the way here at the end because calls such as :cuda() and 159 | -- :getParameters() reshuffle memory around. 160 | protos.lm:createClones() 161 | 162 | collectgarbage() -- "yeah, sure why not" 163 | ------------------------------------------------------------------------------- 164 | -- Validation evaluation 165 | ------------------------------------------------------------------------------- 166 | local function eval_split(split, evalopt) 167 | local verbose = utils.getopt(evalopt, 'verbose', true) 168 | local val_images_use = utils.getopt(evalopt, 'val_images_use', true) 169 | 170 | protos.cnn:evaluate() 171 | protos.lm:evaluate() 172 | loader:resetIterator(split) -- rewind iteator back to first datapoint in the split 173 | local n = 0 174 | local loss_sum = 0 175 | local loss_evals = 0 176 | local predictions = {} 177 | local vocab = loader:getVocab() 178 | while true do 179 | 180 | -- fetch a batch of data 181 | local data = loader:getBatch{batch_size = opt.batch_size, split = split, seq_per_img = opt.seq_per_img} 182 | data.images = net_utils.prepro(data.images, false, opt.gpuid >= 0) -- preprocess in place, and don't augment 183 | n = n + data.images:size(1) 184 | 185 | -- forward the model to get loss 186 | local feats = protos.cnn:forward(data.images) 187 | local expanded_feats = protos.expander:forward(feats) 188 | local logprobs = protos.lm:forward{expanded_feats, data.labels} 189 | local loss = protos.crit:forward(logprobs, data.labels) 190 | loss_sum = loss_sum + loss 191 | loss_evals = loss_evals + 1 192 | 193 | -- forward the model to also get generated samples for each image 194 | local seq = protos.lm:sample(feats) 195 | local sents = net_utils.decode_sequence(vocab, seq) 196 | for k=1,#sents do 197 | local entry = {image_id = data.infos[k].id, caption = sents[k]} 198 | table.insert(predictions, entry) 199 | if verbose then 200 | print(string.format('image %s: %s', entry.image_id, entry.caption)) 201 | end 202 | end 203 | 204 | -- if we wrapped around the split or used up val imgs budget then bail 205 | local ix0 = data.bounds.it_pos_now 206 | local ix1 = math.min(data.bounds.it_max, 
val_images_use) 207 | if verbose then 208 | print(string.format('evaluating validation performance... %d/%d (%f)', ix0-1, ix1, loss)) 209 | end 210 | 211 | if loss_evals % 10 == 0 then collectgarbage() end 212 | if data.bounds.wrapped then break end -- the split ran out of data, lets break out 213 | if n >= val_images_use then break end -- we've used enough images 214 | end 215 | 216 | local lang_stats 217 | if opt.language_eval == 1 then 218 | lang_stats = net_utils.language_eval(predictions, opt.id) 219 | end 220 | 221 | return loss_sum/loss_evals, predictions, lang_stats 222 | end 223 | 224 | ------------------------------------------------------------------------------- 225 | -- Loss function 226 | ------------------------------------------------------------------------------- 227 | local iter = 0 228 | local function lossFun() 229 | protos.cnn:training() 230 | protos.lm:training() 231 | grad_params:zero() 232 | if opt.finetune_cnn_after >= 0 and iter >= opt.finetune_cnn_after then 233 | cnn_grad_params:zero() 234 | end 235 | 236 | ----------------------------------------------------------------------------- 237 | -- Forward pass 238 | ----------------------------------------------------------------------------- 239 | -- get batch of data 240 | local data = loader:getBatch{batch_size = opt.batch_size, split = 'train', seq_per_img = opt.seq_per_img} 241 | data.images = net_utils.prepro(data.images, true, opt.gpuid >= 0) -- preprocess in place, do data augmentation 242 | -- data.images: Nx3x224x224 243 | -- data.seq: LxM where L is sequence length upper bound, and M = N*seq_per_img 244 | 245 | -- forward the ConvNet on images (most work happens here) 246 | local feats = protos.cnn:forward(data.images) 247 | -- we have to expand out image features, once for each sentence 248 | local expanded_feats = protos.expander:forward(feats) 249 | -- forward the language model 250 | local logprobs = protos.lm:forward{expanded_feats, data.labels} 251 | -- forward the language model criterion 252 | local loss = protos.crit:forward(logprobs, data.labels) 253 | 254 | ----------------------------------------------------------------------------- 255 | -- Backward pass 256 | ----------------------------------------------------------------------------- 257 | -- backprop criterion 258 | local dlogprobs = protos.crit:backward(logprobs, data.labels) 259 | -- backprop language model 260 | local dexpanded_feats, ddummy = unpack(protos.lm:backward({expanded_feats, data.labels}, dlogprobs)) 261 | -- backprop the CNN, but only if we are finetuning 262 | if opt.finetune_cnn_after >= 0 and iter >= opt.finetune_cnn_after then 263 | local dfeats = protos.expander:backward(feats, dexpanded_feats) 264 | local dx = protos.cnn:backward(data.images, dfeats) 265 | end 266 | 267 | -- clip gradients 268 | -- print(string.format('claming %f%% of gradients', 100*torch.mean(torch.gt(torch.abs(grad_params), opt.grad_clip)))) 269 | grad_params:clamp(-opt.grad_clip, opt.grad_clip) 270 | 271 | -- apply L2 regularization 272 | if opt.cnn_weight_decay > 0 then 273 | cnn_grad_params:add(opt.cnn_weight_decay, cnn_params) 274 | -- note: we don't bother adding the l2 loss to the total loss, meh. 275 | cnn_grad_params:clamp(-opt.grad_clip, opt.grad_clip) 276 | end 277 | ----------------------------------------------------------------------------- 278 | 279 | -- and lets get out! 
280 | local losses = { total_loss = loss } 281 | return losses 282 | end 283 | 284 | ------------------------------------------------------------------------------- 285 | -- Main loop 286 | ------------------------------------------------------------------------------- 287 | local loss0 288 | local optim_state = {} 289 | local cnn_optim_state = {} 290 | local loss_history = {} 291 | local val_lang_stats_history = {} 292 | local val_loss_history = {} 293 | local best_score 294 | while true do 295 | 296 | -- eval loss/gradient 297 | local losses = lossFun() 298 | if iter % opt.losses_log_every == 0 then loss_history[iter] = losses.total_loss end 299 | print(string.format('iter %d: %f', iter, losses.total_loss)) 300 | 301 | -- save checkpoint once in a while (or on final iteration) 302 | if (iter % opt.save_checkpoint_every == 0 or iter == opt.max_iters) then 303 | 304 | -- evaluate the validation performance 305 | local val_loss, val_predictions, lang_stats = eval_split('val', {val_images_use = opt.val_images_use}) 306 | print('validation loss: ', val_loss) 307 | print(lang_stats) 308 | val_loss_history[iter] = val_loss 309 | if lang_stats then 310 | val_lang_stats_history[iter] = lang_stats 311 | end 312 | 313 | local checkpoint_path = path.join(opt.checkpoint_path, 'model_id' .. opt.id) 314 | 315 | -- write a (thin) json report 316 | local checkpoint = {} 317 | checkpoint.opt = opt 318 | checkpoint.iter = iter 319 | checkpoint.loss_history = loss_history 320 | checkpoint.val_loss_history = val_loss_history 321 | checkpoint.val_predictions = val_predictions -- save these too for CIDEr/METEOR/etc eval 322 | checkpoint.val_lang_stats_history = val_lang_stats_history 323 | 324 | utils.write_json(checkpoint_path .. '.json', checkpoint) 325 | print('wrote json checkpoint to ' .. checkpoint_path .. '.json') 326 | 327 | -- write the full model checkpoint as well if we did better than ever 328 | local current_score 329 | if lang_stats then 330 | -- use CIDEr score for deciding how well we did 331 | current_score = lang_stats['CIDEr'] 332 | else 333 | -- use the (negative) validation loss as a score 334 | current_score = -val_loss 335 | end 336 | if best_score == nil or current_score > best_score then 337 | best_score = current_score 338 | if iter > 0 then -- dont save on very first iteration 339 | -- include the protos (which have weights) and save to file 340 | local save_protos = {} 341 | save_protos.lm = thin_lm -- these are shared clones, and point to correct param storage 342 | save_protos.cnn = thin_cnn 343 | checkpoint.protos = save_protos 344 | -- also include the vocabulary mapping so that we can use the checkpoint 345 | -- alone to run on arbitrary images without the data loader 346 | checkpoint.vocab = loader:getVocab() 347 | torch.save(checkpoint_path .. '.t7', checkpoint) 348 | print('wrote checkpoint to ' .. checkpoint_path .. 
'.t7') 349 | end 350 | end 351 | end 352 | 353 | -- decay the learning rate for both LM and CNN 354 | local learning_rate = opt.learning_rate 355 | local cnn_learning_rate = opt.cnn_learning_rate 356 | if iter > opt.learning_rate_decay_start and opt.learning_rate_decay_start >= 0 then 357 | local frac = (iter - opt.learning_rate_decay_start) / opt.learning_rate_decay_every 358 | local decay_factor = math.pow(0.5, frac) 359 | learning_rate = learning_rate * decay_factor -- set the decayed rate 360 | cnn_learning_rate = cnn_learning_rate * decay_factor 361 | end 362 | 363 | -- perform a parameter update 364 | if opt.optim == 'rmsprop' then 365 | rmsprop(params, grad_params, learning_rate, opt.optim_alpha, opt.optim_epsilon, optim_state) 366 | elseif opt.optim == 'adagrad' then 367 | adagrad(params, grad_params, learning_rate, opt.optim_epsilon, optim_state) 368 | elseif opt.optim == 'sgd' then 369 | sgd(params, grad_params, learning_rate) -- use the decayed rate, consistent with the other optimizers 370 | elseif opt.optim == 'sgdm' then 371 | sgdm(params, grad_params, learning_rate, opt.optim_alpha, optim_state) 372 | elseif opt.optim == 'sgdmom' then 373 | sgdmom(params, grad_params, learning_rate, opt.optim_alpha, optim_state) 374 | elseif opt.optim == 'adam' then 375 | adam(params, grad_params, learning_rate, opt.optim_alpha, opt.optim_beta, opt.optim_epsilon, optim_state) 376 | else 377 | error('bad option opt.optim') 378 | end 379 | 380 | -- do a cnn update (if finetuning, and if rnn above us is not warming up right now) 381 | if opt.finetune_cnn_after >= 0 and iter >= opt.finetune_cnn_after then 382 | if opt.cnn_optim == 'sgd' then 383 | sgd(cnn_params, cnn_grad_params, cnn_learning_rate) 384 | elseif opt.cnn_optim == 'sgdm' then 385 | sgdm(cnn_params, cnn_grad_params, cnn_learning_rate, opt.cnn_optim_alpha, cnn_optim_state) 386 | elseif opt.cnn_optim == 'adam' then 387 | adam(cnn_params, cnn_grad_params, cnn_learning_rate, opt.cnn_optim_alpha, opt.cnn_optim_beta, opt.optim_epsilon, cnn_optim_state) 388 | else 389 | error('bad option for opt.cnn_optim') 390 | end 391 | end 392 | 393 | -- stopping criteria 394 | iter = iter + 1 395 | if iter % 10 == 0 then collectgarbage() end -- good idea to do this once in a while, i think 396 | if loss0 == nil then loss0 = losses.total_loss end 397 | if losses.total_loss > loss0 * 20 then 398 | print('loss seems to be exploding, quitting.') 399 | break 400 | end 401 | if opt.max_iters > 0 and iter >= opt.max_iters then break end -- stopping criterion 402 | 403 | end 404 | -------------------------------------------------------------------------------- /videocaptioning.lua: -------------------------------------------------------------------------------- 1 | require 'torch' 2 | require 'nn' 3 | require 'nngraph' 4 | -- exotics 5 | -- local imports 6 | local utils = require 'misc.utils' 7 | require 'misc.DataLoader' 8 | require 'misc.DataLoaderRaw' 9 | require 'misc.LanguageModel' 10 | local net_utils = require 'misc.net_utils' 11 | 12 | local cv = require 'cv' 13 | require 'cv.highgui' 14 | require 'cv.videoio' 15 | require 'cv.imgcodecs' 16 | require 'cv.imgproc' 17 | 18 | ------------------------------------------------------------------------------- 19 | -- Input arguments and options 20 | ------------------------------------------------------------------------------- 21 | cmd = torch.CmdLine() 22 | cmd:text() 23 | cmd:text('Caption a live video stream with an Image Captioning model') 24 | cmd:text() 25 | cmd:text('Options') 26 | 27 | -- Input paths 28 | cmd:option('-model','','path to model to evaluate') 29 | -- Basic options 30 
| cmd:option('-batch_size', 1, 'if > 0 then overrule, otherwise load from checkpoint.') 31 | cmd:option('-num_images', 100, 'how many images to use when periodically evaluating the loss? (-1 = all)') 32 | cmd:option('-language_eval', 0, 'Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 33 | cmd:option('-dump_images', 1, 'Dump images into vis/imgs folder for vis? (1=yes,0=no)') 34 | cmd:option('-dump_json', 1, 'Dump json with predictions into vis folder? (1=yes,0=no)') 35 | cmd:option('-dump_path', 0, 'Write image paths along with predictions into vis json? (1=yes,0=no)') 36 | -- Sampling options 37 | cmd:option('-sample_max', 1, '1 = sample argmax words. 0 = sample from distributions.') 38 | cmd:option('-beam_size', 2, 'used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 39 | cmd:option('-temperature', 1.0, 'temperature when sampling from distributions (i.e. when sample_max = 0). Lower = "safer" predictions.') 40 | -- misc 41 | cmd:option('-backend', 'cudnn', 'nn|cudnn') 42 | cmd:option('-id', 'evalscript', 'an id identifying this run/job. used only if language_eval = 1 for appending to intermediate files') 43 | cmd:option('-seed', 123, 'random number generator seed to use') 44 | cmd:option('-gpuid', 0, 'which gpu to use. -1 = use CPU') 45 | cmd:text() 46 | 47 | ------------------------------------------------------------------------------- 48 | -- Basic Torch initializations 49 | ------------------------------------------------------------------------------- 50 | local opt = cmd:parse(arg) 51 | torch.manualSeed(opt.seed) 52 | torch.setdefaulttensortype('torch.FloatTensor') -- for CPU 53 | 54 | if opt.gpuid >= 0 then 55 | require 'cutorch' 56 | require 'cunn' 57 | if opt.backend == 'cudnn' then require 'cudnn' end 58 | cutorch.manualSeed(opt.seed) 59 | cutorch.setDevice(opt.gpuid + 1) -- note +1 because lua is 1-indexed 60 | end 61 | 62 | cv.namedWindow{winname="NeuralTalk2", flags=cv.WINDOW_AUTOSIZE} 63 | local cap = cv.VideoCapture{device=0} 64 | if not cap:isOpened() then 65 | print("Failed to open the default camera") 66 | os.exit(-1) 67 | end 68 | local _, frame = cap:read{} 69 | 70 | ------------------------------------------------------------------------------- 71 | -- Load the model checkpoint to evaluate 72 | ------------------------------------------------------------------------------- 73 | assert(string.len(opt.model) > 0, 'must provide a model') 74 | local checkpoint = torch.load(opt.model) 75 | -- override and collect parameters 76 | if opt.batch_size == 0 then opt.batch_size = checkpoint.opt.batch_size end 77 | local fetch = {'rnn_size', 'input_encoding_size', 'drop_prob_lm', 'cnn_proto', 'cnn_model', 'seq_per_img'} 78 | for k,v in pairs(fetch) do 79 | opt[v] = checkpoint.opt[v] -- copy over options from model 80 | end 81 | local vocab = checkpoint.vocab -- ix -> word mapping 82 | 83 | ------------------------------------------------------------------------------- 84 | -- Load the networks from model checkpoint 85 | ------------------------------------------------------------------------------- 86 | local protos = checkpoint.protos 87 | protos.expander = nn.FeatExpander(opt.seq_per_img) 88 | protos.lm:createClones() -- reconstruct clones inside the language model 89 | if opt.gpuid >= 0 then for k,v in pairs(protos) do v:cuda() end end 90 | 91 | 
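-- Example invocation (an illustrative sketch; the checkpoint path is a placeholder,
-- the flags are the options defined above):
--   th videocaptioning.lua -model /path/to/checkpoint.t7 -gpuid 0 -beam_size 2
-- This needs the OpenCV bindings required at the top of this file and a camera on
-- device 0; pass -gpuid -1 to run on the CPU instead (see the -gpuid option above).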
------------------------------------------------------------------------------- 92 | -- Evaluation fun(ction) 93 | ------------------------------------------------------------------------------- 94 | 95 | local function run() 96 | protos.cnn:evaluate() 97 | protos.lm:evaluate() 98 | 99 | while true do 100 | local w = frame:size(2) 101 | local h = frame:size(1) 102 | 103 | -- take a central crop 104 | local crop = cv.getRectSubPix{image=frame, patchSize={h,h}, center={w/2, h/2}} 105 | local cropsc = cv.resize{src=crop, dsize={256,256}} 106 | -- BGR2RGB 107 | cropsc = cropsc:index(3,torch.LongTensor{3,2,1}) 108 | -- HWC2CHW 109 | cropsc = cropsc:permute(3,1,2) 110 | 111 | -- fetch a batch of data 112 | local batch = cropsc:contiguous():view(1,3,256,256) 113 | local batch_processed = net_utils.prepro(batch, false, opt.gpuid >= 0) -- preprocess in place, and don't augment 114 | 115 | -- forward the model to get loss 116 | local feats = protos.cnn:forward(batch_processed) 117 | 118 | -- forward the model to also get generated samples for each image 119 | local sample_opts = { sample_max = opt.sample_max, beam_size = opt.beam_size, temperature = opt.temperature } 120 | local seq = protos.lm:sample(feats, sample_opts) 121 | local sents = net_utils.decode_sequence(vocab, seq) 122 | 123 | print(sents[1]) 124 | 125 | cv.putText{ 126 | img=crop, 127 | text = sents[1], 128 | org={10,20}, 129 | fontFace=cv.FONT_HERSHEY_DUPLEX, 130 | fontScale=0.5, 131 | color={255, 255, 0}, 132 | thickness=1 133 | } 134 | 135 | cv.imshow{winname="NeuralTalk2", image=crop} 136 | if cv.waitKey{30} >= 0 then break end 137 | 138 | cap:read{image=frame} 139 | end 140 | end 141 | 142 | run() 143 | -------------------------------------------------------------------------------- /vis/imgs/dummy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/karpathy/neuraltalk2/bd8c9d879f957e1218a8f9e1f9b663ac70375866/vis/imgs/dummy -------------------------------------------------------------------------------- /vis/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 |