├── 0_Dataset.ipynb ├── 1_Preliminaries.ipynb ├── 2_Training (2).ipynb ├── 3_Inference (2).ipynb ├── Images ├── decoder.png ├── download_ex.png ├── encoder-decoder.png.crdownload └── image ├── LICENSE ├── README.md ├── data_loader.py ├── model (2).py └── vocabulary.py /0_Dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Computer Vision Nanodegree\n", 8 | "\n", 9 | "## Project: Image Captioning\n", 10 | "\n", 11 | "---\n", 12 | "\n", 13 | "The Microsoft **C**ommon **O**bjects in **CO**ntext (MS COCO) dataset is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms. \n", 14 | "\n", 15 | "![Sample Dog Output](images/coco-examples.jpg)\n", 16 | "\n", 17 | "You can read more about the dataset on the [website](http://cocodataset.org/#home) or in the [research paper](https://arxiv.org/pdf/1405.0312.pdf).\n", 18 | "\n", 19 | "In this notebook, you will explore this dataset, in preparation for the project.\n", 20 | "\n", 21 | "## Step 1: Initialize the COCO API\n", 22 | "\n", 23 | "We begin by initializing the [COCO API](https://github.com/cocodataset/cocoapi) that you will use to obtain the data." 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import os\n", 33 | "import sys\n", 34 | "sys.path.append('/opt/cocoapi/PythonAPI')\n", 35 | "from pycocotools.coco import COCO\n", 36 | "\n", 37 | "# initialize COCO API for instance annotations\n", 38 | "dataDir = '/opt/cocoapi'\n", 39 | "dataType = 'val2014'\n", 40 | "instances_annFile = os.path.join(dataDir, 'annotations/instances_{}.json'.format(dataType))\n", 41 | "coco = COCO(instances_annFile)\n", 42 | "\n", 43 | "# initialize COCO API for caption annotations\n", 44 | "captions_annFile = os.path.join(dataDir, 'annotations/captions_{}.json'.format(dataType))\n", 45 | "coco_caps = COCO(captions_annFile)\n", 46 | "\n", 47 | "# get image ids \n", 48 | "ids = list(coco.anns.keys())" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Step 2: Plot a Sample Image\n", 56 | "\n", 57 | "Next, we plot a random image from the dataset, along with its five corresponding captions. Each time you run the code cell below, a different image is selected. \n", 58 | "\n", 59 | "In the project, you will use this dataset to train your own model to generate captions from images!" 
60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "import numpy as np\n", 69 | "import skimage.io as io\n", 70 | "import matplotlib.pyplot as plt\n", 71 | "%matplotlib inline\n", 72 | "\n", 73 | "# pick a random image and obtain the corresponding URL\n", 74 | "ann_id = np.random.choice(ids)\n", 75 | "img_id = coco.anns[ann_id]['image_id']\n", 76 | "img = coco.loadImgs(img_id)[0]\n", 77 | "url = img['coco_url']\n", 78 | "\n", 79 | "# print URL and visualize corresponding image\n", 80 | "print(url)\n", 81 | "I = io.imread(url)\n", 82 | "plt.axis('off')\n", 83 | "plt.imshow(I)\n", 84 | "plt.show()\n", 85 | "\n", 86 | "# load and display captions\n", 87 | "annIds = coco_caps.getAnnIds(imgIds=img['id']);\n", 88 | "anns = coco_caps.loadAnns(annIds)\n", 89 | "coco_caps.showAnns(anns)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Step 3: What's to Come!\n", 97 | "\n", 98 | "In this project, you will use the dataset of image-caption pairs to train a CNN-RNN model to automatically generate captions from images. You'll learn more about how to design the architecture in the next notebook in the sequence (**1_Preliminaries.ipynb**).\n", 99 | "\n", 100 | "![Image Captioning CNN-RNN model](images/encoder-decoder.png)" 101 | ] 102 | } 103 | ], 104 | "metadata": { 105 | "anaconda-cloud": {}, 106 | "kernelspec": { 107 | "display_name": "Python 3", 108 | "language": "python", 109 | "name": "python3" 110 | }, 111 | "language_info": { 112 | "codemirror_mode": { 113 | "name": "ipython", 114 | "version": 3 115 | }, 116 | "file_extension": ".py", 117 | "mimetype": "text/x-python", 118 | "name": "python", 119 | "nbconvert_exporter": "python", 120 | "pygments_lexer": "ipython3", 121 | "version": "3.6.3" 122 | } 123 | }, 124 | "nbformat": 4, 125 | "nbformat_minor": 2 126 | } 127 | -------------------------------------------------------------------------------- /1_Preliminaries.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Computer Vision Nanodegree\n", 8 | "\n", 9 | "## Project: Image Captioning\n", 10 | "\n", 11 | "---\n", 12 | "\n", 13 | "In this notebook, you will learn how to load and pre-process data from the [COCO dataset](http://cocodataset.org/#home). You will also design a CNN-RNN model for automatically generating image captions.\n", 14 | "\n", 15 | "Note that **any amendments that you make to this notebook will not be graded**. However, you will use the instructions provided in **Step 3** and **Step 4** to implement your own CNN encoder and RNN decoder by making amendments to the **model.py** file provided as part of this project. Your **model.py** file **will be graded**. \n", 16 | "\n", 17 | "Feel free to use the links below to navigate the notebook:\n", 18 | "- [Step 1](#step1): Explore the Data Loader\n", 19 | "- [Step 2](#step2): Use the Data Loader to Obtain Batches\n", 20 | "- [Step 3](#step3): Experiment with the CNN Encoder\n", 21 | "- [Step 4](#step4): Implement the RNN Decoder" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "\n", 29 | "## Step 1: Explore the Data Loader\n", 30 | "\n", 31 | "We have already written a [data loader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) that you can use to load the COCO dataset in batches. 
\n", 32 | "\n", 33 | "In the code cell below, you will initialize the data loader by using the `get_loader` function in **data_loader.py**. \n", 34 | "\n", 35 | "> For this project, you are not permitted to change the **data_loader.py** file, which must be used as-is.\n", 36 | "\n", 37 | "The `get_loader` function takes as input a number of arguments that can be explored in **data_loader.py**. Take the time to explore these arguments now by opening **data_loader.py** in a new window. Most of the arguments must be left at their default values, and you are only allowed to amend the values of the arguments below:\n", 38 | "1. **`transform`** - an [image transform](http://pytorch.org/docs/master/torchvision/transforms.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder. For now, you are encouraged to keep the transform as provided in `transform_train`. You will have the opportunity later to choose your own image transform to pre-process the COCO images.\n", 39 | "2. **`mode`** - one of `'train'` (loads the training data in batches) or `'test'` (for the test data). We will say that the data loader is in training or test mode, respectively. While following the instructions in this notebook, please keep the data loader in training mode by setting `mode='train'`.\n", 40 | "3. **`batch_size`** - determines the batch size. When training the model, this is the number of image-caption pairs used to amend the model weights in each training step.\n", 41 | "4. **`vocab_threshold`** - the total number of times that a word must appear in the training captions before it is used as part of the vocabulary. Words that have fewer than `vocab_threshold` occurrences in the training captions are considered unknown words. \n", 42 | "5. **`vocab_from_file`** - a Boolean that decides whether to load the vocabulary from file. \n", 43 | "\n", 44 | "We will describe the `vocab_threshold` and `vocab_from_file` arguments in more detail soon. For now, run the code cell below. Be patient - it may take a couple of minutes to run!"
45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 1, 50 | "metadata": { 51 | "scrolled": false 52 | }, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "Requirement already satisfied: nltk in /opt/conda/lib/python3.6/site-packages\n", 59 | "Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from nltk)\n", 60 | "\u001b[33mYou are using pip version 9.0.1, however version 18.1 is available.\n", 61 | "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", 62 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 63 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 64 | "loading annotations into memory...\n", 65 | "Done (t=0.88s)\n", 66 | "creating index...\n", 67 | "index created!\n", 68 | "[0/414113] Tokenizing captions...\n", 69 | "[100000/414113] Tokenizing captions...\n", 70 | "[200000/414113] Tokenizing captions...\n", 71 | "[300000/414113] Tokenizing captions...\n", 72 | "[400000/414113] Tokenizing captions...\n", 73 | "loading annotations into memory...\n", 74 | "Done (t=0.95s)\n", 75 | "creating index...\n" 76 | ] 77 | }, 78 | { 79 | "name": "stderr", 80 | "output_type": "stream", 81 | "text": [ 82 | " 0%| | 1201/414113 [00:00<01:08, 6007.85it/s]" 83 | ] 84 | }, 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "index created!\n", 90 | "Obtaining caption lengths...\n" 91 | ] 92 | }, 93 | { 94 | "name": "stderr", 95 | "output_type": "stream", 96 | "text": [ 97 | "100%|██████████| 414113/414113 [01:11<00:00, 5800.50it/s]\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "import sys\n", 103 | "sys.path.append('/opt/cocoapi/PythonAPI')\n", 104 | "from pycocotools.coco import COCO\n", 105 | "!pip install nltk\n", 106 | "import nltk\n", 107 | "nltk.download('punkt')\n", 108 | "from data_loader import get_loader\n", 109 | "from torchvision import transforms\n", 110 | "\n", 111 | "# Define a transform to pre-process the training images.\n", 112 | "transform_train = transforms.Compose([ \n", 113 | " transforms.Resize(256), # smaller edge of image resized to 256\n", 114 | " transforms.RandomCrop(224), # get 224x224 crop from random location\n", 115 | " transforms.RandomHorizontalFlip(), # horizontally flip image with probability=0.5\n", 116 | " transforms.ToTensor(), # convert the PIL Image to a tensor\n", 117 | " transforms.Normalize((0.485, 0.456, 0.406), # normalize image for pre-trained model\n", 118 | " (0.229, 0.224, 0.225))])\n", 119 | "\n", 120 | "# Set the minimum word count threshold.\n", 121 | "vocab_threshold = 5\n", 122 | "\n", 123 | "# Specify the batch size.\n", 124 | "batch_size = 10\n", 125 | "\n", 126 | "# Obtain the data loader.\n", 127 | "data_loader = get_loader(transform=transform_train,\n", 128 | " mode='train',\n", 129 | " batch_size=batch_size,\n", 130 | " vocab_threshold=vocab_threshold,\n", 131 | " vocab_from_file=False)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "When you ran the code cell above, the data loader was stored in the variable `data_loader`. \n", 139 | "\n", 140 | "You can access the corresponding dataset as `data_loader.dataset`. This dataset is an instance of the `CoCoDataset` class in **data_loader.py**. 
If you are unfamiliar with data loaders and datasets, you are encouraged to review [this PyTorch tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html).\n", 141 | "\n", 142 | "### Exploring the `__getitem__` Method\n", 143 | "\n", 144 | "The `__getitem__` method in the `CoCoDataset` class determines how an image-caption pair is pre-processed before being incorporated into a batch. This is true for all `Dataset` classes in PyTorch; if this is unfamiliar to you, please review [the tutorial linked above](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html). \n", 145 | "\n", 146 | "When the data loader is in training mode, this method begins by first obtaining the filename (`path`) of a training image and its corresponding caption (`caption`).\n", 147 | "\n", 148 | "#### Image Pre-Processing \n", 149 | "\n", 150 | "Image pre-processing is relatively straightforward (from the `__getitem__` method in the `CoCoDataset` class):\n", 151 | "```python\n", 152 | "# Convert image to tensor and pre-process using transform\n", 153 | "image = Image.open(os.path.join(self.img_folder, path)).convert('RGB')\n", 154 | "image = self.transform(image)\n", 155 | "```\n", 156 | "After loading the image in the training folder with name `path`, the image is pre-processed using the same transform (`transform_train`) that was supplied when instantiating the data loader. \n", 157 | "\n", 158 | "#### Caption Pre-Processing \n", 159 | "\n", 160 | "The captions also need to be pre-processed and prepped for training. In this example, for generating captions, we are aiming to create a model that predicts the next token of a sentence from previous tokens, so we turn the caption associated with any image into a list of tokenized words, before casting it to a PyTorch tensor that we can use to train the network.\n", 161 | "\n", 162 | "To understand in more detail how COCO captions are pre-processed, we'll first need to take a look at the `vocab` instance variable of the `CoCoDataset` class. The code snippet below is pulled from the `__init__` method of the `CoCoDataset` class:\n", 163 | "```python\n", 164 | "def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, \n", 165 | " end_word, unk_word, annotations_file, vocab_from_file, img_folder):\n", 166 | " ...\n", 167 | " self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word,\n", 168 | " end_word, unk_word, annotations_file, vocab_from_file)\n", 169 | " ...\n", 170 | "```\n", 171 | "From the code snippet above, you can see that `data_loader.dataset.vocab` is an instance of the `Vocabulary` class from **vocabulary.py**. Take the time now to verify this for yourself by looking at the full code in **data_loader.py**. \n", 172 | "\n", 173 | "We use this instance to pre-process the COCO captions (from the `__getitem__` method in the `CoCoDataset` class):\n", 174 | "\n", 175 | "```python\n", 176 | "# Convert caption to tensor of word ids.\n", 177 | "tokens = nltk.tokenize.word_tokenize(str(caption).lower()) # line 1\n", 178 | "caption = [] # line 2\n", 179 | "caption.append(self.vocab(self.vocab.start_word)) # line 3\n", 180 | "caption.extend([self.vocab(token) for token in tokens]) # line 4\n", 181 | "caption.append(self.vocab(self.vocab.end_word)) # line 5\n", 182 | "caption = torch.Tensor(caption).long() # line 6\n", 183 | "```\n", 184 | "\n", 185 | "As you will see soon, this code converts any string-valued caption to a list of integers, before casting it to a PyTorch tensor. 
To see how this code works, we'll apply it to the sample caption in the next code cell." 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 2, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "sample_caption = 'A person doing a trick on a rail while riding a skateboard.'" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "In **`line 1`** of the code snippet, every letter in the caption is converted to lowercase, and the [`nltk.tokenize.word_tokenize`](http://www.nltk.org/) function is used to obtain a list of string-valued tokens. Run the next code cell to visualize the effect on `sample_caption`." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 3, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "['a', 'person', 'doing', 'a', 'trick', 'on', 'a', 'rail', 'while', 'riding', 'a', 'skateboard', '.']\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "import nltk\n", 219 | "\n", 220 | "sample_tokens = nltk.tokenize.word_tokenize(str(sample_caption).lower())\n", 221 | "print(sample_tokens)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "In **`line 2`** and **`line 3`** we initialize an empty list and append an integer to mark the start of a caption. The [paper](https://arxiv.org/pdf/1411.4555.pdf) that you are encouraged to implement uses a special start word (and a special end word, which we'll examine below) to mark the beginning (and end) of a caption.\n", 229 | "\n", 230 | "This special start word (`\"<start>\"`) is decided when instantiating the data loader and is passed as a parameter (`start_word`). You are **required** to keep this parameter at its default value (`start_word=\"<start>\"`).\n", 231 | "\n", 232 | "As you will see below, the integer `0` is always used to mark the start of a caption." 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 4, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "Special start word: <start>\n", 245 | "[0]\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "sample_caption = []\n", 251 | "\n", 252 | "start_word = data_loader.dataset.vocab.start_word\n", 253 | "print('Special start word:', start_word)\n", 254 | "sample_caption.append(data_loader.dataset.vocab(start_word))\n", 255 | "print(sample_caption)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "In **`line 4`**, we continue the list by adding integers that correspond to each of the tokens in the caption." 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 5, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "[0, 3, 98, 754, 3, 396, 39, 3, 1009, 207, 139, 3, 753, 18]\n" 275 | ] 276 | } 277 | ], 278 | "source": [ 279 | "sample_caption.extend([data_loader.dataset.vocab(token) for token in sample_tokens])\n", 280 | "print(sample_caption)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "In **`line 5`**, we append a final integer to mark the end of the caption. \n", 288 | "\n", 289 | "Identical to the case of the special start word (above), the special end word (`\"<end>\"`) is decided when instantiating the data loader and is passed as a parameter (`end_word`). 
You are **required** to keep this parameter at its default value (`end_word=\"<end>\"`).\n", 290 | "\n", 291 | "As you will see below, the integer `1` is always used to mark the end of a caption." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 6, 297 | "metadata": {}, 298 | "outputs": [ 299 | { 300 | "name": "stdout", 301 | "output_type": "stream", 302 | "text": [ 303 | "Special end word: <end>\n", 304 | "[0, 3, 98, 754, 3, 396, 39, 3, 1009, 207, 139, 3, 753, 18, 1]\n" 305 | ] 306 | } 307 | ], 308 | "source": [ 309 | "end_word = data_loader.dataset.vocab.end_word\n", 310 | "print('Special end word:', end_word)\n", 311 | "\n", 312 | "sample_caption.append(data_loader.dataset.vocab(end_word))\n", 313 | "print(sample_caption)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Finally, in **`line 6`**, we convert the list of integers to a PyTorch tensor and cast it to [long type](http://pytorch.org/docs/master/tensors.html#torch.Tensor.long). You can read more about the different types of PyTorch tensors on the [website](http://pytorch.org/docs/master/tensors.html)." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 7, 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "tensor([ 0, 3, 98, 754, 3, 396, 39, 3, 1009,\n", 333 | " 207, 139, 3, 753, 18, 1])\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "import torch\n", 339 | "\n", 340 | "sample_caption = torch.Tensor(sample_caption).long()\n", 341 | "print(sample_caption)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "And that's it! In summary, any caption is converted to a list of tokens, with _special_ start and end tokens marking the beginning and end of the sentence:\n", 349 | "```\n", 350 | "[<start>, 'a', 'person', 'doing', 'a', 'trick', 'on', 'a', 'rail', 'while', 'riding', 'a', 'skateboard', '.', <end>]\n", 351 | "```\n", 352 | "This list of tokens is then turned into a list of integers, where every distinct word in the vocabulary has an associated integer value:\n", 353 | "```\n", 354 | "[0, 3, 98, 754, 3, 396, 39, 3, 1009, 207, 139, 3, 753, 18, 1]\n", 355 | "```\n", 356 | "Finally, this list is converted to a PyTorch tensor. All of the captions in the COCO dataset are pre-processed using this same procedure from **`lines 1-6`** described above. \n", 357 | "\n", 358 | "As you saw, in order to convert a token to its corresponding integer, we call `data_loader.dataset.vocab` as a function. The details of how this call works can be explored in the `__call__` method in the `Vocabulary` class in **vocabulary.py**. \n", 359 | "\n", 360 | "```python\n", 361 | "def __call__(self, word):\n", 362 | " if not word in self.word2idx:\n", 363 | " return self.word2idx[self.unk_word]\n", 364 | " return self.word2idx[word]\n", 365 | "```\n", 366 | "\n", 367 | "The `word2idx` instance variable is a Python [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) that is indexed by string-valued keys (mostly tokens obtained from training captions). For each key, the corresponding value is the integer that the token is mapped to in the pre-processing step.\n", 368 | "\n", 369 | "Use the code cell below to view a subset of this dictionary." 
370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 8, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/plain": [ 380 | "{'<start>': 0,\n", 381 | " '<end>': 1,\n", 382 | " '<unk>': 2,\n", 383 | " 'a': 3,\n", 384 | " 'very': 4,\n", 385 | " 'clean': 5,\n", 386 | " 'and': 6,\n", 387 | " 'well': 7,\n", 388 | " 'decorated': 8,\n", 389 | " 'empty': 9}" 390 | ] 391 | }, 392 | "execution_count": 8, 393 | "metadata": {}, 394 | "output_type": "execute_result" 395 | } 396 | ], 397 | "source": [ 398 | "# Preview the word2idx dictionary.\n", 399 | "dict(list(data_loader.dataset.vocab.word2idx.items())[:10])" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "We also print the total number of keys." 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 9, 412 | "metadata": {}, 413 | "outputs": [ 414 | { 415 | "name": "stdout", 416 | "output_type": "stream", 417 | "text": [ 418 | "Total number of tokens in vocabulary: 8855\n" 419 | ] 420 | } 421 | ], 422 | "source": [ 423 | "# Print the total number of keys in the word2idx dictionary.\n", 424 | "print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "As you will see if you examine the code in **vocabulary.py**, the `word2idx` dictionary is created by looping over the captions in the training dataset. If a token appears no less than `vocab_threshold` times in the training set, then it is added as a key to the dictionary and assigned a corresponding unique integer. You will have the option later to amend the `vocab_threshold` argument when instantiating your data loader. Note that in general, **smaller** values for `vocab_threshold` yield a **larger** number of tokens in the vocabulary. You are encouraged to check this for yourself in the next code cell by decreasing the value of `vocab_threshold` before creating a new data loader. 
" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 10, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "name": "stdout", 441 | "output_type": "stream", 442 | "text": [ 443 | "loading annotations into memory...\n", 444 | "Done (t=0.88s)\n", 445 | "creating index...\n", 446 | "index created!\n", 447 | "[0/414113] Tokenizing captions...\n", 448 | "[100000/414113] Tokenizing captions...\n", 449 | "[200000/414113] Tokenizing captions...\n", 450 | "[300000/414113] Tokenizing captions...\n", 451 | "[400000/414113] Tokenizing captions...\n", 452 | "loading annotations into memory...\n" 453 | ] 454 | }, 455 | { 456 | "name": "stderr", 457 | "output_type": "stream", 458 | "text": [ 459 | " 0%| | 1177/414113 [00:00<01:09, 5933.36it/s]" 460 | ] 461 | }, 462 | { 463 | "name": "stdout", 464 | "output_type": "stream", 465 | "text": [ 466 | "Done (t=0.87s)\n", 467 | "creating index...\n", 468 | "index created!\n", 469 | "Obtaining caption lengths...\n" 470 | ] 471 | }, 472 | { 473 | "name": "stderr", 474 | "output_type": "stream", 475 | "text": [ 476 | "100%|██████████| 414113/414113 [01:12<00:00, 5724.08it/s]\n" 477 | ] 478 | } 479 | ], 480 | "source": [ 481 | "# Modify the minimum word count threshold.\n", 482 | "vocab_threshold = 4\n", 483 | "\n", 484 | "# Obtain the data loader.\n", 485 | "data_loader = get_loader(transform=transform_train,\n", 486 | " mode='train',\n", 487 | " batch_size=batch_size,\n", 488 | " vocab_threshold=vocab_threshold,\n", 489 | " vocab_from_file=False)" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 11, 495 | "metadata": {}, 496 | "outputs": [ 497 | { 498 | "name": "stdout", 499 | "output_type": "stream", 500 | "text": [ 501 | "Total number of tokens in vocabulary: 9955\n" 502 | ] 503 | } 504 | ], 505 | "source": [ 506 | "# Print the total number of keys in the word2idx dictionary.\n", 507 | "print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "There are also a few special keys in the `word2idx` dictionary. You are already familiar with the special start word (`\"<start>\"`) and special end word (`\"<end>\"`). There is one more special token, corresponding to unknown words (`\"<unk>\"`). All tokens that don't appear anywhere in the `word2idx` dictionary are considered unknown words. In the pre-processing step, any unknown tokens are mapped to the integer `2`." 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 12, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | "Special unknown word: <unk>\n", 527 | "All unknown words are mapped to this integer: 2\n" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "unk_word = data_loader.dataset.vocab.unk_word\n", 533 | "print('Special unknown word:', unk_word)\n", 534 | "\n", 535 | "print('All unknown words are mapped to this integer:', data_loader.dataset.vocab(unk_word))" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": {}, 541 | "source": [ 542 | "Check this for yourself below, by pre-processing the provided nonsense words that never appear in the training captions. 
" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 13, 548 | "metadata": {}, 549 | "outputs": [ 550 | { 551 | "name": "stdout", 552 | "output_type": "stream", 553 | "text": [ 554 | "2\n", 555 | "2\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "print(data_loader.dataset.vocab('jfkafejw'))\n", 561 | "print(data_loader.dataset.vocab('ieowoqjf'))" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "The final thing to mention is the `vocab_from_file` argument that is supplied when creating a data loader. To understand this argument, note that when you create a new data loader, the vocabulary (`data_loader.dataset.vocab`) is saved as a [pickle](https://docs.python.org/3/library/pickle.html) file in the project folder, with filename `vocab.pkl`.\n", 569 | "\n", 570 | "If you are still tweaking the value of the `vocab_threshold` argument, you **must** set `vocab_from_file=False` to have your changes take effect. \n", 571 | "\n", 572 | "But once you are happy with the value that you have chosen for the `vocab_threshold` argument, you need only run the data loader *one more time* with your chosen `vocab_threshold` to save the new vocabulary to file. Then, you can henceforth set `vocab_from_file=True` to load the vocabulary from file and speed the instantiation of the data loader. Note that building the vocabulary from scratch is the most time-consuming part of instantiating the data loader, and so you are strongly encouraged to set `vocab_from_file=True` as soon as you are able.\n", 573 | "\n", 574 | "Note that if `vocab_from_file=True`, then any supplied argument for `vocab_threshold` when instantiating the data loader is completely ignored." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 14, 580 | "metadata": {}, 581 | "outputs": [ 582 | { 583 | "name": "stdout", 584 | "output_type": "stream", 585 | "text": [ 586 | "Vocabulary successfully loaded from vocab.pkl file!\n", 587 | "loading annotations into memory...\n" 588 | ] 589 | }, 590 | { 591 | "name": "stderr", 592 | "output_type": "stream", 593 | "text": [ 594 | "100%|██████████| 414113/414113 [01:11<00:00, 5752.71it/s]\n" 595 | ] 596 | }, 597 | { 598 | "name": "stdout", 599 | "output_type": "stream", 600 | "text": [ 601 | "Done (t=0.93s)\n", 602 | "creating index...\n", 603 | "index created!\n", 604 | "Obtaining caption lengths...\n" 605 | ] 606 | } 607 | ], 608 | "source": [ 609 | "# Obtain the data loader (from file). Note that it runs much faster than before!\n", 610 | "data_loader = get_loader(transform=transform_train,\n", 611 | " mode='train',\n", 612 | " batch_size=batch_size,\n", 613 | " vocab_from_file=True)" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "In the next section, you will learn how to use the data loader to obtain batches of training data." 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "\n", 628 | "## Step 2: Use the Data Loader to Obtain Batches\n", 629 | "\n", 630 | "The captions in the dataset vary greatly in length. You can see this by examining `data_loader.dataset.caption_lengths`, a Python list with one entry for each training caption (where the value stores the length of the corresponding caption). \n", 631 | "\n", 632 | "In the code cell below, we use this list to print the total number of captions in the training data with each length. 
As you will see below, the majority of captions have length 10. Likewise, very short and very long captions are quite rare. " 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 15, 638 | "metadata": {}, 639 | "outputs": [ 640 | { 641 | "name": "stdout", 642 | "output_type": "stream", 643 | "text": [ 644 | "value: 10 --- count: 86334\n", 645 | "value: 11 --- count: 79948\n", 646 | "value: 9 --- count: 71934\n", 647 | "value: 12 --- count: 57637\n", 648 | "value: 13 --- count: 37645\n", 649 | "value: 14 --- count: 22335\n", 650 | "value: 8 --- count: 20771\n", 651 | "value: 15 --- count: 12841\n", 652 | "value: 16 --- count: 7729\n", 653 | "value: 17 --- count: 4842\n", 654 | "value: 18 --- count: 3104\n", 655 | "value: 19 --- count: 2014\n", 656 | "value: 7 --- count: 1597\n", 657 | "value: 20 --- count: 1451\n", 658 | "value: 21 --- count: 999\n", 659 | "value: 22 --- count: 683\n", 660 | "value: 23 --- count: 534\n", 661 | "value: 24 --- count: 383\n", 662 | "value: 25 --- count: 277\n", 663 | "value: 26 --- count: 215\n", 664 | "value: 27 --- count: 159\n", 665 | "value: 28 --- count: 115\n", 666 | "value: 29 --- count: 86\n", 667 | "value: 30 --- count: 58\n", 668 | "value: 31 --- count: 49\n", 669 | "value: 32 --- count: 44\n", 670 | "value: 34 --- count: 39\n", 671 | "value: 37 --- count: 32\n", 672 | "value: 33 --- count: 31\n", 673 | "value: 35 --- count: 31\n", 674 | "value: 36 --- count: 26\n", 675 | "value: 38 --- count: 18\n", 676 | "value: 39 --- count: 18\n", 677 | "value: 43 --- count: 16\n", 678 | "value: 44 --- count: 16\n", 679 | "value: 48 --- count: 12\n", 680 | "value: 45 --- count: 11\n", 681 | "value: 42 --- count: 10\n", 682 | "value: 40 --- count: 9\n", 683 | "value: 49 --- count: 9\n", 684 | "value: 46 --- count: 9\n", 685 | "value: 47 --- count: 7\n", 686 | "value: 50 --- count: 6\n", 687 | "value: 51 --- count: 6\n", 688 | "value: 41 --- count: 6\n", 689 | "value: 52 --- count: 5\n", 690 | "value: 54 --- count: 3\n", 691 | "value: 56 --- count: 2\n", 692 | "value: 6 --- count: 2\n", 693 | "value: 53 --- count: 2\n", 694 | "value: 55 --- count: 2\n", 695 | "value: 57 --- count: 1\n" 696 | ] 697 | } 698 | ], 699 | "source": [ 700 | "from collections import Counter\n", 701 | "\n", 702 | "# Tally the total number of training captions with each length.\n", 703 | "counter = Counter(data_loader.dataset.caption_lengths)\n", 704 | "lengths = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)\n", 705 | "for value, count in lengths:\n", 706 | " print('value: %2d --- count: %5d' % (value, count))" 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "To generate batches of training data, we begin by first sampling a caption length (where the probability that any length is drawn is proportional to the number of captions with that length in the dataset). Then, we retrieve a batch of size `batch_size` of image-caption pairs, where all captions have the sampled length. This approach for assembling batches matches the procedure in [this paper](https://arxiv.org/pdf/1502.03044.pdf) and has been shown to be computationally efficient without degrading performance.\n", 714 | "\n", 715 | "Run the code cell below to generate a batch. The `get_train_indices` method in the `CoCoDataset` class first samples a caption length, and then samples `batch_size` indices corresponding to training data points with captions of that length. 
These indices are stored below in `indices`.\n", 716 | "\n", 717 | "These indices are supplied to the data loader, which then is used to retrieve the corresponding data points. The pre-processed images and captions in the batch are stored in `images` and `captions`." 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 16, 723 | "metadata": { 724 | "scrolled": false 725 | }, 726 | "outputs": [ 727 | { 728 | "name": "stdout", 729 | "output_type": "stream", 730 | "text": [ 731 | "sampled indices: [218755, 211575, 274482, 18930, 119189, 307903, 116144, 19399, 382457, 188791]\n", 732 | "images.shape: torch.Size([10, 3, 224, 224])\n", 733 | "captions.shape: torch.Size([10, 15])\n" 734 | ] 735 | } 736 | ], 737 | "source": [ 738 | "import numpy as np\n", 739 | "import torch.utils.data as data\n", 740 | "\n", 741 | "# Randomly sample a caption length, and sample indices with that length.\n", 742 | "indices = data_loader.dataset.get_train_indices()\n", 743 | "print('sampled indices:', indices)\n", 744 | "\n", 745 | "# Create and assign a batch sampler to retrieve a batch with the sampled indices.\n", 746 | "new_sampler = data.sampler.SubsetRandomSampler(indices=indices)\n", 747 | "data_loader.batch_sampler.sampler = new_sampler\n", 748 | " \n", 749 | "# Obtain the batch.\n", 750 | "images, captions = next(iter(data_loader))\n", 751 | " \n", 752 | "print('images.shape:', images.shape)\n", 753 | "print('captions.shape:', captions.shape)\n", 754 | "\n", 755 | "# (Optional) Uncomment the lines of code below to print the pre-processed images and captions.\n", 756 | "# print('images:', images)\n", 757 | "# print('captions:', captions)" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "Each time you run the code cell above, a different caption length is sampled, and a different batch of training data is returned. Run the code cell multiple times to check this out!\n", 765 | "\n", 766 | "You will train your model in the next notebook in this sequence (**2_Training.ipynb**). This code for generating training batches will be provided to you.\n", 767 | "\n", 768 | "> Before moving to the next notebook in the sequence (**2_Training.ipynb**), you are strongly encouraged to take the time to become very familiar with the code in **data_loader.py** and **vocabulary.py**. **Step 1** and **Step 2** of this notebook are designed to help facilitate a basic introduction and guide your understanding. However, our description is not exhaustive, and it is up to you (as part of the project) to learn how to best utilize these files to complete the project. __You should NOT amend any of the code in either *data_loader.py* or *vocabulary.py*.__\n", 769 | "\n", 770 | "In the next steps, we focus on learning how to specify a CNN-RNN architecture in PyTorch, towards the goal of image captioning." 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "\n", 778 | "## Step 3: Experiment with the CNN Encoder\n", 779 | "\n", 780 | "Run the code cell below to import `EncoderCNN` and `DecoderRNN` from **model.py**. " 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": 17, 786 | "metadata": {}, 787 | "outputs": [], 788 | "source": [ 789 | "# Watch for any changes in model.py, and re-load it automatically.\n", 790 | "% load_ext autoreload\n", 791 | "% autoreload 2\n", 792 | "\n", 793 | "# Import EncoderCNN and DecoderRNN. 
\n", 794 | "from model import EncoderCNN, DecoderRNN" 795 | ] 796 | }, 797 | { 798 | "cell_type": "markdown", 799 | "metadata": {}, 800 | "source": [ 801 | "In the next code cell we define a `device` that you will use move PyTorch tensors to GPU (if CUDA is available). Run this code cell before continuing." 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": 18, 807 | "metadata": {}, 808 | "outputs": [], 809 | "source": [ 810 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "Run the code cell below to instantiate the CNN encoder in `encoder`. \n", 818 | "\n", 819 | "The pre-processed images from the batch in **Step 2** of this notebook are then passed through the encoder, and the output is stored in `features`." 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 19, 825 | "metadata": {}, 826 | "outputs": [ 827 | { 828 | "name": "stderr", 829 | "output_type": "stream", 830 | "text": [ 831 | "Downloading: \"https://download.pytorch.org/models/resnet50-19c8e357.pth\" to /root/.torch/models/resnet50-19c8e357.pth\n", 832 | "100%|██████████| 102502400/102502400 [00:04<00:00, 23112603.41it/s]\n" 833 | ] 834 | }, 835 | { 836 | "name": "stdout", 837 | "output_type": "stream", 838 | "text": [ 839 | "type(features): \n", 840 | "features.shape: torch.Size([10, 256])\n" 841 | ] 842 | } 843 | ], 844 | "source": [ 845 | "# Specify the dimensionality of the image embedding.\n", 846 | "embed_size = 256\n", 847 | "\n", 848 | "#-#-#-# Do NOT modify the code below this line. #-#-#-#\n", 849 | "\n", 850 | "# Initialize the encoder. (Optional: Add additional arguments if necessary.)\n", 851 | "encoder = EncoderCNN(embed_size)\n", 852 | "\n", 853 | "# Move the encoder to GPU if CUDA is available.\n", 854 | "encoder.to(device)\n", 855 | " \n", 856 | "# Move last batch of images (from Step 2) to GPU if CUDA is available. \n", 857 | "images = images.to(device)\n", 858 | "\n", 859 | "# Pass the images through the encoder.\n", 860 | "features = encoder(images)\n", 861 | "\n", 862 | "print('type(features):', type(features))\n", 863 | "print('features.shape:', features.shape)\n", 864 | "\n", 865 | "# Check that your encoder satisfies some requirements of the project! :D\n", 866 | "assert type(features)==torch.Tensor, \"Encoder output needs to be a PyTorch Tensor.\" \n", 867 | "assert (features.shape[0]==batch_size) & (features.shape[1]==embed_size), \"The shape of the encoder output is incorrect.\"" 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "The encoder that we provide to you uses the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector, before being passed through a `Linear` layer to transform the feature vector to have the same size as the word embedding.\n", 875 | "\n", 876 | "![Encoder](images/encoder.png)\n", 877 | "\n", 878 | "You are welcome (and encouraged) to amend the encoder in **model.py**, to experiment with other architectures. In particular, consider using a [different pre-trained model architecture](http://pytorch.org/docs/master/torchvision/models.html). You may also like to [add batch normalization](http://pytorch.org/docs/master/nn.html#normalization-layers). 
\n", 879 | "\n", 880 | "> You are **not** required to change anything about the encoder.\n", 881 | "\n", 882 | "For this project, you **must** incorporate a pre-trained CNN into your encoder. Your `EncoderCNN` class must take `embed_size` as an input argument, which will also correspond to the dimensionality of the input to the RNN decoder that you will implement in Step 4. When you train your model in the next notebook in this sequence (**2_Training.ipynb**), you are welcome to tweak the value of `embed_size`.\n", 883 | "\n", 884 | "If you decide to modify the `EncoderCNN` class, save **model.py** and re-execute the code cell above. If the code cell returns an assertion error, then please follow the instructions to modify your code before proceeding. The assert statements ensure that `features` is a PyTorch tensor with shape `[batch_size, embed_size]`." 885 | ] 886 | }, 887 | { 888 | "cell_type": "markdown", 889 | "metadata": {}, 890 | "source": [ 891 | "\n", 892 | "## Step 4: Implement the RNN Decoder\n", 893 | "\n", 894 | "Before executing the next code cell, you must write `__init__` and `forward` methods in the `DecoderRNN` class in **model.py**. (Do **not** write the `sample` method yet - you will work with this method when you reach **3_Inference.ipynb**.)\n", 895 | "\n", 896 | "> The `__init__` and `forward` methods in the `DecoderRNN` class are the only things that you **need** to modify as part of this notebook. You will write more implementations in the notebooks that appear later in the sequence.\n", 897 | "\n", 898 | "Your decoder will be an instance of the `DecoderRNN` class and must accept as input:\n", 899 | "- the PyTorch tensor `features` containing the embedded image features (outputted in Step 3, when the last batch of images from Step 2 was passed through `encoder`), along with\n", 900 | "- a PyTorch tensor corresponding to the last batch of captions (`captions`) from Step 2.\n", 901 | "\n", 902 | "Note that the way we have written the data loader should simplify your code a bit. In particular, every training batch will contain pre-processed captions where all have the same length (`captions.shape[1]`), so **you do not need to worry about padding**. \n", 903 | "> While you are encouraged to implement the decoder described in [this paper](https://arxiv.org/pdf/1411.4555.pdf), you are welcome to implement any architecture of your choosing, as long as it uses at least one RNN layer, with hidden dimension `hidden_size`. \n", 904 | "\n", 905 | "Although you will test the decoder using the last batch that is currently stored in the notebook, your decoder should be written to accept an arbitrary batch (of embedded image features and pre-processed captions [where all captions have the same length]) as input. \n", 906 | "\n", 907 | "![Decoder](images/decoder.png)\n", 908 | "\n", 909 | "In the code cell below, `outputs` should be a PyTorch tensor with size `[batch_size, captions.shape[1], vocab_size]`. Your output should be designed such that `outputs[i,j,k]` contains the model's predicted score, indicating how likely the `j`-th token in the `i`-th caption in the batch is the `k`-th token in the vocabulary. In the next notebook of the sequence (**2_Training.ipynb**), we provide code to supply these scores to the [`torch.nn.CrossEntropyLoss`](http://pytorch.org/docs/master/nn.html#torch.nn.CrossEntropyLoss) optimizer in PyTorch." 
910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 20, 915 | "metadata": {}, 916 | "outputs": [ 917 | { 918 | "name": "stdout", 919 | "output_type": "stream", 920 | "text": [ 921 | "type(outputs): \n", 922 | "outputs.shape: torch.Size([10, 15, 9955])\n" 923 | ] 924 | } 925 | ], 926 | "source": [ 927 | "# Specify the number of features in the hidden state of the RNN decoder.\n", 928 | "hidden_size = 512\n", 929 | "\n", 930 | "#-#-#-# Do NOT modify the code below this line. #-#-#-#\n", 931 | "\n", 932 | "# Store the size of the vocabulary.\n", 933 | "vocab_size = len(data_loader.dataset.vocab)\n", 934 | "\n", 935 | "# Initialize the decoder.\n", 936 | "decoder = DecoderRNN(embed_size, hidden_size, vocab_size)\n", 937 | "\n", 938 | "# Move the decoder to GPU if CUDA is available.\n", 939 | "decoder.to(device)\n", 940 | " \n", 941 | "# Move last batch of captions (from Step 1) to GPU if CUDA is available \n", 942 | "captions = captions.to(device)\n", 943 | "\n", 944 | "# Pass the encoder output and captions through the decoder.\n", 945 | "outputs = decoder(features, captions)\n", 946 | "\n", 947 | "print('type(outputs):', type(outputs))\n", 948 | "print('outputs.shape:', outputs.shape)\n", 949 | "\n", 950 | "# Check that your decoder satisfies some requirements of the project! :D\n", 951 | "assert type(outputs)==torch.Tensor, \"Decoder output needs to be a PyTorch Tensor.\"\n", 952 | "assert (outputs.shape[0]==batch_size) & (outputs.shape[1]==captions.shape[1]) & (outputs.shape[2]==vocab_size), \"The shape of the decoder output is incorrect.\"" 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": {}, 958 | "source": [ 959 | "When you train your model in the next notebook in this sequence (**2_Training.ipynb**), you are welcome to tweak the value of `hidden_size`." 960 | ] 961 | } 962 | ], 963 | "metadata": { 964 | "anaconda-cloud": {}, 965 | "kernelspec": { 966 | "display_name": "Python 3", 967 | "language": "python", 968 | "name": "python3" 969 | }, 970 | "language_info": { 971 | "codemirror_mode": { 972 | "name": "ipython", 973 | "version": 3 974 | }, 975 | "file_extension": ".py", 976 | "mimetype": "text/x-python", 977 | "name": "python", 978 | "nbconvert_exporter": "python", 979 | "pygments_lexer": "ipython3", 980 | "version": "3.6.3" 981 | } 982 | }, 983 | "nbformat": 4, 984 | "nbformat_minor": 2 985 | } 986 | -------------------------------------------------------------------------------- /2_Training (2).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Computer Vision Nanodegree\n", 8 | "\n", 9 | "## Project: Image Captioning\n", 10 | "\n", 11 | "---\n", 12 | "\n", 13 | "In this notebook, you will train your CNN-RNN model. \n", 14 | "\n", 15 | "You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.\n", 16 | "\n", 17 | "This does have the potential to make the project quite messy! Before submitting your project, make sure that you clean up:\n", 18 | "- the code you write in this notebook. The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters. You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook. \n", 19 | "- the output of the code cell in **Step 2**. 
The output should show the output obtained when training the model from scratch.\n", 20 | "\n", 21 | "This notebook **will be graded**. \n", 22 | "\n", 23 | "Feel free to use the links below to navigate the notebook:\n", 24 | "- [Step 1](#step1): Training Setup\n", 25 | "- [Step 2](#step2): Train your Model\n", 26 | "- [Step 3](#step3): (Optional) Validate your Model" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "\n", 41 | "## Step 1: Training Setup\n", 42 | "\n", 43 | "In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure. The values you set now will be used when training your model in **Step 2** below.\n", 44 | "\n", 45 | "You should only amend blocks of code that are preceded by a `TODO` statement. **Any code blocks that are not preceded by a `TODO` statement should not be modified**.\n", 46 | "\n", 47 | "### Task #1\n", 48 | "\n", 49 | "Begin by setting the following variables:\n", 50 | "- `batch_size` - the batch size of each training batch. It is the number of image-caption pairs used to amend the model weights in each training step. \n", 51 | "- `vocab_threshold` - the minimum word count threshold. Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary. \n", 52 | "- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. \n", 53 | "- `embed_size` - the dimensionality of the image and word embeddings. \n", 54 | "- `hidden_size` - the number of features in the hidden state of the RNN decoder. \n", 55 | "- `num_epochs` - the number of epochs to train the model. We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish. [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours! (_But of course, if you want your model to compete with current research, you will have to train for much longer._)\n", 56 | "- `save_every` - determines how often to save the model weights. We recommend that you set `save_every=1`, to save the model weights after each epoch. This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.\n", 57 | "- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training. Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected! You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.\n", 58 | "- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.\n", 59 | "\n", 60 | "If you're not sure where to begin to set some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance! 
**To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best. Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**). If you are unhappy with your performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.\n", 61 | "\n", 62 | "### Question 1\n", 63 | "\n", 64 | "**Question:** Describe your CNN-RNN architecture in detail. With this architecture in mind, how did you select the values of the variables in Task 1? If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.\n", 65 | "\n", 66 | "**Answer:** The decoder (`DecoderRNN`) takes the size of the embedding (`embed_size`), the number of nodes in the hidden layer (`hidden_size`), the size of the vocabulary or output size (`vocab_size`), and the number of layers (`num_layers`) as initialization arguments. In the `__init__` method, after calling the superclass constructor, I store the hidden size, vocabulary size, and embedding layer, and I initialize the RNN (an LSTM) in the same way as provided in the earlier notebooks in this course: \n", 67 | " self.lstm = nn.LSTM(input_size = embed_size, \n", 68 | " hidden_size = hidden_size, \n", 69 | " num_layers = num_layers,\n", 70 | " batch_first = True)\n", 71 | "The output of the LSTM is then passed through a final fully-connected layer, Linear(in_features=512, out_features=9955, bias=True), which maps each hidden state to scores over the vocabulary.\n", 72 | "The forward method receives the tensor of embedded image features taken from the encoder (`features`) and the tensor corresponding to the last batch of captions (`captions`). Since the captions have shape torch.Size([10, 15]), I drop the last column to obtain \"torch.Size([10, 14])\", look up the caption embeddings, and concatenate the image features with these embeddings so that the sequence length is 15 again. The sequence is then run through the LSTM layer and finally through the linear layer, whose output has outputs.shape: torch.Size([10, 15, 9955]).\n", 73 | "\n", 74 | "In Task 1, I selected these variables:\n", 75 | "batch_size = 128 # batch size\n", 76 | "vocab_threshold = 5 # minimum word count threshold\n", 77 | "vocab_from_file = False # if True, load existing vocab file\n", 78 | "embed_size = 300 # dimensionality of image and word embeddings\n", 79 | "hidden_size = 512 \n", 80 | "\n", 81 | "I initially considered a batch size of 32, but training turned out to be time-consuming, so I increased the batch size to 128; since 32, 64, and 128 are commonly used batch sizes, I chose the largest of the three. The vocabulary threshold, hidden size, and embedding size were chosen on the basis of the Udacity video lectures.\n", 82 | "\n", 83 | "In the `sample` method, I initialize an empty list for the predictions, run the LSTM output through the linear layer, remove the single-dimensional entries from the shape of the array, take the word with the maximum probability, append the result to the list, and finally update the input with the predicted word.\n", 84 | "\n", 85 | "Yes, I read (https://arxiv.org/pdf/1502.03044.pdf) and (https://arxiv.org/pdf/1411.4555.pdf) for useful guidance.\n", 86 | "\n", 87 | "\n", 88 | "\n", 89 | "\n", 90 | "### (Optional) Task #2\n", 91 | "\n", 92 | "Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish. 
When modifying this transform, keep in mind that:\n", 93 | "- the images in the dataset have varying heights and widths, and \n", 94 | "- if using a pre-trained model, you must perform the corresponding appropriate normalization.\n", 95 | "\n", 96 | "### Question 2\n", 97 | "\n", 98 | "**Question:** How did you select the transform in `transform_train`? If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?\n", 99 | "\n", 100 | "**Answer:** I left the transform at its provided value because the images in the dataset have varying heights and widths, and the provided transform already resizes and crops them to a fixed size and applies the normalization expected by the pre-trained model. I spent a good amount of time checking whether the model would perform better with a different transform, including looking through various research papers on image captioning, and ended up content with the given transform, since it fit my model well.\n", 101 | "\n", 102 | "### Task #3\n", 103 | "\n", 104 | "Next, you will specify a Python list containing the learnable parameters of the model. For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:\n", 105 | "```\n", 106 | "params = list(decoder.parameters()) + list(encoder.embed.parameters()) \n", 107 | "```\n", 108 | "\n", 109 | "### Question 3\n", 110 | "\n", 111 | "**Question:** How did you select the trainable parameters of your architecture? Why do you think this is a good choice?\n", 112 | "\n", 113 | "**Answer:** In this image captioning model, all of the weights in the decoder need to be trainable, while in the encoder only the weights in the embedding layer need to be trained, because the rest of the encoder is a pre-trained network whose weights are not updated. I therefore set `params = list(decoder.parameters()) + list(encoder.embed.parameters())`, which covers exactly the parameters that should be learned.\n", 114 | "\n", 115 | "### Task #4\n", 116 | "\n", 117 | "Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).\n", 118 | "\n", 119 | "### Question 4\n", 120 | "\n", 121 | "**Question:** How did you select the optimizer used to train your model?\n", 122 | "\n", 123 | "**Answer:** I selected torch.optim.Adam, which implements the Adam algorithm. Its main arguments are:\n", 124 | "params (iterable) – iterable of parameters to optimize or dicts defining parameter groups\n", 125 | "lr (float, optional) – learning rate \n", 126 | "\n", 127 | "After rigorously training the model, I found that Adam worked better than the other optimizer algorithms I tried, hence I chose it. I started with the commonly used learning rate of 0.001, and it seemed to work well even after many epochs, therefore I did not alter it."
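For reference, the greedy `sample` loop described in the Answer to Question 1 above could be sketched roughly as follows (a hypothetical illustration, not the graded code in **model.py**; `self.lstm`, `self.linear`, and `self.embed` are the decoder layers defined in `__init__`):

```python
def sample(self, inputs, states=None, max_len=20):
    # inputs: pre-processed image feature of shape [1, 1, embed_size]
    predicted = []
    for _ in range(max_len):
        hiddens, states = self.lstm(inputs, states)   # [1, 1, hidden_size]
        scores = self.linear(hiddens.squeeze(1))      # [1, vocab_size]
        _, word_idx = scores.max(dim=1)               # index of the most probable next word
        predicted.append(word_idx.item())
        inputs = self.embed(word_idx).unsqueeze(1)    # feed the prediction back in as the next input
    return predicted
```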
128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 1, 133 | "metadata": { 134 | "scrolled": true 135 | }, 136 | "outputs": [ 137 | { 138 | "name": "stdout", 139 | "output_type": "stream", 140 | "text": [ 141 | "loading annotations into memory...\n", 142 | "Done (t=0.92s)\n", 143 | "creating index...\n", 144 | "index created!\n", 145 | "[0/414113] Tokenizing captions...\n", 146 | "[100000/414113] Tokenizing captions...\n", 147 | "[200000/414113] Tokenizing captions...\n", 148 | "[300000/414113] Tokenizing captions...\n", 149 | "[400000/414113] Tokenizing captions...\n", 150 | "loading annotations into memory...\n", 151 | "Done (t=0.94s)\n", 152 | "creating index...\n" 153 | ] 154 | }, 155 | { 156 | "name": "stderr", 157 | "output_type": "stream", 158 | "text": [ 159 | " 0%| | 1160/414113 [00:00<01:11, 5814.55it/s]" 160 | ] 161 | }, 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "index created!\n", 167 | "Obtaining caption lengths...\n" 168 | ] 169 | }, 170 | { 171 | "name": "stderr", 172 | "output_type": "stream", 173 | "text": [ 174 | "100%|██████████| 414113/414113 [01:12<00:00, 5741.15it/s]\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "import torch\n", 180 | "import torch.nn as nn\n", 181 | "from torchvision import transforms\n", 182 | "import sys\n", 183 | "sys.path.append('/opt/cocoapi/PythonAPI')\n", 184 | "from pycocotools.coco import COCO\n", 185 | "from data_loader import get_loader\n", 186 | "from model import EncoderCNN, DecoderRNN\n", 187 | "import math\n", 188 | "\n", 189 | "\n", 190 | "## TODO #1: Select appropriate values for the Python variables below.\n", 191 | "batch_size = 128 # batch size\n", 192 | "vocab_threshold = 5 # minimum word count threshold\n", 193 | "vocab_from_file = False # if True, load existing vocab file\n", 194 | "embed_size = 300 # dimensionality of image and word embeddings\n", 195 | "hidden_size = 512 # number of features in hidden state of the RNN decoder\n", 196 | "num_epochs = 3 # number of training epochs\n", 197 | "save_every = 1 # determines frequency of saving model weights\n", 198 | "print_every = 100 # determines window for printing average loss\n", 199 | "log_file = 'training_log.txt' # name of file with saved training loss and perplexity\n", 200 | "\n", 201 | "# (Optional) TODO #2: Amend the image transform below.\n", 202 | "transform_train = transforms.Compose([ \n", 203 | " transforms.Resize(256), # smaller edge of image resized to 256\n", 204 | " transforms.RandomCrop(224), # get 224x224 crop from random location\n", 205 | " transforms.RandomHorizontalFlip(), # horizontally flip image with probability=0.5\n", 206 | " transforms.ToTensor(), # convert the PIL Image to a tensor\n", 207 | " transforms.Normalize((0.485, 0.456, 0.406), # normalize image for pre-trained model\n", 208 | " (0.229, 0.224, 0.225))])\n", 209 | "\n", 210 | "# Build data loader.\n", 211 | "data_loader = get_loader(transform=transform_train,\n", 212 | " mode='train',\n", 213 | " batch_size=batch_size,\n", 214 | " vocab_threshold=vocab_threshold,\n", 215 | " vocab_from_file=vocab_from_file)\n", 216 | "\n", 217 | "# The size of the vocabulary.\n", 218 | "vocab_size = len(data_loader.dataset.vocab)\n", 219 | "\n", 220 | "# Initialize the encoder and decoder. \n", 221 | "encoder = EncoderCNN(embed_size)\n", 222 | "decoder = DecoderRNN(embed_size, hidden_size, vocab_size)\n", 223 | "\n", 224 | "# Move models to GPU if CUDA is available. 
\n", 225 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 226 | "encoder.to(device)\n", 227 | "decoder.to(device)\n", 228 | "\n", 229 | "# Define the loss function. \n", 230 | "criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()\n", 231 | "\n", 232 | "# TODO #3: Specify the learnable parameters of the model.\n", 233 | "params = list(decoder.parameters()) + list(encoder.embed.parameters())\n", 234 | "\n", 235 | "# TODO #4: Define the optimizer.\n", 236 | "optimizer = torch.optim.Adam(params = params, lr = 0.001)\n", 237 | "\n", 238 | "# Set the total number of training steps per epoch.\n", 239 | "total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "\n", 247 | "## Step 2: Train your Model\n", 248 | "\n", 249 | "Once you have executed the code cell in **Step 1**, the training procedure below should run without issue. \n", 250 | "\n", 251 | "It is completely fine to leave the code cell below as-is without modifications to train your model. However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer. In other words, make sure to provide appropriate comments to describe how your code works! \n", 252 | "\n", 253 | "You may find it useful to load saved weights to resume training. In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`). Then you can load the weights by using the lines below:\n", 254 | "\n", 255 | "```python\n", 256 | "# Load pre-trained weights before resuming training.\n", 257 | "encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))\n", 258 | "decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))\n", 259 | "```\n", 260 | "\n", 261 | "While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs. In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).\n", 262 | "\n", 263 | "### A Note on Tuning Hyperparameters\n", 264 | "\n", 265 | "To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information. \n", 266 | "\n", 267 | "However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models. \n", 268 | "\n", 269 | "For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data. For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data. 
If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.\n", 270 | "\n", 271 | "That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636). In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset." 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 2, 277 | "metadata": { 278 | "scrolled": true 279 | }, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "Epoch [1/3], Step [100/3236], Loss: 3.6855, Perplexity: 39.8648\n", 286 | "Epoch [1/3], Step [200/3236], Loss: 3.1835, Perplexity: 24.13045\n", 287 | "Epoch [1/3], Step [300/3236], Loss: 3.1520, Perplexity: 23.3818\n", 288 | "Epoch [1/3], Step [400/3236], Loss: 2.9596, Perplexity: 19.2901\n", 289 | "Epoch [1/3], Step [500/3236], Loss: 2.7796, Perplexity: 16.1129\n", 290 | "Epoch [1/3], Step [600/3236], Loss: 3.1020, Perplexity: 22.2421\n", 291 | "Epoch [1/3], Step [700/3236], Loss: 2.9611, Perplexity: 19.3190\n", 292 | "Epoch [1/3], Step [800/3236], Loss: 2.6473, Perplexity: 14.1161\n", 293 | "Epoch [1/3], Step [900/3236], Loss: 2.7564, Perplexity: 15.7423\n", 294 | "Epoch [1/3], Step [1000/3236], Loss: 2.7760, Perplexity: 16.0544\n", 295 | "Epoch [1/3], Step [1100/3236], Loss: 2.5450, Perplexity: 12.7434\n", 296 | "Epoch [1/3], Step [1200/3236], Loss: 2.6158, Perplexity: 13.6777\n", 297 | "Epoch [1/3], Step [1300/3236], Loss: 2.3420, Perplexity: 10.4015\n", 298 | "Epoch [1/3], Step [1400/3236], Loss: 2.3720, Perplexity: 10.7193\n", 299 | "Epoch [1/3], Step [1500/3236], Loss: 2.3591, Perplexity: 10.5819\n", 300 | "Epoch [1/3], Step [1600/3236], Loss: 2.3895, Perplexity: 10.9082\n", 301 | "Epoch [1/3], Step [1700/3236], Loss: 2.3000, Perplexity: 9.97432\n", 302 | "Epoch [1/3], Step [1800/3236], Loss: 2.4111, Perplexity: 11.1460\n", 303 | "Epoch [1/3], Step [1900/3236], Loss: 2.3256, Perplexity: 10.2331\n", 304 | "Epoch [1/3], Step [2000/3236], Loss: 2.5036, Perplexity: 12.2266\n", 305 | "Epoch [1/3], Step [2100/3236], Loss: 2.2665, Perplexity: 9.64527\n", 306 | "Epoch [1/3], Step [2200/3236], Loss: 2.2322, Perplexity: 9.32052\n", 307 | "Epoch [1/3], Step [2300/3236], Loss: 2.2872, Perplexity: 9.84691\n", 308 | "Epoch [1/3], Step [2400/3236], Loss: 2.3724, Perplexity: 10.7233\n", 309 | "Epoch [1/3], Step [2500/3236], Loss: 2.1894, Perplexity: 8.92966\n", 310 | "Epoch [1/3], Step [2600/3236], Loss: 2.1231, Perplexity: 8.35730\n", 311 | "Epoch [1/3], Step [2700/3236], Loss: 2.2506, Perplexity: 9.49368\n", 312 | "Epoch [1/3], Step [2800/3236], Loss: 2.3235, Perplexity: 10.2115\n", 313 | "Epoch [1/3], Step [2900/3236], Loss: 2.0984, Perplexity: 8.15301\n", 314 | "Epoch [1/3], Step [3000/3236], Loss: 2.1378, Perplexity: 8.48080\n", 315 | "Epoch [1/3], Step [3100/3236], Loss: 2.0625, Perplexity: 7.86565\n", 316 | "Epoch [1/3], Step [3200/3236], Loss: 2.1923, Perplexity: 8.95542\n", 317 | "Epoch [2/3], Step [100/3236], Loss: 2.2530, Perplexity: 9.515804\n", 318 | "Epoch [2/3], Step [200/3236], Loss: 2.1474, Perplexity: 8.56279\n", 319 | "Epoch [2/3], Step [300/3236], Loss: 2.2944, Perplexity: 9.91867\n", 320 | "Epoch [2/3], Step [400/3236], Loss: 2.0238, Perplexity: 7.56727\n", 321 | "Epoch [2/3], Step [500/3236], Loss: 2.0367, Perplexity: 7.66512\n", 322 | "Epoch 
[2/3], Step [600/3236], Loss: 2.4124, Perplexity: 11.1607\n", 323 | "Epoch [2/3], Step [700/3236], Loss: 2.0442, Perplexity: 7.72290\n", 324 | "Epoch [2/3], Step [800/3236], Loss: 2.1923, Perplexity: 8.95561\n", 325 | "Epoch [2/3], Step [900/3236], Loss: 2.4206, Perplexity: 11.2526\n", 326 | "Epoch [2/3], Step [1000/3236], Loss: 2.0631, Perplexity: 7.8703\n", 327 | "Epoch [2/3], Step [1100/3236], Loss: 2.1664, Perplexity: 8.72686\n", 328 | "Epoch [2/3], Step [1200/3236], Loss: 1.9771, Perplexity: 7.22205\n", 329 | "Epoch [2/3], Step [1300/3236], Loss: 3.0816, Perplexity: 21.7926\n", 330 | "Epoch [2/3], Step [1400/3236], Loss: 1.9668, Perplexity: 7.14784\n", 331 | "Epoch [2/3], Step [1500/3236], Loss: 2.2333, Perplexity: 9.33065\n", 332 | "Epoch [2/3], Step [1600/3236], Loss: 3.6297, Perplexity: 37.7023\n", 333 | "Epoch [2/3], Step [1700/3236], Loss: 2.1417, Perplexity: 8.51361\n", 334 | "Epoch [2/3], Step [1800/3236], Loss: 2.1379, Perplexity: 8.48205\n", 335 | "Epoch [2/3], Step [1900/3236], Loss: 1.8903, Perplexity: 6.62108\n", 336 | "Epoch [2/3], Step [2000/3236], Loss: 2.2316, Perplexity: 9.31480\n", 337 | "Epoch [2/3], Step [2100/3236], Loss: 2.2831, Perplexity: 9.80692\n", 338 | "Epoch [2/3], Step [2200/3236], Loss: 2.0057, Perplexity: 7.43098\n", 339 | "Epoch [2/3], Step [2300/3236], Loss: 2.0166, Perplexity: 7.51291\n", 340 | "Epoch [2/3], Step [2400/3236], Loss: 3.0270, Perplexity: 20.6343\n", 341 | "Epoch [2/3], Step [2500/3236], Loss: 2.0671, Perplexity: 7.90159\n", 342 | "Epoch [2/3], Step [2600/3236], Loss: 1.9868, Perplexity: 7.29211\n", 343 | "Epoch [2/3], Step [2700/3236], Loss: 2.0208, Perplexity: 7.54424\n", 344 | "Epoch [2/3], Step [2800/3236], Loss: 2.1385, Perplexity: 8.48640\n", 345 | "Epoch [2/3], Step [2900/3236], Loss: 2.1091, Perplexity: 8.24102\n", 346 | "Epoch [2/3], Step [3000/3236], Loss: 1.9412, Perplexity: 6.96707\n", 347 | "Epoch [2/3], Step [3100/3236], Loss: 1.9900, Perplexity: 7.31524\n", 348 | "Epoch [2/3], Step [3200/3236], Loss: 2.0261, Perplexity: 7.58418\n", 349 | "Epoch [3/3], Step [100/3236], Loss: 2.0579, Perplexity: 7.829903\n", 350 | "Epoch [3/3], Step [200/3236], Loss: 1.9523, Perplexity: 7.04482\n", 351 | "Epoch [3/3], Step [300/3236], Loss: 2.0857, Perplexity: 8.04990\n", 352 | "Epoch [3/3], Step [400/3236], Loss: 2.1414, Perplexity: 8.51176\n", 353 | "Epoch [3/3], Step [500/3236], Loss: 1.9460, Perplexity: 7.00066\n", 354 | "Epoch [3/3], Step [600/3236], Loss: 1.8885, Perplexity: 6.60955\n", 355 | "Epoch [3/3], Step [800/3236], Loss: 2.2593, Perplexity: 9.57692\n", 356 | "Epoch [3/3], Step [900/3236], Loss: 2.0536, Perplexity: 7.79562\n", 357 | "Epoch [3/3], Step [1000/3236], Loss: 2.0305, Perplexity: 7.6182\n", 358 | "Epoch [3/3], Step [1100/3236], Loss: 2.0136, Perplexity: 7.49038\n", 359 | "Epoch [3/3], Step [1200/3236], Loss: 2.6828, Perplexity: 14.6257\n", 360 | "Epoch [3/3], Step [1300/3236], Loss: 1.9822, Perplexity: 7.25877\n", 361 | "Epoch [3/3], Step [1400/3236], Loss: 2.1597, Perplexity: 8.66869\n", 362 | "Epoch [3/3], Step [1500/3236], Loss: 1.8902, Perplexity: 6.62067\n", 363 | "Epoch [3/3], Step [1600/3236], Loss: 1.9092, Perplexity: 6.74750\n", 364 | "Epoch [3/3], Step [1700/3236], Loss: 1.9613, Perplexity: 7.10833\n", 365 | "Epoch [3/3], Step [1800/3236], Loss: 1.8610, Perplexity: 6.43012\n", 366 | "Epoch [3/3], Step [1900/3236], Loss: 1.8117, Perplexity: 6.12101\n", 367 | "Epoch [3/3], Step [2000/3236], Loss: 1.9335, Perplexity: 6.91389\n", 368 | "Epoch [3/3], Step [2100/3236], Loss: 1.7751, Perplexity: 5.90098\n", 369 | 
"Epoch [3/3], Step [2200/3236], Loss: 2.0858, Perplexity: 8.05076\n", 370 | "Epoch [3/3], Step [2300/3236], Loss: 2.1024, Perplexity: 8.18552\n", 371 | "Epoch [3/3], Step [2400/3236], Loss: 1.8319, Perplexity: 6.24593\n", 372 | "Epoch [3/3], Step [2500/3236], Loss: 2.1907, Perplexity: 8.94144\n", 373 | "Epoch [3/3], Step [2600/3236], Loss: 1.8778, Perplexity: 6.53925\n", 374 | "Epoch [3/3], Step [2700/3236], Loss: 1.9238, Perplexity: 6.84671\n", 375 | "Epoch [3/3], Step [2800/3236], Loss: 1.8410, Perplexity: 6.30269\n", 376 | "Epoch [3/3], Step [2900/3236], Loss: 1.9802, Perplexity: 7.243967\n", 377 | "Epoch [3/3], Step [3000/3236], Loss: 2.2119, Perplexity: 9.13334\n", 378 | "Epoch [3/3], Step [3100/3236], Loss: 1.9330, Perplexity: 6.91039\n", 379 | "Epoch [3/3], Step [3200/3236], Loss: 2.0320, Perplexity: 7.62924\n", 380 | "Epoch [3/3], Step [3236/3236], Loss: 2.4982, Perplexity: 12.1601" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "import torch.utils.data as data\n", 386 | "import numpy as np\n", 387 | "import os\n", 388 | "import requests\n", 389 | "import time\n", 390 | "\n", 391 | "# Open the training log file.\n", 392 | "f = open(log_file, 'w')\n", 393 | "\n", 394 | "old_time = time.time()\n", 395 | "response = requests.request(\"GET\", \n", 396 | " \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token\", \n", 397 | " headers={\"Metadata-Flavor\":\"Google\"})\n", 398 | "\n", 399 | "for epoch in range(1, num_epochs+1):\n", 400 | " \n", 401 | " for i_step in range(1, total_step+1):\n", 402 | " \n", 403 | " if time.time() - old_time > 60:\n", 404 | " old_time = time.time()\n", 405 | " requests.request(\"POST\", \n", 406 | " \"https://nebula.udacity.com/api/v1/remote/keep-alive\", \n", 407 | " headers={'Authorization': \"STAR \" + response.text})\n", 408 | " \n", 409 | " # Randomly sample a caption length, and sample indices with that length.\n", 410 | " indices = data_loader.dataset.get_train_indices()\n", 411 | " # Create and assign a batch sampler to retrieve a batch with the sampled indices.\n", 412 | " new_sampler = data.sampler.SubsetRandomSampler(indices=indices)\n", 413 | " data_loader.batch_sampler.sampler = new_sampler\n", 414 | " \n", 415 | " # Obtain the batch.\n", 416 | " images, captions = next(iter(data_loader))\n", 417 | "\n", 418 | " # Move batch of images and captions to GPU if CUDA is available.\n", 419 | " images = images.to(device)\n", 420 | " captions = captions.to(device)\n", 421 | " \n", 422 | " # Zero the gradients.\n", 423 | " decoder.zero_grad()\n", 424 | " encoder.zero_grad()\n", 425 | " \n", 426 | " # Pass the inputs through the CNN-RNN model.\n", 427 | " features = encoder(images)\n", 428 | " outputs = decoder(features, captions)\n", 429 | " \n", 430 | " # Calculate the batch loss.\n", 431 | " loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))\n", 432 | " \n", 433 | " # Backward pass.\n", 434 | " loss.backward()\n", 435 | " \n", 436 | " # Update the parameters in the optimizer.\n", 437 | " optimizer.step()\n", 438 | " \n", 439 | " # Get training statistics.\n", 440 | " stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))\n", 441 | " \n", 442 | " # Print training statistics (on same line).\n", 443 | " print('\\r' + stats, end=\"\")\n", 444 | " sys.stdout.flush()\n", 445 | " \n", 446 | " # Print training statistics to file.\n", 447 | " f.write(stats + '\\n')\n", 448 | " f.flush()\n", 449 | " \n", 450 | " # Print 
training statistics (on different line).\n", 451 | " if i_step % print_every == 0:\n", 452 | " print('\\r' + stats)\n", 453 | " \n", 454 | " # Save the weights.\n", 455 | " if epoch % save_every == 0:\n", 456 | " torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))\n", 457 | " torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))\n", 458 | "\n", 459 | "# Close the training log file.\n", 460 | "f.close()" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "\n", 468 | "## Step 3: (Optional) Validate your Model\n", 469 | "\n", 470 | "To assess potential overfitting, one approach is to assess performance on a validation set. If you decide to do this **optional** task, you are required to first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions. That code will prove incredibly useful here. \n", 471 | "\n", 472 | "If you decide to validate your model, please do not edit the data loader in **data_loader.py**. Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data. You can access:\n", 473 | "- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and\n", 474 | "- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.\n", 475 | "\n", 476 | "The suggested approach to validating your model involves creating a json file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images. Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model. You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf). For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset." 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 3, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "# (Optional) TODO: Validate your model."
486 | ] 487 | } 488 | ], 489 | "metadata": { 490 | "anaconda-cloud": {}, 491 | "kernelspec": { 492 | "display_name": "Python 3", 493 | "language": "python", 494 | "name": "python3" 495 | }, 496 | "language_info": { 497 | "codemirror_mode": { 498 | "name": "ipython", 499 | "version": 3 500 | }, 501 | "file_extension": ".py", 502 | "mimetype": "text/x-python", 503 | "name": "python", 504 | "nbconvert_exporter": "python", 505 | "pygments_lexer": "ipython3", 506 | "version": "3.6.3" 507 | } 508 | }, 509 | "nbformat": 4, 510 | "nbformat_minor": 2 511 | } 512 | -------------------------------------------------------------------------------- /Images/decoder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Garima13a/Automatic-Image-Captioning/65e475248eadc4fe85e1f108d5ac7502f24113ae/Images/decoder.png -------------------------------------------------------------------------------- /Images/download_ex.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Garima13a/Automatic-Image-Captioning/65e475248eadc4fe85e1f108d5ac7502f24113ae/Images/download_ex.png -------------------------------------------------------------------------------- /Images/encoder-decoder.png.crdownload: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Garima13a/Automatic-Image-Captioning/65e475248eadc4fe85e1f108d5ac7502f24113ae/Images/encoder-decoder.png.crdownload -------------------------------------------------------------------------------- /Images/image: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Garima Nishad 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Automatic-Image-Captioning 2 | In this project, I have created a neural network architecture to automatically generate captions from images. After using the Microsoft Common Objects in COntext (MS COCO) dataset to train my network, I have tested my network on novel images! 
3 | 4 | How to run: 5 | 6 | 1. Run the 0_Dataset notebook. 7 | 2. Then run notebooks 1, 2 and 3 in order. 8 | 3. model.py contains the CNN-RNN model architecture, which you can modify if you want to experiment. 9 | 4. data_loader.py loads the data. 10 | 11 | Please refer to my blog to know more: https://blog.goodaudience.com/automatic-image-captioning-building-an-image-caption-generator-from-scratch-4bdd8744bc38?sk=0cd300a528a02be0596f3e042a82f7a4 12 | 13 | LinkedIn: https://www.linkedin.com/in/garima-nishad-9b8385134/ 14 | 15 | Follow me on Twitter: https://twitter.com/garima__nishad 16 | -------------------------------------------------------------------------------- /data_loader.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import os 3 | import torch 4 | import torch.utils.data as data 5 | from vocabulary import Vocabulary 6 | from PIL import Image 7 | from pycocotools.coco import COCO 8 | import numpy as np 9 | from tqdm import tqdm 10 | import random 11 | import json 12 | 13 | def get_loader(transform, 14 | mode='train', 15 | batch_size=1, 16 | vocab_threshold=None, 17 | vocab_file='./vocab.pkl', 18 | start_word="<start>", 19 | end_word="<end>", 20 | unk_word="<unk>", 21 | vocab_from_file=True, 22 | num_workers=0, 23 | cocoapi_loc='/opt'): 24 | """Returns the data loader. 25 | Args: 26 | transform: Image transform. 27 | mode: One of 'train' or 'test'. 28 | batch_size: Batch size (if in testing mode, must have batch_size=1). 29 | vocab_threshold: Minimum word count threshold. 30 | vocab_file: File containing the vocabulary. 31 | start_word: Special word denoting sentence start. 32 | end_word: Special word denoting sentence end. 33 | unk_word: Special word denoting unknown words. 34 | vocab_from_file: If False, create vocab from scratch & override any existing vocab_file. 35 | If True, load vocab from existing vocab_file, if it exists. 36 | num_workers: Number of subprocesses to use for data loading 37 | cocoapi_loc: The location of the folder containing the COCO API: https://github.com/cocodataset/cocoapi 38 | """ 39 | 40 | assert mode in ['train', 'test'], "mode must be one of 'train' or 'test'." 41 | if vocab_from_file==False: assert mode=='train', "To generate vocab from captions file, must be in training mode (mode='train')." 42 | 43 | # Based on mode (train, val, test), obtain img_folder and annotations_file. 44 | if mode == 'train': 45 | if vocab_from_file==True: assert os.path.exists(vocab_file), "vocab_file does not exist. Change vocab_from_file to False to create vocab_file." 46 | img_folder = os.path.join(cocoapi_loc, 'cocoapi/images/train2014/') 47 | annotations_file = os.path.join(cocoapi_loc, 'cocoapi/annotations/captions_train2014.json') 48 | if mode == 'test': 49 | assert batch_size==1, "Please change batch_size to 1 if testing your model." 50 | assert os.path.exists(vocab_file), "Must first generate vocab.pkl from training data." 51 | assert vocab_from_file==True, "Change vocab_from_file to True." 52 | img_folder = os.path.join(cocoapi_loc, 'cocoapi/images/test2014/') 53 | annotations_file = os.path.join(cocoapi_loc, 'cocoapi/annotations/image_info_test2014.json') 54 | 55 | # COCO caption dataset.
56 | dataset = CoCoDataset(transform=transform, 57 | mode=mode, 58 | batch_size=batch_size, 59 | vocab_threshold=vocab_threshold, 60 | vocab_file=vocab_file, 61 | start_word=start_word, 62 | end_word=end_word, 63 | unk_word=unk_word, 64 | annotations_file=annotations_file, 65 | vocab_from_file=vocab_from_file, 66 | img_folder=img_folder) 67 | 68 | if mode == 'train': 69 | # Randomly sample a caption length, and sample indices with that length. 70 | indices = dataset.get_train_indices() 71 | # Create and assign a batch sampler to retrieve a batch with the sampled indices. 72 | initial_sampler = data.sampler.SubsetRandomSampler(indices=indices) 73 | # data loader for COCO dataset. 74 | data_loader = data.DataLoader(dataset=dataset, 75 | num_workers=num_workers, 76 | batch_sampler=data.sampler.BatchSampler(sampler=initial_sampler, 77 | batch_size=dataset.batch_size, 78 | drop_last=False)) 79 | else: 80 | data_loader = data.DataLoader(dataset=dataset, 81 | batch_size=dataset.batch_size, 82 | shuffle=True, 83 | num_workers=num_workers) 84 | 85 | return data_loader 86 | 87 | class CoCoDataset(data.Dataset): 88 | 89 | def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, 90 | end_word, unk_word, annotations_file, vocab_from_file, img_folder): 91 | self.transform = transform 92 | self.mode = mode 93 | self.batch_size = batch_size 94 | self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word, 95 | end_word, unk_word, annotations_file, vocab_from_file) 96 | self.img_folder = img_folder 97 | if self.mode == 'train': 98 | self.coco = COCO(annotations_file) 99 | self.ids = list(self.coco.anns.keys()) 100 | print('Obtaining caption lengths...') 101 | all_tokens = [nltk.tokenize.word_tokenize(str(self.coco.anns[self.ids[index]]['caption']).lower()) for index in tqdm(np.arange(len(self.ids)))] 102 | self.caption_lengths = [len(token) for token in all_tokens] 103 | else: 104 | test_info = json.loads(open(annotations_file).read()) 105 | self.paths = [item['file_name'] for item in test_info['images']] 106 | 107 | def __getitem__(self, index): 108 | # obtain image and caption if in training mode 109 | if self.mode == 'train': 110 | ann_id = self.ids[index] 111 | caption = self.coco.anns[ann_id]['caption'] 112 | img_id = self.coco.anns[ann_id]['image_id'] 113 | path = self.coco.loadImgs(img_id)[0]['file_name'] 114 | 115 | # Convert image to tensor and pre-process using transform 116 | image = Image.open(os.path.join(self.img_folder, path)).convert('RGB') 117 | image = self.transform(image) 118 | 119 | # Convert caption to tensor of word ids. 
120 | tokens = nltk.tokenize.word_tokenize(str(caption).lower()) 121 | caption = [] 122 | caption.append(self.vocab(self.vocab.start_word)) 123 | caption.extend([self.vocab(token) for token in tokens]) 124 | caption.append(self.vocab(self.vocab.end_word)) 125 | caption = torch.Tensor(caption).long() 126 | 127 | # return pre-processed image and caption tensors 128 | return image, caption 129 | 130 | # obtain image if in test mode 131 | else: 132 | path = self.paths[index] 133 | 134 | # Convert image to tensor and pre-process using transform 135 | PIL_image = Image.open(os.path.join(self.img_folder, path)).convert('RGB') 136 | orig_image = np.array(PIL_image) 137 | image = self.transform(PIL_image) 138 | 139 | # return original image and pre-processed image tensor 140 | return orig_image, image 141 | 142 | def get_train_indices(self): 143 | sel_length = np.random.choice(self.caption_lengths) 144 | all_indices = np.where([self.caption_lengths[i] == sel_length for i in np.arange(len(self.caption_lengths))])[0] 145 | indices = list(np.random.choice(all_indices, size=self.batch_size)) 146 | return indices 147 | 148 | def __len__(self): 149 | if self.mode == 'train': 150 | return len(self.ids) 151 | else: 152 | return len(self.paths) -------------------------------------------------------------------------------- /model (2).py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torchvision.models as models 4 | 5 | 6 | class EncoderCNN(nn.Module): 7 | def __init__(self, embed_size): 8 | super(EncoderCNN, self).__init__() 9 | resnet = models.resnet50(pretrained=True) 10 | for param in resnet.parameters(): 11 | param.requires_grad_(False) 12 | 13 | modules = list(resnet.children())[:-1] 14 | self.resnet = nn.Sequential(*modules) 15 | self.embed = nn.Linear(resnet.fc.in_features, embed_size) 16 | 17 | def forward(self, images): 18 | features = self.resnet(images) 19 | features = features.view(features.size(0), -1) 20 | features = self.embed(features) 21 | return features 22 | 23 | 24 | class DecoderRNN(nn.Module): 25 | def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1): 26 | super(DecoderRNN, self).__init__() 27 | 28 | self.hidden_size = hidden_size 29 | self.vocab_size = vocab_size 30 | self.embed = nn.Embedding(num_embeddings = vocab_size, 31 | embedding_dim = embed_size) 32 | 33 | self.lstm = nn.LSTM(input_size = embed_size, 34 | hidden_size = hidden_size, 35 | num_layers = num_layers, 36 | batch_first = True) 37 | 38 | self.linear = nn.Linear(in_features = hidden_size, 39 | out_features = vocab_size) 40 | 41 | 42 | 43 | 44 | def forward(self, features, captions): 45 | captions = captions[:, :-1] 46 | embedding = self.embed(captions) 47 | embedding = torch.cat((features.unsqueeze(dim = 1), embedding), dim = 1) 48 | lstm_out, hidden = self.lstm(embedding) 49 | outputs = self.linear(lstm_out) 50 | return outputs 51 | 52 | def sample(self, inputs, states=None, max_len=20): 53 | " accepts pre-processed image tensor (inputs) and returns predicted sentence (list of tensor ids of length max_len) " 54 | predicted_sentence = [] 55 | for index in range(max_len): 56 | 57 | 58 | lstm_out, states = self.lstm(inputs, states) 59 | 60 | 61 | lstm_out = lstm_out.squeeze(1) 62 | outputs = self.linear(lstm_out) 63 | 64 | 65 | target = outputs.max(1)[1] 66 | 67 | 68 | predicted_sentence.append(target.item()) 69 | 70 | 71 | inputs = self.embed(target).unsqueeze(1) 72 | 73 | return predicted_sentence 
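As a quick illustration of the tensor shapes produced by the `DecoderRNN` defined above, here is a minimal sketch (not a file in this repository). It assumes `model (2).py` has been saved as `model.py`, the name under which the notebooks import it, and it uses a small made-up vocabulary size in place of the real one (9955 in the training run).

```python
# Illustrative sketch only (not a repository file). Assumes "model (2).py" is saved
# as model.py, as the notebooks import it; vocab_size here is a made-up stand-in.
import torch
from model import DecoderRNN

embed_size, hidden_size, vocab_size = 300, 512, 100
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

features = torch.randn(10, embed_size)             # stands in for EncoderCNN output
captions = torch.randint(0, vocab_size, (10, 15))  # batch of ten 15-token captions

# forward(): drop the last caption token, prepend the image feature, run the LSTM.
outputs = decoder(features, captions)
print(outputs.shape)                               # torch.Size([10, 15, 100])

# sample(): greedy decoding from a single image feature of shape [1, 1, embed_size].
word_ids = decoder.sample(features[:1].unsqueeze(1))
print(len(word_ids))                               # max_len (20) predicted token ids
```

With the real vocabulary, the same forward pass yields the torch.Size([10, 15, 9955]) output quoted in the training notebook; `sample` always returns `max_len` ids because its loop has no early stop on the end token.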
-------------------------------------------------------------------------------- /vocabulary.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import pickle 3 | import os.path 4 | from pycocotools.coco import COCO 5 | from collections import Counter 6 | 7 | class Vocabulary(object): 8 | 9 | def __init__(self, 10 | vocab_threshold, 11 | vocab_file='./vocab.pkl', 12 | start_word="<start>", 13 | end_word="<end>", 14 | unk_word="<unk>", 15 | annotations_file='../cocoapi/annotations/captions_train2014.json', 16 | vocab_from_file=False): 17 | """Initialize the vocabulary. 18 | Args: 19 | vocab_threshold: Minimum word count threshold. 20 | vocab_file: File containing the vocabulary. 21 | start_word: Special word denoting sentence start. 22 | end_word: Special word denoting sentence end. 23 | unk_word: Special word denoting unknown words. 24 | annotations_file: Path for train annotation file. 25 | vocab_from_file: If False, create vocab from scratch & override any existing vocab_file 26 | If True, load vocab from existing vocab_file, if it exists 27 | """ 28 | self.vocab_threshold = vocab_threshold 29 | self.vocab_file = vocab_file 30 | self.start_word = start_word 31 | self.end_word = end_word 32 | self.unk_word = unk_word 33 | self.annotations_file = annotations_file 34 | self.vocab_from_file = vocab_from_file 35 | self.get_vocab() 36 | 37 | def get_vocab(self): 38 | """Load the vocabulary from file OR build the vocabulary from scratch.""" 39 | if os.path.exists(self.vocab_file) & self.vocab_from_file: 40 | with open(self.vocab_file, 'rb') as f: 41 | vocab = pickle.load(f) 42 | self.word2idx = vocab.word2idx 43 | self.idx2word = vocab.idx2word 44 | print('Vocabulary successfully loaded from vocab.pkl file!') 45 | else: 46 | self.build_vocab() 47 | with open(self.vocab_file, 'wb') as f: 48 | pickle.dump(self, f) 49 | 50 | def build_vocab(self): 51 | """Populate the dictionaries for converting tokens to integers (and vice-versa).""" 52 | self.init_vocab() 53 | self.add_word(self.start_word) 54 | self.add_word(self.end_word) 55 | self.add_word(self.unk_word) 56 | self.add_captions() 57 | 58 | def init_vocab(self): 59 | """Initialize the dictionaries for converting tokens to integers (and vice-versa).""" 60 | self.word2idx = {} 61 | self.idx2word = {} 62 | self.idx = 0 63 | 64 | def add_word(self, word): 65 | """Add a token to the vocabulary.""" 66 | if not word in self.word2idx: 67 | self.word2idx[word] = self.idx 68 | self.idx2word[self.idx] = word 69 | self.idx += 1 70 | 71 | def add_captions(self): 72 | """Loop over training captions and add all tokens to the vocabulary that meet or exceed the threshold.""" 73 | coco = COCO(self.annotations_file) 74 | counter = Counter() 75 | ids = coco.anns.keys() 76 | for i, id in enumerate(ids): 77 | caption = str(coco.anns[id]['caption']) 78 | tokens = nltk.tokenize.word_tokenize(caption.lower()) 79 | counter.update(tokens) 80 | 81 | if i % 100000 == 0: 82 | print("[%d/%d] Tokenizing captions..." % (i, len(ids))) 83 | 84 | words = [word for word, cnt in counter.items() if cnt >= self.vocab_threshold] 85 | 86 | for i, word in enumerate(words): 87 | self.add_word(word) 88 | 89 | def __call__(self, word): 90 | if not word in self.word2idx: 91 | return self.word2idx[self.unk_word] 92 | return self.word2idx[word] 93 | 94 | def __len__(self): 95 | return len(self.word2idx) 96 | --------------------------------------------------------------------------------
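The `Vocabulary` class above is what `CoCoDataset` uses to turn each caption string into the tensor of word ids fed to the decoder. The round-trip sketch below (not a file in this repository) shows that mapping; it assumes a `vocab.pkl` written by an earlier training run sits in the working directory, that the script is run from the repository root so pickle can import `vocabulary.Vocabulary`, and that NLTK's `punkt` tokenizer data is installed.

```python
# Illustrative sketch only (not a repository file). Assumes ./vocab.pkl was written
# by a previous training run and that this is run from the repository root, so that
# pickle can reconstruct the Vocabulary class defined in vocabulary.py.
import pickle
import nltk

with open('./vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)          # a Vocabulary instance (the class pickles itself)

caption = "A dog runs across the grass."
tokens = nltk.tokenize.word_tokenize(caption.lower())

# Same encoding as CoCoDataset.__getitem__: <start> + tokens + <end>, with any word
# below the count threshold mapped to <unk> by Vocabulary.__call__.
ids = [vocab(vocab.start_word)] + [vocab(t) for t in tokens] + [vocab(vocab.end_word)]
print(ids)
print([vocab.idx2word[i] for i in ids])
```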