├── Implementing Faster R-CNN.ipynb ├── README.md ├── download_checkpoint.sh ├── images ├── bicycles.jpg ├── cats.jpg ├── horse.jpg ├── kids.jpg ├── kittens.png └── woman.jpg ├── setup.py └── workshop ├── __init__.py ├── faster.py ├── image.py ├── io.py ├── resnet.py └── vis.py /Implementing Faster R-CNN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing Faster R-CNN\n", 8 | "\n", 9 | "The objective of this activity is to implement the main parts of the Faster R-CNN algorithm.\n", 10 | "\n", 11 | "We will:\n", 12 | "* understand intuitively how the parts of the algorithm are working and how they fit together.\n", 13 | "* implement all the stages required for **inference** one by one, using Python, leveraging numpy and TensorFlow.\n", 14 | "* use **existing weights** of a Faster R-CNN model that has been trained on the [COCO Dataset](http://cocodataset.org/), to guide and facilitate the process.\n", 15 | "\n", 16 | "We will **NOT**:\n", 17 | "* implement any training code whatsoever. We won't code any loss function or deal with ground truth boxes.\n", 18 | "* train the model, as we already have weights for you that work.\n", 19 | "\n", 20 | "We've tried to keep code in the notebooks to a minimum, mainly data manipulation and visualization, to make it easy enough to follow. All accompanying code is under the `workshop` Python package.\n", 21 | "\n", 22 | "After some introductory code, the notebook will continue as follows:\n", 23 | "* Playing with a **pre-trained ResNet** to obtain features out of an image.\n", 24 | "* Generate regions of interest by implementing the **Region Proposal Network** detailed in [1].\n", 25 | "* Prepare this regions to be fed to the second stage, by applying **RoI pooling**.\n", 26 | "* Classify and refine said regions by passing them through an **R-CNN**, as detailed in [2].\n", 27 | "\n", 28 | "We'll present you with stubs for the different functions required and your task will be to fill them in.\n", 29 | "\n", 30 | "* [1] Ren, Shaoqing, et al. *Faster R-CNN: Towards real-time object detection with region proposal networks.*\n", 31 | "* [2] Girshick, Ross. 
*Fast R-CNN.*" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "---\n", 39 | "# The basics\n", 40 | "We'll start with some imports.\n", 41 | "\n", 42 | "The local imports are under the `workshop` package, which you should have installed using `pip install -e workshop/` in the environment you're running your notebook on.\n", 43 | "\n", 44 | "Within `workshop` we have some modules:\n", 45 | "* `vis`: various visualization utilities to draw bounding boxes, sliders, etc.\n", 46 | "* `image`: utilities for reading images and loading them into PIL (the imaging library).\n", 47 | "* `resnet`: the implementation for the base network we're going to use (more on this shortly).\n", 48 | "* `faster`: utilities and parts we won't be implementing but provide for completeness' sake.\n", 49 | "\n", 50 | "Let's test some things to make sure everything is up and ready to go.\n", 51 | "\n", 52 | "Start by running the following in your terminal, and then test the rest of the imports:\n", 53 | "```bash\n", 54 | " $ jupyter nbextension enable --py widgetsnbextension\n", 55 | " ```" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "from ipywidgets import interact, Checkbox, FloatSlider, Layout\n", 67 | "\n", 68 | "import json\n", 69 | "import matplotlib.pyplot as plt\n", 70 | "import numpy as np\n", 71 | "import os\n", 72 | "import tensorflow as tf\n", 73 | "import tensorflow.contrib.eager as tfe\n", 74 | "\n", 75 | "from PIL import Image\n", 76 | "\n", 77 | "\n", 78 | "# Try to enable TF eager execution, or do nothing if running again.\n", 79 | "try:\n", 80 | " tf.enable_eager_execution()\n", 81 | "except ValueError:\n", 82 | " # Already executed.\n", 83 | " pass\n", 84 | "\n", 85 | "\n", 86 | "# Local imports.\n", 87 | "from workshop.faster import (\n", 88 | " clip_boxes, rcnn_proposals, run_base_network, run_resnet_tail,\n", 89 | " generate_anchors_reference, sort_anchors\n", 90 | ")\n", 91 | "from workshop.image import open_all_images, open_image, to_image\n", 92 | "from workshop.vis import (\n", 93 | " add_rectangle, draw_bboxes, draw_bboxes_with_labels, image_grid,\n", 94 | " pager, vis_anchors\n", 95 | ")\n", 96 | "\n", 97 | "# Notebook-specific settings.\n", 98 | "%matplotlib inline" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "We'll now load some images to play with and display them below. Change which image is passed to the `to_image` function to see it in full size." 
106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "images = open_all_images('images/')\n", 115 | "\n", 116 | "axes = image_grid(len(images))\n", 117 | "for ax, (name, image) in zip(axes, images.items()):\n", 118 | " ax.imshow(np.squeeze(image))\n", 119 | " ax.set_title(name)\n", 120 | "\n", 121 | "plt.subplots_adjust(wspace=.01)\n", 122 | "plt.show()\n", 123 | "\n", 124 | "image = images['woman']\n", 125 | "\n", 126 | "# `to_image` turns a `numpy.ndarray` into a PIL image, so it's displayed by the notebook.\n", 127 | "to_image(image)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "source": [ 136 | "---\n", 137 | "# The base network: ResNet\n", 138 | "\n", 139 | "\n", 140 | "The basis for the Faster R-CNN algorithm is to leverage a pre-trained classifier network to extract feature maps (also called *activation maps*) from the image. For this implementation, we'll be using the popular ResNet 101 [3].\n", 141 | "\n", 142 | "We provide the implementation itself (which you can see in the `workshop.resnet` module), as well as a checkpoint with the pre-trained weights (in the `checkpoint/` directory).\n", 143 | "\n", 144 | "---\n", 145 | "\n", 146 | "### Aside\n", 147 | "\n", 148 | "The ResNet architecture consists of four stacked **blocks**, after which a fully-connected layer is attached. As is expected of CNNs, these blocks detect features from most simple to most complex. For this part of the algorithm, we're using the output of the **block 3**, so we get somewhat generic features. The intuition is that, if we go all the way and use block 4, we might have things that are too specific to the dataset used to pre-train the ResNet (the Imagenet dataset) and thus not as desirable for a network that wants to identify generic objects. \n", 149 | "\n", 150 | "---\n", 151 | "\n", 152 | "Run the base network on different images, in order to see how the different activation maps behave. **Can you notice any particular features being detected in the activation maps?**\n", 153 | "\n", 154 | "* [3] He, Kaiming, et al. 
*Deep residual learning for image recognition.*" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "with tfe.restore_variables_on_create('checkpoint/fasterrcnn'):\n", 166 | " feature_map = run_base_network(image)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "scrolled": false 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "@interact(page=pager(1024, 20, 'Feature map'))\n", 178 | "def display_feature_maps(page):\n", 179 | " axes = image_grid(20)\n", 180 | " for idx, ax in enumerate(axes):\n", 181 | " if page * 20 + idx >= 1024:\n", 182 | " break\n", 183 | " ax.imshow(\n", 184 | " feature_map.numpy()[0, :, :, page * 20 + idx],\n", 185 | " cmap='gray', aspect='auto'\n", 186 | " )\n", 187 | "\n", 188 | " plt.subplots_adjust(wspace=.01, hspace=.01)\n", 189 | " plt.show()" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## Learn: understand what patterns activate particular filters.\n", 197 | "\n", 198 | "Let's overlay the feature maps into the images themselves, so we can take a more detailed look into what pattern makes the ResNet react.\n", 199 | "\n", 200 | "See, for example:\n", 201 | "* Feature map 171 in `woman`.\n", 202 | "* Feature maps 19, 22 in `cats`.\n", 203 | "* Feature map 34, 64 in `bicycles`.\n", 204 | "* Feature map 253 in `kids`." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "@interact(idx=pager(1024, 1, 'Feature map index'))\n", 214 | "def overlay_feature_map(idx):\n", 215 | " # Normalize the feature map so we get the whole range of colors.\n", 216 | " fm = (\n", 217 | " feature_map.numpy()[0, :, :, idx]\n", 218 | " / feature_map.numpy()[0, :, :, idx].max()\n", 219 | " * 255\n", 220 | " ).astype(np.uint8)\n", 221 | " \n", 222 | " # Resize the feature map without interpolation.\n", 223 | " fm_image = Image.fromarray(fm, mode='L').convert('RGBA')\n", 224 | " fm_image = fm_image.resize(image.shape[1:3][::-1], resample=Image.NEAREST)\n", 225 | " \n", 226 | " # Add some alpha to overlay it over the image.\n", 227 | " fm_image.putalpha(200)\n", 228 | " \n", 229 | " base_image = to_image(image)\n", 230 | " base_image.paste(fm_image, (0, 0), fm_image)\n", 231 | " \n", 232 | " return base_image" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "This section didn't require any implementation at all, but get ready, because we're about to. The main idea here was illustrating what we mean when we say that the later layers of a classification network are **feature detectors**, reacting to particular patterns in an image.\n", 240 | "\n", 241 | "What would you do if you were to use this information to detect objects? How could you leverage the fact that we can say \"there's a cat ear here!\"? We're going to explore these questions in the following sections.\n", 242 | "\n", 243 | "For now, back to the slides!" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "---\n", 251 | "# Finding stuff with the Region Proposal Network\n", 252 | "\n", 253 | "Having gone through the theory, we'll now turn our attention to implementing a **Region Proposal Network**. 
The idea, as we've seen, is to use the feature maps provided by the ResNet to find out **where** there might be an object located.\n", 254 | "\n", 255 | "This is where **anchors** come into play. We'll take a grid of points over the image and consider several anchors (also called \"reference boxes\" sometimes) for each of them (15 in this case). The RPN layers themselves will then predict whether there's an object in each of these 15 boxes **and** how much we need to resize them to better fit it.\n", 256 | "\n", 257 | "The tasks we have ahead of us are, thus:\n", 258 | "* Get the **coordinates** $(x_{min}, y_{min}, x_{max}, y_{max})$ for each of the anchors. There are $15$ anchors and the centers will be separated by approximately $16$ pixels, so we're talking about several thousand of coordinates.\n", 259 | "* Find out how to do the special **encoding** and **decoding** of coordinates described in the Faster R-CNN paper, so the RPN can predict locations in the image.\n", 260 | "* Build the **convolutional layers** comprising the RPN and run them through different images.\n", 261 | "* **Translate the predictions** of the RPN layer into usable proposals.\n", 262 | "\n", 263 | "### Note: coordinate conventions\n", 264 | "Except in specific cases, we'll be using the convention $(x_{min}, y_{min}, x_{max}, y_{max})$ to denote a bounding box, were $(x_{min}, y_{min})$ corresponds to the top-left point and $(x_{max}, y_{max})$ the bottom right.\n", 265 | "\n", 266 | "As usual with image processing, the origin of the coordinate system, $(0, 0)$, is on the top-left of the image." 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "## Generating anchors\n", 274 | "\n", 275 | "We'll get the anchor's coordinates in two steps. First, we will use a function called `generate_anchors_reference` which, given the anchors' settings (i.e. size, aspect ratio, scales), returns an array with the coordinates for the boxes (in pixel space) assuming they're centered around (0, 0). This will give us, effectively, a $(15, 4)$ array.\n", 276 | "\n", 277 | "There are three settings for the anchors:\n", 278 | "\n", 279 | "* `base_size`: **side length for a square anchor**, in pixels (e.g. 256). Increasing it makes the anchor cover more area of the image.\n", 280 | "* `scales`: **scale factors** to consider taking `base_size` as reference. For instance, a scale of `2` will make the effective size `512` if base size was `256`.\n", 281 | "* `aspect_ratios`: **aspect ratios** of the anchors, expressed as the value of `height / width`. Note that *changing the aspect ratio doesn't change the area the anchor covers*. 
An aspect ratio of `2` means that, for the area covered by a square anchor of of `base_size`, we should get a rectangle of twice the height than width.\n", 282 | "\n", 283 | "Let's see how this looks like using a single aspect ratio and 3 scales:" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "anchors_ref = generate_anchors_reference(\n", 293 | " 256, # Base size.\n", 294 | " [1], # Aspect ratios.\n", 295 | " [0.5, 1, 2], # Scales.\n", 296 | ")\n", 297 | "\n", 298 | "vis_anchors(anchors_ref)\n", 299 | "\n", 300 | "# Remember this is just a numpy array of shape\n", 301 | "# (total_aspect_ratios * total_scales, 4)\n", 302 | "# with the corner points of the reference anchors using the\n", 303 | "# convention (x_min, y_min, x_max, y_max).\n", 304 | "print('As a numpy array:')\n", 305 | "print(anchors_ref)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "Let's now change the aspect ratio, but keep the same scales:" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "anchors_ref = generate_anchors_reference(\n", 322 | " 256, # Base size.\n", 323 | " [0.5], # Aspect ratios.\n", 324 | " [0.5, 1, 2], # Scales.\n", 325 | ")\n", 326 | "\n", 327 | "vis_anchors(anchors_ref)\n", 328 | "\n", 329 | "print('As a numpy array:')\n", 330 | "print(anchors_ref)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "As we are now using $0.5$ as aspect ratio, it means the width/height relation for each anchor should be equal to that (so the rectangles are elongated).\n", 338 | "\n", 339 | "Now, let's try using a single scale, but varying the aspect ratios:" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "anchors_ref = generate_anchors_reference(\n", 349 | " 256, # Base size.\n", 350 | " [0.5, 1, 2], # Aspect ratios.\n", 351 | " [2], # Scales.\n", 352 | ")\n", 353 | "\n", 354 | "vis_anchors(anchors_ref)\n", 355 | "\n", 356 | "print('As a numpy array:')\n", 357 | "print(anchors_ref)" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "In this case, every anchor depicted here should have the same area, since they are all the same **scale** and only vary in their **aspect ratio**.\n", 365 | "\n", 366 | "Finally, let's generate the final set of **15 anchor references** centered around (0,0):" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "anchors_ref = generate_anchors_reference(\n", 376 | " 256, # Base size.\n", 377 | " [0.5, 1, 2], # Aspect ratios.\n", 378 | " [0.125, 0.25, 0.5, 1, 2], # Scales.\n", 379 | ")\n", 380 | "\n", 381 | "vis_anchors(anchors_ref)\n", 382 | "\n", 383 | "print('As a numpy array:')\n", 384 | "print(anchors_ref)" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "For some perspective, let's draw these anchors over **a single arbitrary point** in the image, to see how they match.\n", 392 | "\n", 393 | "\n", 394 | "Keep in mind that `anchors_ref` is a numpy array containing 15 values, where each one is a rectangle represented as $(x_{min}, y_{min}, x_{max}, y_{max})$.\n", 395 | "\n", 396 | 
"\n", 397 | "Since the anchor references are **centered around $(0, 0)$**, it is easy to translate them over any point $P$ by just adding up the coordinates: the anchor $(x_{min}, y_{min}, x_{max}, y_{max})$ at point $P = (x_p, y_p)$ would be specified by the coordinates $(x_{min} + x_p, y_{min} + y_p, x_{max} + x_p, y_{max} + y_p)$.\n", 398 | "\n", 399 | "Let's see how this looks like at point $(400, 270)$:" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "point = np.array([400, 270])\n", 409 | "\n", 410 | "# Sum the point on both the *_min and the *_max parts.\n", 411 | "anchors_at_point = anchors_ref + np.concatenate([point, point])\n", 412 | "\n", 413 | "_, ax = plt.subplots(1, figsize=(16, 20))\n", 414 | "ax.imshow(to_image(image))\n", 415 | "\n", 416 | "# Add a buffer around the image so we see the whole anchor references.\n", 417 | "ax.set_xlim([-100, image.shape[2] + 100])\n", 418 | "ax.set_ylim([image.shape[1] + 100, -100])\n", 419 | "\n", 420 | "for idx in range(anchors_at_point.shape[0]):\n", 421 | " add_rectangle(ax, anchors_at_point[idx, :])\n", 422 | "\n", 423 | "# Plot the reference point in use.\n", 424 | "ax.plot(point[0], point[1], marker='s', color='#dc3912', markersize=3)\n", 425 | "\n", 426 | "plt.show()" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "As you can see, the larger boxes cover quite a bit of the image, while the smaller ones will be useful for detecting very small objects.\n", 434 | "\n", 435 | "\n", 436 | "Now, our **first real coding task** (yes!) will be to do the same with the anchor references over each of the **anchor centers** in the image.\n", 437 | "\n", 438 | "\n", 439 | "Given that we're using a ResNet 101, which has a downsampling factor of 16 (i.e. every point in the feature map --block 3 as we said-- corresponds to a $16\\times16$ region of the original image), we'll select the centers **every 16 pixels** in each direction.\n", 440 | "\n", 441 | "For reference, the anchor centers are visualized below." 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": null, 447 | "metadata": {}, 448 | "outputs": [], 449 | "source": [ 450 | "# This is actually defined within `run_base_network`, but for visualization\n", 451 | "# purposes, we're defining it again.\n", 452 | "OUTPUT_STRIDE = 16\n", 453 | "\n", 454 | "# Print the anchor centers in use.\n", 455 | "_, ax = plt.subplots(1, figsize=(16, 20))\n", 456 | "\n", 457 | "ax.imshow(to_image(image))\n", 458 | "ax.set_xlim([-100, image.shape[2] + 100])\n", 459 | "ax.set_ylim([image.shape[1] + 100, -100])\n", 460 | "\n", 461 | "for x in range(0, image.shape[2], OUTPUT_STRIDE):\n", 462 | " for y in range(0, image.shape[1], OUTPUT_STRIDE):\n", 463 | " ax.plot(x, y, marker='s', color='#dc3912', markersize=3)\n", 464 | "\n", 465 | "plt.show()" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "Let's wrap this part by getting the entire list of anchors for the image. This will be done within the `generate_anchors` function.\n", 473 | "\n", 474 | "### Programming task: implement `generate_anchors` function." 
475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "# These are the anchor properties that will be used in our implementation.\n", 484 | "# Compared to the values picked in the original Faster R-CNN paper, we've\n", 485 | "# added two smaller scales that help the model detect the smaller objects\n", 486 | "# present in the COCO dataset.\n", 487 | "ANCHOR_BASE_SIZE = 256\n", 488 | "ANCHOR_RATIOS = [0.5, 1, 2]\n", 489 | "ANCHOR_SCALES = [0.125, 0.25, 0.5, 1, 2]\n", 490 | "\n", 491 | "\n", 492 | "def generate_anchors(feature_map_shape):\n", 493 | " \"\"\"Generate anchors for an image.\n", 494 | "\n", 495 | " Using the feature map (the output of the pretrained network for an image)\n", 496 | " and the anchor references (generated using the specified anchor sizes and\n", 497 | " ratios), we generate a list of anchors.\n", 498 | "\n", 499 | " Anchors are just fixed bounding boxes of different ratios and sizes that\n", 500 | " are uniformly generated throughout the image.\n", 501 | "\n", 502 | " Arguments:\n", 503 | " feature_map_shape: Shape of the convolutional feature map used as\n", 504 | " input for the RPN.\n", 505 | " Should be (batch, feature_height, feature_width, depth).\n", 506 | "\n", 507 | " Returns:\n", 508 | " all_anchors: A Tensor with the anchors at every spatial position, of\n", 509 | " shape `(feature_height, feature_width, num_anchors_per_points, 4)`\n", 510 | " using the (x1, y1, x2, y2) convention.\n", 511 | " \"\"\"\n", 512 | "\n", 513 | " anchor_reference = generate_anchors_reference(\n", 514 | " ANCHOR_BASE_SIZE, ANCHOR_RATIOS, ANCHOR_SCALES\n", 515 | " )\n", 516 | " \n", 517 | " # Tip: first, implement it with regular Python loops.\n", 518 | " #\n", 519 | " # If you have time and want to try to do it in a vectorized way, the\n", 520 | " # following functions might be of use: `tf.meshgrid`, `tf.range`,\n", 521 | " # `tf.expand_dims` and `tf.transpose`.\n", 522 | " \n", 523 | " ####\n", 524 | " # Fill this function below, paying attention to the docstring.\n", 525 | " ####\n", 526 | " \n", 527 | " ####\n", 528 | "\n", 529 | " return all_anchors\n", 530 | "\n", 531 | "\n", 532 | "anchors = tf.reshape(generate_anchors(feature_map.shape), [-1, 4])\n", 533 | "\n", 534 | "print('Anchors (real image size):')\n", 535 | "print()\n", 536 | "print(anchors.numpy())" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "Let's draw the anchors over an arbitrary point to corroborate that the results makes sense." 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": null, 549 | "metadata": {}, 550 | "outputs": [], 551 | "source": [ 552 | "# Visualize the anchors. 
Try changing to different points of the image.\n", 553 | "# Note that we're referring to _positions in the feature map_ here, so the\n", 554 | "# actual point in the image will be around `OUTPUT_STRIDE` times the value.\n", 555 | "point = np.array([30, 20])\n", 556 | "\n", 557 | "# Reshape back to (H, W, num_anchors, 4) so we can easily get a given point's anchors.\n", 558 | "anchors_at_point = anchors.numpy().reshape(\n", 559 | " (feature_map.shape[1], feature_map.shape[2], 15, 4)\n", 560 | ")[point[1], point[0], :, :]\n", 561 | "\n", 562 | "_, ax = plt.subplots(1, figsize=(16, 20))\n", 563 | "\n", 564 | "ax.imshow(to_image(image))\n", 565 | "ax.set_xlim([-100, image.shape[2] + 100])\n", 566 | "ax.set_ylim([image.shape[1] + 100, -100])\n", 567 | "\n", 568 | "for idx in range(anchors_at_point.shape[0]):\n", 569 | " add_rectangle(ax, anchors_at_point[idx, :])\n", 570 | "\n", 571 | "# Plot the reference point in use.\n", 572 | "ax.plot(\n", 573 | " point[0] * OUTPUT_STRIDE,\n", 574 | " point[1] * OUTPUT_STRIDE,\n", 575 | " marker='s', color='#dc3912', markersize=3\n", 576 | ")\n", 577 | "\n", 578 | "plt.show()" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "With this we've finished generating the anchors that will be used by the RPN. This is, effectively, a list of $15 \\times F_x \\times F_y$, where $F_x, F_y$ are the feature map width and height, respectively." 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "## Encoding and decoding bounding box coordinates\n", 593 | "\n", 594 | "\n", 595 | "Deep neural networks usually train and converge better when their outputs have zero mean and unit variance (and/or their intermediate values do so). Due to this, and the difficulty in predicting values in a possibly unbounded region (pixel coordinates), a special encoding is applied to the coordinates before passing them in to the network (and after getting them out).\n", 596 | "\n", 597 | "The idea behind the encoding is to express the coordinates of a bounding box $B$ as a set of four numbers $(D_x, D_y, D_w, D_h)$ (the **deltas**) and a reference anchor $R$. $D_x$ and $D_y$ indicate how much the center of $R$ should be moved to reach the center of $B$, normalized by the size of $R$, while $D_w$ and $D_h$ indicate how much the width and height of $R$ must be increased or decreased to reach the size of $B$ (it's actually the log of that value, as you'll see below).\n", 598 | "\n", 599 | "For the following equations, we change from the $(x_{min}, y_{min}, x_{max}, y_{max})$ encoding to the **center+dimensions encoding** $(x, y, w, h)$, where $(x, y)$ are the **center coordinates**, and $(w, h)$ the **width and height**. The equations to encode $B = (x_b, y_b, w_b, h_b)$ with respect to anchor $R = (x_r, y_r, w_r, h_r)$ are, then:\n", 600 | "\n", 601 | "$D_x = \\frac{x_b - x_r}{w_r} \\quad$\n", 602 | "$D_y = \\frac{y_b - y_r}{h_r} \\quad$\n", 603 | "$D_w = log \\frac{w_b}{w_r} \\quad$\n", 604 | "$D_h = log \\frac{h_b}{h_r}$\n", 605 | "\n", 606 | "The equations to decode $B = (x_b, y_b, w_b, h_b)$ given $R = (x_r, y_r, w_r, h_r)$ and deltas $D = (D_x, D_y, D_w, D_h)$ are:\n", 607 | "\n", 608 | "$x_b = D_x w_r + x_r \\quad$\n", 609 | "$y_b = D_y h_r + y_r \\quad$\n", 610 | "$w_b = e^{D_w} w_r \\quad$\n", 611 | "$h_b = e^{D_h} h_r \\quad$\n", 612 | "\n", 613 | "We'll implement two functions here, `encode` and `decode`. 
While only the latter will be used, it's useful to implement both in order to understand the whole process and to make it easier to test." 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "### Programming task: implement `get_dimensions_and_center` function." 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": { 627 | "collapsed": true 628 | }, 629 | "outputs": [], 630 | "source": [ 631 | "# You might find it useful to implement the following function first in order\n", 632 | "# to obtain the dimensions (width and height) and center of a bounding box,\n", 633 | "# required for calculating the deltas in `encode` and `decode`.\n", 634 | "def get_dimensions_and_center(bboxes):\n", 635 | " \"\"\"Obtain width, height and center coordinates of a bounding box.\n", 636 | " \n", 637 | " Arugments:\n", 638 | " bboxes: Tensor of shape (num_bboxes, 4).\n", 639 | " \n", 640 | " Returns:\n", 641 | " Tuple of Tensors of shape (num_bboxes,) with the values\n", 642 | " width, height, center_x and center_y corresponding to each\n", 643 | " bounding box.\n", 644 | " \"\"\"\n", 645 | " \n", 646 | " # Tip: Fully read the docstring above.\n", 647 | " # Tip: You may find the Tensorflow function `tf.split` useful.\n", 648 | "\n", 649 | " ####\n", 650 | " # Fill this function below, paying attention to the docstring.\n", 651 | " ####\n", 652 | " \n", 653 | " ####\n", 654 | "\n", 655 | " return width, height, ctx, cty" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "### Programming task: implement `encode` function and play around with the checks at the bottom." 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "def encode(anchors, bboxes):\n", 672 | " \"\"\"Encode bounding boxes as deltas w.r.t. anchors.\n", 673 | " \n", 674 | " Arguments:\n", 675 | " anchors: Tensor of shape (num_bboxes, 4). With the same bbox\n", 676 | " encoding.\n", 677 | " bboxes: Tensor of shape (num_bboxes, 4). Having the bbox\n", 678 | " encoding in the (x_min, y_min, x_max, y_max) order.\n", 679 | "\n", 680 | " Returns:\n", 681 | " Tensor of shape (num_bboxes, 4) with the different deltas needed\n", 682 | " to transform `anchors` to `bboxes`. These deltas are with\n", 683 | " regard to the center, width and height of the two boxes.\n", 684 | " \"\"\"\n", 685 | " \n", 686 | " ####\n", 687 | " # Fill this function below, paying attention to the docstring.\n", 688 | " ####\n", 689 | "\n", 690 | " ####\n", 691 | "\n", 692 | " return deltas\n", 693 | "\n", 694 | "\n", 695 | "# Encoding `bbox` with respect to an anchor having the same center\n", 696 | "# should keep the first two deltas at zero.\n", 697 | "anchor = np.array([[0, 0, 100, 100]], dtype=np.float32)\n", 698 | "bbox = np.array([[25, 25, 75, 75]], dtype=np.float32)\n", 699 | "print('With same center, first two deltas should be zero:\\n', encode(anchor, bbox).numpy())\n", 700 | "print()\n", 701 | "\n", 702 | "# Encoding `bbox` with respect to an anchor having the same size\n", 703 | "# should keep the last two deltas at zero.\n", 704 | "anchor = np.array([[0, 0, 100, 100]], dtype=np.float32)\n", 705 | "bbox = np.array([[50, 50, 150, 150]], dtype=np.float32)\n", 706 | "print('Same size, last two deltas should be zero:\\n', encode(anchor, bbox).numpy())\n", 707 | "\n", 708 | "# What other ways to check the functions can you think of?" 
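,
"\n",
"\n",
"# Extra (optional) sanity check, added as a sketch: compute the deltas by hand\n",
"# with numpy, straight from the formulas in the markdown above, and compare\n",
"# against `encode`. This assumes your `encode` uses width = x_max - x_min\n",
"# (no +1 pixel convention); adapt it if your implementation differs.\n",
"def manual_encode(anchors_arr, bboxes_arr):\n",
"    aw = anchors_arr[:, 2] - anchors_arr[:, 0]\n",
"    ah = anchors_arr[:, 3] - anchors_arr[:, 1]\n",
"    ax, ay = anchors_arr[:, 0] + aw / 2, anchors_arr[:, 1] + ah / 2\n",
"    bw = bboxes_arr[:, 2] - bboxes_arr[:, 0]\n",
"    bh = bboxes_arr[:, 3] - bboxes_arr[:, 1]\n",
"    bx, by = bboxes_arr[:, 0] + bw / 2, bboxes_arr[:, 1] + bh / 2\n",
"    return np.stack(\n",
"        [(bx - ax) / aw, (by - ay) / ah, np.log(bw / aw), np.log(bh / ah)],\n",
"        axis=1,\n",
"    )\n",
"\n",
"anchor = np.array([[10, 10, 110, 60]], dtype=np.float32)\n",
"bbox = np.array([[30, 20, 130, 120]], dtype=np.float32)\n",
"print('Manual deltas match `encode`:', np.allclose(\n",
"    manual_encode(anchor, bbox), encode(anchor, bbox).numpy(), atol=1e-4\n",
"))"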
709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "### Programming task: implement `decode` function, play around with the checks at the bottom.\n", 716 | "### Then, verify that that the round trip `encode -> decode` works as expected." 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [ 725 | "def decode(anchors, deltas):\n", 726 | " \"\"\"Decode bounding boxes by applying deltas to anchors.\n", 727 | " \n", 728 | " Arguments:\n", 729 | " anchors: Tensor of shape (num_bboxes, 4). Having the bbox\n", 730 | " encoding in the (x_min, y_min, x_max, y_max) order.\n", 731 | " deltas: Tensor of shape (num_bboxes, 4). Deltas (as returned by\n", 732 | " `encode`) that we want to apply to `bboxes`.\n", 733 | "\n", 734 | " Returns:\n", 735 | " Tensor of shape (num_bboxes, 4) with the decoded proposals,\n", 736 | " obtained by applying `deltas` to `anchors`.\n", 737 | " \"\"\"\n", 738 | " \n", 739 | " ####\n", 740 | " # Fill this function below, paying attention to the docstring.\n", 741 | " ####\n", 742 | "\n", 743 | " ####\n", 744 | "\n", 745 | " return bboxes\n", 746 | "\n", 747 | "\n", 748 | "# Decoding `anchor` with zero `deltas` should keep the box as-is.\n", 749 | "anchor = np.array([[25, 25, 75, 75]], dtype=np.float32)\n", 750 | "delta = np.array([[0, 0, 0, 0]], dtype=np.float32)\n", 751 | "print('Zero delta, should get a bounding box with same dimensions:')\n", 752 | "print('\\tAnchor:', anchor)\n", 753 | "print('\\tBounding box:', decode(anchor, delta).numpy())\n", 754 | "print()\n", 755 | "\n", 756 | "# Applying a `delta` with two ones at first then two zeros to an `anchor`\n", 757 | "# should get a bounding box of the size but moved to the right one-width.\n", 758 | "anchor = np.array([[25, 25, 75, 75]], dtype=np.float32)\n", 759 | "delta = np.array([[1, 1, 0, 0]], dtype=np.float32)\n", 760 | "print('First-two are ones, obtained box moved to the right one width:')\n", 761 | "print('\\tAnchor:', anchor)\n", 762 | "print('\\tBounding box:', decode(anchor, delta).numpy())\n", 763 | "print()\n", 764 | "\n", 765 | "# Decoding `anchor` with two zeros at first then two ones at `deltas`\n", 766 | "# should get a larger bounding box while maintaining the center.\n", 767 | "anchor = np.array([[25, 25, 75, 75]], dtype=np.float32)\n", 768 | "delta = np.array([[0, 0, 1, 1]], dtype=np.float32)\n", 769 | "print('Last-two are ones, center should be the same:')\n", 770 | "print('\\tAnchor:', anchor)\n", 771 | "print('\\tBounding box:', decode(anchor, delta).numpy())\n", 772 | "\n", 773 | "# What other ways to check the functions can you think of? How can\n", 774 | "# you pick the deltas so that it exactly doubles in size?" 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": {}, 780 | "source": [ 781 | "Let's test the round-trip of `encode` and `decode`, to see if they're consistent between them." 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "# Test the round-trip: encode `bboxes` w.r.t. 
the anchors `anchors`,\n", 791 | "# which gives us the deltas that transform `anchors` into `bboxes`.\n", 792 | "# Then decode the `anchors` with said deltas to see that, effectively,\n", 793 | "# we get `bboxes` back.\n", 794 | "anchor = np.array([\n", 795 | " [0, 0, 100, 100],\n", 796 | "], dtype=np.float32)\n", 797 | "\n", 798 | "# You can try out other bounding boxes, just make sure to respect the\n", 799 | "# convention of first putting (x_min, y_min) then (x_max, y_max), or\n", 800 | "# you may get an invalid bounding box.\n", 801 | "bboxes = np.array([\n", 802 | " [25, 25, 75, 75],\n", 803 | " [10, -205, 120, 20],\n", 804 | " [-35, 37, 38, 100],\n", 805 | " [-0.2, -0.2, 0.2, 0.2],\n", 806 | " [-25, -50, -5, -20],\n", 807 | "], dtype=np.float32)\n", 808 | "\n", 809 | "print(\n", 810 | " 'Round-trip looks good:',\n", 811 | " np.sum(np.abs(\n", 812 | " decode(anchor, encode(anchor, bboxes)) - bboxes\n", 813 | " )) < 1e-3\n", 814 | ")" 815 | ] 816 | }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": {}, 820 | "source": [ 821 | "If you have time left at the end, you could try to gain further intuition on what they do and what the encoding's edge cases and limitations are by looking at more examples and plotting the deltas as `bboxes` moves through the image." 822 | ] 823 | }, 824 | { 825 | "cell_type": "markdown", 826 | "metadata": {}, 827 | "source": [ 828 | "## Convolutional layers\n", 829 | "\n", 830 | "We now have a variable-size feature map (a factor of 16 times spatially smaller than the original image) and we want to predict, for each spatial position, how to modify (i.e. the $4$ values from above, $D_{x, y, w, h}$) each of the $k = 15$ anchors. In this context, it makes sense to use a convolutional layer (or more) on the feature map, where the final number of filters will be $4 \\times k$.\n", 831 | "\n", 832 | "For each of these anchors we'll also want to decide whether we think there's an object present on said region or not (thus, $2 \\times k$ more filters). This will, in essence, look at the activation maps we saw before and decide whether, in a given region, the activated features amount to an object being in there (e.g. many _cat face_ features have been activated, so there's probably an object in that region).\n", 833 | "\n", 834 | "As we saw in the slides, the RPN first has a $3\\times3$ convolutional layer with $512$ filters and then two outputs heads:\n", 835 | "* One with $2 \\times k$ filters for the **objectness score**.\n", 836 | "* One with $4 \\times k$ filters for the **encoded deltas**.\n", 837 | "\n", 838 | "Both will be implemented as $1 \\times 1$ convolutions in order to support variable-size images." 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": {}, 844 | "source": [ 845 | "### Programming task: implement `run_rpn` function (the forward pass of a RPN)." 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": null, 851 | "metadata": { 852 | "collapsed": true 853 | }, 854 | "outputs": [], 855 | "source": [ 856 | "# Note that when implementing Faster R-CNN for training, we should\n", 857 | "# also specify initializers and regularizers for the weights. 
We're\n", 858 | "# omitting them here for brevity.\n", 859 | "\n", 860 | "def run_rpn(feature_map):\n", 861 | " \"\"\"Run the RPN layers through the feature map.\n", 862 | " \n", 863 | " Will run the input through an initial convolutional layer of\n", 864 | " filter size 3x3 and 512 channels, using athe ReLU6 activation.\n", 865 | " The output of this layer has the same spatial size as the\n", 866 | " input.\n", 867 | " \n", 868 | " Then run two 1x1 convolutions over this intermediate layer, one\n", 869 | " for the resizings and one for the objectness probabilities.\n", 870 | " Remember to apply the softmax function over the objectness\n", 871 | " scores to get a probability distribution.\n", 872 | " \n", 873 | " Arguments:\n", 874 | " feature_map: Tensor of shape (1, W, H, C), with WxH the\n", 875 | " spatial shape of the feature map and C the number of\n", 876 | " channels (1024 in this case).\n", 877 | " \n", 878 | " Returns:\n", 879 | " Tuple of Tensors, with the first being the output of the bbox\n", 880 | " resizings `(W * H * num_anchors, 4)` while the second being\n", 881 | " the objectness score, of size `(W * H * num_anchors, 2)`.\n", 882 | " \"\"\"\n", 883 | " \n", 884 | " # Tip: Read the docstring thoroughly to help you pass the correct\n", 885 | " # parameters to the conv layers, especially padding (you want to\n", 886 | " # keep the *same* spatial size after the initial conv layer).\n", 887 | " \n", 888 | " # Tip: See the functions `tf.layers.conv2d` and `tf.reshape`. Also\n", 889 | " # see `tf.nn.softmax` for the softmax function.\n", 890 | " \n", 891 | " # The names of the layers should be: `rpn/conv` for the base layer,\n", 892 | " # `rpn/cls_conv` for the objectness score, and `rpn/bbox_conv` for\n", 893 | " # the bbox resizing.\n", 894 | " \n", 895 | " ####\n", 896 | " # Fill this function below, paying attention to the docstring.\n", 897 | " ####\n", 898 | " \n", 899 | " ####\n", 900 | "\n", 901 | " return rpn_bbox_pred, rpn_cls_prob\n", 902 | "\n", 903 | "\n", 904 | "with tfe.restore_variables_on_create('checkpoint/fasterrcnn'):\n", 905 | " rpn_bbox_pred, rpn_cls_prob = run_rpn(feature_map)\n", 906 | " \n", 907 | "\n", 908 | "expected_preds = (\n", 909 | " feature_map.shape[1]\n", 910 | " * feature_map.shape[2]\n", 911 | " * len(ANCHOR_RATIOS)\n", 912 | " * len(ANCHOR_SCALES)\n", 913 | ")\n", 914 | "\n", 915 | "assert rpn_bbox_pred.shape[0] == expected_preds, 'Number of proposals should match'\n", 916 | "assert rpn_cls_prob.shape[0] == expected_preds, 'Number of proposals should match'\n", 917 | "\n", 918 | "assert rpn_bbox_pred.shape[1] == 4, 'There should be one delta per bbox coordinate (i.e., four)'\n", 919 | "assert rpn_cls_prob.shape[1] == 2, 'The objectness score should have two outputs'" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "metadata": {}, 925 | "source": [ 926 | "## Generating proposals from the RPN output\n", 927 | "\n", 928 | "We now have the RPN layers outputs as-is. These will be the basis for *regions of interest* that will go through to the next stage of the object detection pipeline.\n", 929 | "\n", 930 | "Remember that the RPN layers outputs are the **encoded deltas**. So we need to get them back to image pixel space. When we do this, we can visualize what we have so far!\n", 931 | "\n", 932 | "First we decode the outputs of the RPN using our previously-implemented `decode` function, obtaining **proposals**. We also get a single-dimension **objectness score** for each of these proposals." 
933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "metadata": { 939 | "collapsed": true 940 | }, 941 | "outputs": [], 942 | "source": [ 943 | "# Generate proposals from the RPN's output by decoding the bounding boxes\n", 944 | "# according to the configured anchors.\n", 945 | "proposals = decode(anchors, rpn_bbox_pred)\n", 946 | "\n", 947 | "# Get the (positive-object) scores from the RPN.\n", 948 | "scores = tf.reshape(rpn_cls_prob[:, 1], [-1])" 949 | ] 950 | }, 951 | { 952 | "cell_type": "markdown", 953 | "metadata": {}, 954 | "source": [ 955 | "Keep in mind that we will have **more than 22k proposals** as output, and most will actually be garbage.\n", 956 | "\n", 957 | "In order to visualize what we have so far, we need to first sort them by score, and only keep those with the highest score.\n", 958 | "\n", 959 | "### Programming task: implement `keep_top_n` function." 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": null, 965 | "metadata": { 966 | "collapsed": true 967 | }, 968 | "outputs": [], 969 | "source": [ 970 | "def keep_top_n(proposals, scores, topn):\n", 971 | " \"\"\"Keeps only the top `topn` proposals, ordered by score.\n", 972 | " \n", 973 | " Arguments:\n", 974 | " proposals: Tensor of shape (num_proposals, 4), holding the\n", 975 | " coordinates of the proposals' bounding boxes.\n", 976 | " scores: Tensor of shape (num_proposals,), holding the\n", 977 | " scores associated to each bounding box.\n", 978 | " topn (int): Number of proposals to keep.\n", 979 | " \n", 980 | " Returns:\n", 981 | " (`min(num_proposals, topn)`, `scores`) ordered by score.\n", 982 | " \"\"\"\n", 983 | "\n", 984 | " # Tip: See `tf.minimum`, `tf.nn.top_k` to get the top values, and\n", 985 | " # `tf.gather` to select indices out of a Tensor.\n", 986 | " \n", 987 | " ####\n", 988 | " # Fill this function below, paying attention to the docstring.\n", 989 | " ####\n", 990 | " \n", 991 | " ####\n", 992 | " \n", 993 | " return sorted_top_proposals, sorted_top_scores" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "### Learn: play around with displaying a different number of proposals\n", 1001 | "\n", 1002 | "Then, answer the following questions:\n", 1003 | "\n", 1004 | "1. What do you see?\n", 1005 | "2. Why does it happen? Does it make sense?\n", 1006 | "3. What problem or problems do we have?\n", 1007 | "4. What would happen if we had initialized the network with random weights instead of using a pre-trained checkpoint?\n", 1008 | "5. How could we fix these issues?" 
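,
"\n",
"\n",
"(As a side note: if you haven't used `tf.nn.top_k` and `tf.gather`, the two ops suggested for `keep_top_n` above, the tiny self-contained sketch below shows how they behave on made-up values.)\n",
"\n",
"```python\n",
"# Toy example of the ops suggested for `keep_top_n`; the values are made up.\n",
"toy_scores = tf.constant([0.1, 0.9, 0.4, 0.7])\n",
"toy_boxes = tf.constant(\n",
"    [[0, 0, 1, 1], [0, 0, 2, 2], [0, 0, 3, 3], [0, 0, 4, 4]], dtype=tf.float32\n",
")\n",
"\n",
"# Keep at most 2 boxes, ordered by score.\n",
"top_k = tf.nn.top_k(toy_scores, k=tf.minimum(2, tf.shape(toy_scores)[0]))\n",
"print(top_k.values.numpy())                         # [0.9, 0.7]\n",
"print(tf.gather(toy_boxes, top_k.indices).numpy())  # Boxes at indices 1 and 3.\n",
"```"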
1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": {}, 1015 | "outputs": [], 1016 | "source": [ 1017 | "topn = 3000\n", 1018 | "\n", 1019 | "top_raw_proposals, top_raw_scores = keep_top_n(proposals, scores, topn)\n", 1020 | "# Display the first `topn` proposals, as ordered by score.\n", 1021 | "@interact(\n", 1022 | " topn=pager(500, 1, min=1, value=10, description='Number of proposals')\n", 1023 | ")\n", 1024 | "def draw(topn):\n", 1025 | " print('Minimum score: {:.2f}'.format(top_raw_scores[topn]))\n", 1026 | " return draw_bboxes(image, top_raw_proposals[:topn])" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "markdown", 1031 | "metadata": {}, 1032 | "source": [ 1033 | "### Learn: making sense of the RPN deltas\n", 1034 | "\n", 1035 | "Let's plot a histogram of the bounding box modifications (the deltas) for our current image.\n", 1036 | "\n", 1037 | "Look at the results. Do they make sense? Does it seem that the encoding is indeed helping unbias the predictions? What do values near zero mean?" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": null, 1043 | "metadata": {}, 1044 | "outputs": [], 1045 | "source": [ 1046 | "preds = rpn_bbox_pred.numpy()\n", 1047 | "\n", 1048 | "_, axes = plt.subplots(2, 2, figsize=(16, 12))\n", 1049 | "for idx, ax in enumerate(axes.ravel()):\n", 1050 | " title = ['D_x', 'D_y', 'D_w', 'D_h'][idx]\n", 1051 | " ax.set_title(title)\n", 1052 | " ax.hist(preds[:, idx], bins=50)\n", 1053 | " \n", 1054 | "plt.show()" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": {}, 1060 | "source": [ 1061 | "Let's also plot the objectness scores. As you'll see, most of the anchors are deemed not worthy." 1062 | ] 1063 | }, 1064 | { 1065 | "cell_type": "code", 1066 | "execution_count": null, 1067 | "metadata": {}, 1068 | "outputs": [], 1069 | "source": [ 1070 | "preds = rpn_cls_prob.numpy()[:, 1]\n", 1071 | "\n", 1072 | "_, ax = plt.subplots(1, figsize=(16, 6))\n", 1073 | "ax.set_title('Scores (0 = no object, 1 = object)')\n", 1074 | "ax.hist(preds, bins=100)\n", 1075 | "\n", 1076 | "print('{} predictions over 0.9, out of a total of {}'.format(\n", 1077 | " len(np.flatnonzero(preds > 0.9)), len(preds)\n", 1078 | "))\n", 1079 | "print()\n", 1080 | " \n", 1081 | "plt.show()" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "If you have some time left, it may prove insightful to analyze other statistics, such as the objectness and/or resizing by anchor size, or by position in the image. Performing an analysis like this can help pick hyperparameters, guide improvements for the algorithms and find pathologies on the architecture." 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "markdown", 1093 | "metadata": {}, 1094 | "source": [ 1095 | "## Filtering proposals\n", 1096 | "\n", 1097 | "As we saw above, it would be smart to implement a stage where we **filter** the proposals that we have, in order to perform better object detection in a later phase.\n", 1098 | "\n", 1099 | "* Some of the proposals may end up being invalid, as no constraints have been placed on the resizings (aside from the regularization induced by the encoding). For instance, we may end up with **zero-area proposals**, or with the extremes flipped. 
This may be especially true when we're training the algorithm from scratch (with randomly initialized weights), but we're going to filter them just in case.\n", 1100 | "* Many of the proposals may end up being **very similar to each other**. Due to this, we're going to apply an operation called **non-maximum suppression** to keep only those proposals that are most different to each other, and enable lower-score but more diverse proposals to get into our final set (to improve the quality of our detections).\n", 1101 | "\n", 1102 | "\n", 1103 | "First, we will plot the area per proposal in order to visualize how it is distributed, and see if we have some proposals with zero or negative area (in this case, negative area means that the bounding box extremes were flipped). As we said before, it is very much possible that since we're using fully-trained weights, no proposals with negative area are present. You might want to see the `encode` and `decode` functions you implemented above to see exactly when it can go negative." 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "code", 1108 | "execution_count": null, 1109 | "metadata": {}, 1110 | "outputs": [], 1111 | "source": [ 1112 | "props = proposals.numpy()\n", 1113 | "areas = (props[:, 2] - props[:, 0]) * (props[:, 3] - props[:, 1])\n", 1114 | "\n", 1115 | "_, ax = plt.subplots(1, figsize=(16, 6))\n", 1116 | "ax.set_title('Area per proposal')\n", 1117 | "ax.hist(areas, bins=100)\n", 1118 | "\n", 1119 | "plt.show()\n", 1120 | "\n", 1121 | "print('Proposals with areas under zero:')\n", 1122 | "print(np.flatnonzero(areas <= 0))" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "markdown", 1127 | "metadata": {}, 1128 | "source": [ 1129 | "### [Optional] Programming task: implement `filter_proposals` function." 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "code", 1134 | "execution_count": null, 1135 | "metadata": { 1136 | "collapsed": true 1137 | }, 1138 | "outputs": [], 1139 | "source": [ 1140 | "# N.B.: You might as well skip this step if you're running out of time and\n", 1141 | "# there are no proposals with area under zero, but beware that on a real\n", 1142 | "# implementation ignoring this will cause trouble.\n", 1143 | "\n", 1144 | "def filter_proposals(proposals, scores):\n", 1145 | " \"\"\"Filters non-positive area proposals.\n", 1146 | " \n", 1147 | " Arguments:\n", 1148 | " proposals: Tensor of shape (num_proposals, 4), holding the\n", 1149 | " coordinates of the proposals' bounding boxes.\n", 1150 | " scores: Tensor of shape (num_proposals,), holding the\n", 1151 | " scores associated to each bounding box.\n", 1152 | " \n", 1153 | " Returns:\n", 1154 | " (`proposals`, `scores`), but with non-positive area proposals removed.\n", 1155 | " \"\"\"\n", 1156 | " \n", 1157 | " # Tip: see `tf.greater`, `tf.maximum`, `tf.boolean_mask`.\n", 1158 | " \n", 1159 | " ####\n", 1160 | " # Fill this function below, paying attention to the docstring.\n", 1161 | " ####\n", 1162 | " \n", 1163 | " ####\n", 1164 | "\n", 1165 | " return proposals, scores\n", 1166 | "\n", 1167 | "\n", 1168 | "# Filter proposals with negative areas.\n", 1169 | "proposals, scores = filter_proposals(proposals, scores)" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": {}, 1175 | "source": [ 1176 | "## Removing redundancy: non-maximum supression" 1177 | ] 1178 | }, 1179 | { 1180 | "cell_type": "markdown", 1181 | "metadata": {}, 1182 | "source": [ 1183 | "Now we're going to use non-maximum suppression on the list of proposals we have. 
The end result will be a reduced list of proposals (in fact, of size `POST_NMS_TOP_N` defined below), ordered by objectness score, with some redundancy removed (that is, proposals that are too similar to each other will be discarded).\n", 1184 | "\n", 1185 | "As explained in [1], NMS greedily selects a subset of bounding boxes in descending order of score, pruning away boxes that have high intersection-over-union (IOU) [2] overlap with previously selected boxes.\n", 1186 | "\n", 1187 | "We'll be using `NMS_THRESHOLD` as the **IOU overlap threshold**. Also, in order to speed up the NMS (as we may have tens of thousands of proposals, depending on the image size), we'll first limit our proposal list to the top `PRE_NMS_TOP_N` proposals ordered by score.\n", 1188 | "\n", 1189 | "We'll use an already-implemented Tensorflow function for NMS itself. While this avoids the need to code the algorithm, we need to prepare the parameters correctly to feed it.\n", 1190 | "\n", 1191 | "You can read more about non-maximum suppression [here](https://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/) and [here](https://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/).\n", 1192 | "\n", 1193 | "* [1] https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression\n", 1194 | "* [2] https://en.wikipedia.org/wiki/Jaccard_index" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "metadata": { 1201 | "collapsed": true 1202 | }, 1203 | "outputs": [], 1204 | "source": [ 1205 | "# Limit of the initial proposal list, to reduce the number of proposals fed to\n", 1206 | "# non-maximum suppression.\n", 1207 | "PRE_NMS_TOP_N = 12000\n", 1208 | "\n", 1209 | "# We will use the `keep_top_n` function that you have implemented before!\n", 1210 | "proposals, scores = keep_top_n(proposals, scores, PRE_NMS_TOP_N)" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "markdown", 1215 | "metadata": {}, 1216 | "source": [ 1217 | "With the proposals pre-filtered, let's now apply NMS." 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "markdown", 1222 | "metadata": {}, 1223 | "source": [ 1224 | "### Programming task: implement `apply_nms` function, together with the `change_order` helper function which will be useful for you." 
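,
"\n",
"\n",
"If you want to see `tf.image.non_max_suppression` in isolation first, here is a tiny, self-contained sketch on made-up boxes (note the (y_min, x_min, y_max, x_max) order it expects, which is exactly what `change_order` is for):\n",
"\n",
"```python\n",
"# Two heavily-overlapping boxes plus one separate box, already in TF's\n",
"# (y_min, x_min, y_max, x_max) order, with made-up scores.\n",
"toy_boxes = tf.constant([\n",
"    [0, 0, 10, 10],\n",
"    [0, 0, 10, 11],   # IoU with the first box is ~0.91, above the threshold.\n",
"    [50, 50, 60, 60],\n",
"], dtype=tf.float32)\n",
"toy_scores = tf.constant([0.9, 0.8, 0.6])\n",
"\n",
"selected = tf.image.non_max_suppression(\n",
"    toy_boxes, toy_scores, max_output_size=3,\n",
"    iou_threshold=0.7,  # Same value as NMS_THRESHOLD in the next cell.\n",
")\n",
"print(selected.numpy())  # [0, 2]: the second box gets suppressed.\n",
"```"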
1225 | ]
1226 | },
1227 | {
1228 | "cell_type": "code",
1229 | "execution_count": null,
1230 | "metadata": {
1231 | "collapsed": true
1232 | },
1233 | "outputs": [],
1234 | "source": [
1235 | "# Final maximum number of proposals, as returned by NMS.\n",
1236 | "POST_NMS_TOP_N = 2000\n",
1237 | "\n",
1238 | "# IOU overlap threshold for the NMS procedure.\n",
1239 | "NMS_THRESHOLD = 0.7\n",
1240 | "\n",
1241 | "\n",
1242 | "# You might find the following function useful for re-ordering the coordinates\n",
1243 | "# as expected by Tensorflow.\n",
1244 | "def change_order(bboxes):\n",
1245 | "    \"\"\"Change bounding box encoding order.\n",
1246 | "\n",
1247 | "    Tensorflow works with the (y_min, x_min, y_max, x_max) order, while we work\n",
1248 | "    with (x_min, y_min, x_max, y_max).\n",
1249 | "\n",
1250 | "    While both encoding options have their advantages and disadvantages, we\n",
1251 | "    decided to use (x_min, y_min, x_max, y_max), forcing us to switch to\n",
1252 | "    Tensorflow's order every time we want to use a function that handles\n",
1253 | "    bounding boxes.\n",
1254 | "\n",
1255 | "    Arguments:\n",
1256 | "        bboxes: A Tensor of shape (total_bboxes, 4).\n",
1257 | "\n",
1258 | "    Returns:\n",
1259 | "        bboxes: A Tensor of shape (total_bboxes, 4) with the order swapped.\n",
1260 | "    \"\"\"\n",
1261 | "    \n",
1262 | "    # Tip: see `tf.unstack`, `tf.stack`.\n",
1263 | "\n",
1264 | "    ####\n",
1265 | "    # Fill this function below, paying attention to the docstring.\n",
1266 | "    ####\n",
1267 | "    \n",
1268 | "    ####\n",
1269 | "    \n",
1270 | "    return bboxes\n",
1271 | "\n",
1272 | "\n",
1273 | "def apply_nms(proposals, scores):\n",
1274 | "    \"\"\"Applies non-maximum suppression to proposals.\n",
1275 | "    \n",
1276 | "    Arguments:\n",
1277 | "        proposals: Tensor of shape (num_proposals, 4), holding the\n",
1278 | "            coordinates of the proposals' bounding boxes.\n",
1279 | "        scores: Tensor of shape (num_proposals,), holding the\n",
1280 | "            scores associated to each bounding box.\n",
1281 | "    \n",
1282 | "    Returns:\n",
1283 | "        (`proposals`, `scores`), but with NMS applied, and ordered by score.\n",
1284 | "    \"\"\"\n",
1285 | "    \n",
1286 | "    # Tip: See `tf.image.non_max_suppression` to perform NMS, our `change_order`\n",
1287 | "    # to prepare the bounding boxes, and `tf.gather` to pick indices out of a\n",
1288 | "    # Tensor.\n",
1289 | "    \n",
1290 | "    ####\n",
1291 | "    # Fill this function below, paying attention to the docstring.\n",
1292 | "    ####\n",
1293 | "\n",
1294 | "    ####\n",
1295 | "\n",
1296 | "    return proposals, scores\n",
1297 | "\n",
1298 | "pre_merge_proposals, pre_merge_scores = proposals, scores\n",
1299 | "proposals, scores = apply_nms(proposals, scores)"
1300 | ]
1301 | },
1302 | {
1303 | "cell_type": "markdown",
1304 | "metadata": {},
1305 | "source": [
1306 | "## Learn: what have we detected? Play around with NMS.\n",
1307 | "\n",
1308 | "Let's take a look at our current results, so we can understand what we have to work with.\n",
1309 | "\n",
1310 | "1. How is it different from before?\n",
1311 | "2. Is the importance of something like NMS clearer now?"
1312 | ] 1313 | }, 1314 | { 1315 | "cell_type": "code", 1316 | "execution_count": null, 1317 | "metadata": {}, 1318 | "outputs": [], 1319 | "source": [ 1320 | "# Display the first `topn` proposals, as ordered by score.\n", 1321 | "@interact(\n", 1322 | " nms=Checkbox(value=True, description='Apply NMS'),\n", 1323 | " topn=pager(200, 1, min=1, value=10, description='Number of proposals')\n", 1324 | ")\n", 1325 | "def draw(nms, topn):\n", 1326 | " if nms:\n", 1327 | " p = proposals\n", 1328 | " s = scores\n", 1329 | " else:\n", 1330 | " p = pre_merge_proposals\n", 1331 | " s = pre_merge_scores\n", 1332 | " \n", 1333 | " print('Minimum score: {:.2f}'.format(s[topn]))\n", 1334 | " return draw_bboxes(image, p[:topn])" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "markdown", 1339 | "metadata": {}, 1340 | "source": [ 1341 | "Let's check how the center positions have changed pre- and post- merging of proposals when restricted to the first $2000$ proposals. After applying NMS, we should have improved our coverage of the image somewhat." 1342 | ] 1343 | }, 1344 | { 1345 | "cell_type": "code", 1346 | "execution_count": null, 1347 | "metadata": {}, 1348 | "outputs": [], 1349 | "source": [ 1350 | "top_k = tf.nn.top_k(pre_merge_scores, k=proposals.shape[0])\n", 1351 | "props = tf.gather(pre_merge_proposals, top_k.indices).numpy()\n", 1352 | "\n", 1353 | "pre_merge_centers = np.stack([\n", 1354 | " (props[:, 0] + props[:, 2]) / 2,\n", 1355 | " (props[:, 1] + props[:, 3]) / 2,\n", 1356 | "], axis=1)\n", 1357 | "\n", 1358 | "post_merge_centers = np.stack([\n", 1359 | " (proposals[:, 0] + proposals[:, 2]) / 2,\n", 1360 | " (proposals[:, 1] + proposals[:, 3]) / 2,\n", 1361 | "], axis=1)\n", 1362 | "\n", 1363 | "_, axes = plt.subplots(2, 2, figsize=(16, 8))\n", 1364 | "axes[0][0].set_title('x-axis center positions pre-merge')\n", 1365 | "axes[0][0].hist(pre_merge_centers[:, 0], bins=40)\n", 1366 | "axes[0][1].set_title('y-axis center positions pre-merge')\n", 1367 | "axes[0][1].hist(pre_merge_centers[:, 1], bins=40)\n", 1368 | "axes[1][0].set_title('x-axis center positions post-merge')\n", 1369 | "axes[1][0].hist(post_merge_centers[:, 0], bins=40)\n", 1370 | "axes[1][1].set_title('y-axis center positions post-merge')\n", 1371 | "axes[1][1].hist(post_merge_centers[:, 1], bins=40)\n", 1372 | "\n", 1373 | "plt.show()" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "markdown", 1378 | "metadata": {}, 1379 | "source": [ 1380 | "## RPN Conclusions\n", 1381 | "\n", 1382 | "This concludes the work on the Region Proposal Network! We now have a mechanism to, given an image of arbitrary size, return **regions of interest** (i.e. proposals), where it looks like an object is present.\n", 1383 | "\n", 1384 | "Having gone through all the steps, from generating anchors around the image to predicting and filtering proposals, we now have a list of `POST_NMS_TOP_N` proposals (two thousand, in this case), each with an objectness score assigned.\n", 1385 | "\n", 1386 | "Two thousand proposals are, of course, many more than what we need. Also, we need to assign an actual class to each of these proposals, or discard them if they're not correct. That will be attacked by the rest of our object detection pipeline." 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "markdown", 1391 | "metadata": {}, 1392 | "source": [ 1393 | "---\n", 1394 | "# Standardizing proposals: Region of Interest Pooling\n", 1395 | "\n", 1396 | "So far, we have obtained regions of interest for an arbitrarily-sized input image. Thousands of them. 
And all of them of a different size. As you've probably seen in the last visualization, some of them may be very small while others very big.\n", 1397 | "\n", 1398 | "The objective of this stage is twofold:\n", 1399 | "1. Get the proposals, defined in **pixel-space coordinates**, back to the **feature maps**.\n", 1400 | "2. Get them all into a **fixed size** so they can later be fed into a fully-connected neural network.\n", 1401 | "\n", 1402 | "This final size of each region of interest will be $7\\times7\\times1024$.\n", 1403 | "> * $1024$ is the number of filters that our feature map has.\n", 1404 | "> * $7\\times7$ corresponds to the common spatial size all proposals will have.\n", 1405 | ">\n", 1406 | "> This implies, as you might notice, that the aspect ratio of the proposals will change.\n", 1407 | "\n", 1408 | "On the original Faster R-CNN paper, a technique called RoI pooling is used. Here, instead, we use the `tf.image.crop_and_resize` Tensorflow function, which is (in performance terms) almost equivalent but simpler to implement.\n", 1409 | "\n", 1410 | "Also, bear in mind that the RoI pooling layer **first** resizes to *double* of the pooling size (i.e. gets regions of $14\\times14$) and then uses max pooling to get the final $7\\times7$ regions. This makes the resulting regions more smooth and makes them capture more details. One could even go further and resize to $28\\times28$ or more, but since we're **making a copy** of the feature map, memory usage will rapidly go up (trade-offs...).\n", 1411 | "\n", 1412 | "So much for an introduction. The implementation should be relatively straightforward. So go on ahead!" 1413 | ] 1414 | }, 1415 | { 1416 | "cell_type": "markdown", 1417 | "metadata": {}, 1418 | "source": [ 1419 | "### Programming task: implement `roi_pooling` function and the `normalize_bboxes` helper." 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": null, 1425 | "metadata": { 1426 | "collapsed": true 1427 | }, 1428 | "outputs": [], 1429 | "source": [ 1430 | "def normalize_bboxes(proposals, im_shape):\n", 1431 | " \"\"\"\n", 1432 | " Gets normalized coordinates for RoIs (between 0 and 1 for cropping)\n", 1433 | " in TensorFlow's order (y1, x1, y2, x2).\n", 1434 | "\n", 1435 | " Arguments:\n", 1436 | " roi_proposals: A Tensor with the bounding boxes of shape\n", 1437 | " (total_proposals, 4), where the values for each proposal are\n", 1438 | " (x_min, y_min, x_max, y_max).\n", 1439 | " im_shape: A Tensor with the shape of the image (height, width).\n", 1440 | "\n", 1441 | " Returns:\n", 1442 | " bboxes: A Tensor with normalized bounding boxes in TensorFlow's\n", 1443 | " format order. Its should is (total_proposals, 4).\n", 1444 | " \"\"\"\n", 1445 | " \n", 1446 | " # See `tf.unstack`, `tf.stack`, `tf.cast`.\n", 1447 | " \n", 1448 | " ####\n", 1449 | " # Fill this function below, paying attention to the docstring.\n", 1450 | " ####\n", 1451 | "\n", 1452 | " ####\n", 1453 | "\n", 1454 | " return bboxes\n", 1455 | "\n", 1456 | "\n", 1457 | "\n", 1458 | "def roi_pooling(feature_map, proposals, im_shape, pool_size=7):\n", 1459 | " \"\"\"Perform RoI pooling.\n", 1460 | "\n", 1461 | " This is a simplified method than what's done in the paper that obtains\n", 1462 | " similar results. We crop the proposal over the feature map and resize it\n", 1463 | " bilinearly.\n", 1464 | " \n", 1465 | " This function first resizes to *double* of `pool_size` (i.e. 
gets\n", 1466 | " regions of (pool_size * 2, pool_size * 2)) and then uses max pooling to\n", 1467 | " get the final `(pool_size, pool_size)` regions.\n", 1468 | " \n", 1469 | " Arguments:\n", 1470 | " feature_map: Tensor of shape (1, W, H, C), with WxH the spatial\n", 1471 | " shape of the feature map and C the number of channels (1024\n", 1472 | " in this case).\n", 1473 | " proposals: Tensor of shape (total_proposals, 4), holding the proposals\n", 1474 | " to perform RoI pooling on.\n", 1475 | " im_shape: A Tensor with the shape of the image (height, width).\n", 1476 | " pool_size (int): Final width/height of the pooled region.\n", 1477 | " \n", 1478 | " Returns:\n", 1479 | " Pooled feature map, with shape `(num_proposals, pool_size, pool_size,\n", 1480 | " feature_map_channels)`.\n", 1481 | " \"\"\"\n", 1482 | " \n", 1483 | " # Tip: See `tf.image.crop_and_resize` to get crops out of the feature map\n", 1484 | " # and resize them.\n", 1485 | " \n", 1486 | " # Tip: You can **ignore the `box_ind` argument** by passing an array of the\n", 1487 | " # correct size filled with zeros (one per proposal). This is because we are\n", 1488 | " # using batch size of one.\n", 1489 | " \n", 1490 | " # Tip: Remember to resize to `2 * pool_size` first.\n", 1491 | " \n", 1492 | " # Tip: Remember to perform the max pooling as described above, by using\n", 1493 | " # the `tf.nn.max_pool` function.\n", 1494 | " \n", 1495 | " # N.B.: You can resize to `(pool_size, pool_size)` directly and avoid the\n", 1496 | " # max pooling step, though the results *will* be inferior.\n", 1497 | " \n", 1498 | " ####\n", 1499 | " # Fill this function below, paying attention to the docstring.\n", 1500 | " ####\n", 1501 | "\n", 1502 | " ####\n", 1503 | "\n", 1504 | " return pooled\n", 1505 | "\n", 1506 | "\n", 1507 | "pooled = roi_pooling(feature_map, proposals, (image.shape[1], image.shape[2]))" 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "markdown", 1512 | "metadata": {}, 1513 | "source": [ 1514 | "In order to gain an intuition on what exactly is being done here, let's now visualize our **pooled regions of interest**, along with the image patches they come from." 
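For reference, here is a possible sketch for the `normalize_bboxes` and `roi_pooling` stubs above, built on `tf.image.crop_and_resize` as the tips suggest. The normalization details (dividing by the raw image height and width) are an assumption; the reference solution may differ slightly.

```python
# A possible sketch for both stubs above.

def normalize_bboxes(proposals, im_shape):
    """Normalize proposals to [0, 1] in TensorFlow's (y1, x1, y2, x2) order."""
    proposals = tf.cast(proposals, tf.float32)
    height = tf.cast(im_shape[0], tf.float32)
    width = tf.cast(im_shape[1], tf.float32)

    x_min, y_min, x_max, y_max = tf.unstack(proposals, axis=1)
    bboxes = tf.stack(
        [y_min / height, x_min / width, y_max / height, x_max / width],
        axis=1
    )
    return bboxes


def roi_pooling(feature_map, proposals, im_shape, pool_size=7):
    """Crop and resize proposals into fixed (pool_size, pool_size) regions."""
    bboxes = normalize_bboxes(proposals, im_shape)

    # All proposals come from the single image in our batch, so `box_ind`
    # is just a vector of zeros (one per proposal).
    box_ind = tf.zeros_like(bboxes[:, 0], dtype=tf.int32)

    # First crop and bilinearly resize to twice the pooling size...
    crops = tf.image.crop_and_resize(
        feature_map, bboxes, box_ind,
        [pool_size * 2, pool_size * 2]
    )

    # ...then max-pool down to the final (pool_size, pool_size) shape.
    pooled = tf.nn.max_pool(
        crops, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID'
    )

    return pooled
```

The intermediate `2 * pool_size` crop followed by max pooling is what keeps a bit more detail than resizing directly to `pool_size`, at the cost of a larger temporary copy of the cropped features.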
1515 | ] 1516 | }, 1517 | { 1518 | "cell_type": "code", 1519 | "execution_count": null, 1520 | "metadata": { 1521 | "scrolled": false 1522 | }, 1523 | "outputs": [], 1524 | "source": [ 1525 | "pool = pooled.numpy()\n", 1526 | "\n", 1527 | "# Pool the images too to visualize, but using a higher pooling size so\n", 1528 | "# we don't lose too much resolution.\n", 1529 | "image_crops = roi_pooling(\n", 1530 | " image, proposals,\n", 1531 | " (image.shape[1], image.shape[2]),\n", 1532 | " pool_size=140\n", 1533 | ").numpy().astype(np.uint8)\n", 1534 | "\n", 1535 | "\n", 1536 | "@interact(\n", 1537 | " fm_idx=pager(pool.shape[-1], 1, 'Feature map index'),\n", 1538 | " im_idx=pager(pool.shape[0], 25, 'Proposals')\n", 1539 | ")\n", 1540 | "def display_pooled_proposal(fm_idx=0, im_idx=0):\n", 1541 | " axes = image_grid(25, 5, sizes=(3, 3))\n", 1542 | " \n", 1543 | " for idx, ax in enumerate(axes):\n", 1544 | " if im_idx * 25 + idx >= pool.shape[0]:\n", 1545 | " break\n", 1546 | " \n", 1547 | " fm = (\n", 1548 | " pool[idx, :, :, fm_idx]\n", 1549 | " / pool[idx, :, :, fm_idx].max()\n", 1550 | " * 255\n", 1551 | " ).astype(np.uint8)\n", 1552 | " \n", 1553 | " # Get the pooled image regions.\n", 1554 | " img = image_crops[im_idx * 25 + idx, ...]\n", 1555 | " \n", 1556 | " fm_image = Image.fromarray(fm, mode='L').convert('RGBA')\n", 1557 | " fm_image = fm_image.resize(img.shape[0:2][::-1], resample=Image.NEAREST)\n", 1558 | "\n", 1559 | " # Add some alpha to overlay it over the image.\n", 1560 | " fm_image.putalpha(120)\n", 1561 | "\n", 1562 | " base_image = Image.fromarray(img)\n", 1563 | " base_image.paste(fm_image, (0, 0), fm_image)\n", 1564 | " \n", 1565 | " ax.imshow(base_image, aspect='auto')\n", 1566 | "\n", 1567 | " plt.subplots_adjust(wspace=.02, hspace=.02)\n", 1568 | " plt.show()" 1569 | ] 1570 | }, 1571 | { 1572 | "cell_type": "markdown", 1573 | "metadata": {}, 1574 | "source": [ 1575 | "---\n", 1576 | "# Using the proposals: Region-CNN\n", 1577 | "\n", 1578 | "We're ready for the final stage! Here we'll be doing two things:\n", 1579 | "* Running our set of fixed-sized proposals through a network akin to what was done in the RPN: one input, two outputs. In this case, instead of an objectness score, the output will be a class score (plus a possible **background** score).\n", 1580 | "* Get these thousands of proposals into a reasonable number. We'll be performing NMS again, but this time per class.\n", 1581 | "\n", 1582 | "As you can see, it looks like more of the same, which (save some details) it effectively is." 1583 | ] 1584 | }, 1585 | { 1586 | "cell_type": "code", 1587 | "execution_count": null, 1588 | "metadata": {}, 1589 | "outputs": [], 1590 | "source": [ 1591 | "# We're finally ready to perform the classification, so load the class names.\n", 1592 | "with open('checkpoint/classes.json') as f:\n", 1593 | " classes = json.load(f)\n", 1594 | " \n", 1595 | "print(classes)" 1596 | ] 1597 | }, 1598 | { 1599 | "cell_type": "markdown", 1600 | "metadata": {}, 1601 | "source": [ 1602 | "## The classification network\n", 1603 | "\n", 1604 | "As we mentioned before, this last stage will get the proposals through a fully-connected layer. However, before doing that, we'll perform a bit more feature extraction.\n", 1605 | "\n", 1606 | "You might remember when you were implementing the RPN that we used the first three out of four blocks of the ResNet for feature extraction, discarding the final block. 
The reasoning behind this move was that the final block four should, in principle, detect more *abstract features* than block three, which makes it better suited for classifying a region than for proposing one. Now we're ready to perform the final classification, so we will first pass our proposals (which are crops, albeit resized, of the original feature map) through **the block four of the ResNet**.\n",
1607 |     "\n",
1608 |     "Also, once we do this, since we've already extracted all the features we care about, we'll perform **Global Average Pooling**, which essentially means averaging out the spatial information: we only care whether a feature was present somewhere in the proposal, not where exactly. That leaves us with a single vector per proposal.\n",
1609 |     "\n",
1610 |     "Finally, we pass this fixed-length vector through two fully-connected layers: one for the bounding box resizings and one for the classes. Since we've already used the block four of the ResNet, we won't add an intermediate layer.\n",
1611 |     "\n",
1612 |     "> ### Dimensions sanity check helper\n",
1613 |     "> In the case of using 2000 proposals and the ResNet as we did before:\n",
1614 |     ">\n",
1615 |     "> 1. `proposals` should be of shape $(2000, 7, 7, 1024)$.\n",
1616 |     "> 2. After running through the ResNet tail, we should get something of shape $(2000, 7, 7, 2048)$.\n",
1617 |     "> 3. After Global Average Pooling, we should condense the spatial dimensions and get $(2000, 2048)$.\n",
1618 |     "> 4. Class scores should be $(2000, 81)$ (80 classes + background).\n",
1619 |     "> 5. Box regression scores should be $(2000, 320)$ ($80 \\times 4$, since we have 4 coordinates for each box).\n"
1620 |    ]
1621 |   },
1622 |   {
1623 |    "cell_type": "code",
1624 |    "execution_count": null,
1625 |    "metadata": {},
1626 |    "outputs": [],
1627 |    "source": [
1628 |     "print(run_resnet_tail.__doc__)"
1629 |    ]
1630 |   },
1631 |   {
1632 |    "cell_type": "markdown",
1633 |    "metadata": {},
1634 |    "source": [
1635 |     "### Programming task: implement `run_rcnn` function."
1636 | ] 1637 | }, 1638 | { 1639 | "cell_type": "code", 1640 | "execution_count": null, 1641 | "metadata": { 1642 | "collapsed": true 1643 | }, 1644 | "outputs": [], 1645 | "source": [ 1646 | "def run_rcnn(pooled, num_classes):\n", 1647 | " \"\"\"Run the RCNN layers through the pooled features.\n", 1648 | "\n", 1649 | " This directly applies a fully-connected layer from `features`\n", 1650 | " to the two outputs we want: a class probability (plus the\n", 1651 | " background class) and the bounding box resizings (one per\n", 1652 | " class).\n", 1653 | " \n", 1654 | " In order to obtain the class probability, we apply a softmax\n", 1655 | " over the scores obtained from the dense layer, similar to the RPN.\n", 1656 | " \n", 1657 | " Arguments:\n", 1658 | " pooled: Pooled feature map, with shape `(num_proposals,\n", 1659 | " pool_size, pool_size, feature_map_channels)`.\n", 1660 | " num_classes: Number of classes for the R-CNN.\n", 1661 | " \n", 1662 | " Returns:\n", 1663 | " Tuple of Tensors, with the first being the output of the\n", 1664 | " bbox resizings `(W * H * proposals, 4)` and the second being\n", 1665 | " the class scores, of size `(pool_size ^ 2 * proposals,\n", 1666 | " num_classes)`.\n", 1667 | " \"\"\"\n", 1668 | " \n", 1669 | " # Remember, you need to do three things with `pooled`:\n", 1670 | " # * Pass them through the ResNet block four.\n", 1671 | " # (Tip: See the function `run_resnet_tail`s docstring above.)\n", 1672 | " # * Perform Global Average Pooling.\n", 1673 | " # (Tip: See the function `tf.reduce_mean`.)\n", 1674 | " # * Run them through two fully-connected layers.\n", 1675 | " # (Tip: See the functions `tf.layers.dense`, `tf.nn.softmax`.)\n", 1676 | " \n", 1677 | " # W.r.t the fully-connected layers, remember:\n", 1678 | " # * To add an extra class for the background class.\n", 1679 | " # * To have bounding-box resizings **per-class**.\n", 1680 | " \n", 1681 | " # The names of the layers should be: `rcnn/fc_classifier` for\n", 1682 | " # the classification head, and `rcnn/fc_bbox` for the bbox\n", 1683 | " # resizing head.\n", 1684 | " \n", 1685 | " ####\n", 1686 | " # Fill this function below, paying attention to the docstring.\n", 1687 | " ####\n", 1688 | "\n", 1689 | " ####\n", 1690 | "\n", 1691 | " return rcnn_bbox, rcnn_cls_prob\n", 1692 | "\n", 1693 | "\n", 1694 | "with tfe.restore_variables_on_create('checkpoint/fasterrcnn'):\n", 1695 | " bbox_pred, cls_prob = run_rcnn(pooled, len(classes))\n", 1696 | " \n", 1697 | " \n", 1698 | "assert bbox_pred.shape[0] == pooled.shape[0], 'Number of proposals should match'\n", 1699 | "assert cls_prob.shape[0] == pooled.shape[0], 'Number of proposals should match'\n", 1700 | "\n", 1701 | "assert bbox_pred.shape[1] == len(classes) * 4, 'There should be 4 bbox resizings per class'\n", 1702 | "assert cls_prob.shape[1] == len(classes) + 1, 'There should be 81 class probabilities (remember the background!)'" 1703 | ] 1704 | }, 1705 | { 1706 | "cell_type": "markdown", 1707 | "metadata": {}, 1708 | "source": [ 1709 | "### Learn: play around with proposals and their corresponding class determined by R-CNN" 1710 | ] 1711 | }, 1712 | { 1713 | "cell_type": "markdown", 1714 | "metadata": {}, 1715 | "source": [ 1716 | "Let's take a look at the results now, and see whether they make sense.\n", 1717 | "\n", 1718 | "We'll display the **most probable class for each proposal**, before applying the final class-specific resizing: this is the pooled region of interest, what the classifier actually looked at." 
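For reference, below is a sketch of one way the `run_rcnn` stub above could be filled in, following its tips. The layer names match the ones required so the pre-trained checkpoint can be restored; the rest of the structure is just one reasonable choice, not necessarily identical to the reference solution.

```python
# One possible way to fill in the stub above.

def run_rcnn(pooled, num_classes):
    # Extract deeper features: run the pooled proposals through the last
    # ResNet block, going from (N, 7, 7, 1024) to (N, 7, 7, 2048).
    features = run_resnet_tail(pooled)

    # Global Average Pooling: average over the two spatial dimensions,
    # leaving a single 2048-dimensional vector per proposal.
    features = tf.reduce_mean(features, axis=[1, 2])

    # Classification head: one score per class, plus one for the background.
    cls_score = tf.layers.dense(
        features, num_classes + 1, name='rcnn/fc_classifier'
    )
    rcnn_cls_prob = tf.nn.softmax(cls_score)

    # Bounding box head: 4 resizing values per (non-background) class.
    rcnn_bbox = tf.layers.dense(
        features, num_classes * 4, name='rcnn/fc_bbox'
    )

    return rcnn_bbox, rcnn_cls_prob
```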
1719 | ] 1720 | }, 1721 | { 1722 | "cell_type": "code", 1723 | "execution_count": null, 1724 | "metadata": {}, 1725 | "outputs": [], 1726 | "source": [ 1727 | "output_classes = ['background'] + classes\n", 1728 | " \n", 1729 | "preds = np.argmax(cls_prob.numpy(), axis=1)\n", 1730 | "\n", 1731 | "@interact(page=pager(len(preds), 20, 'Proposals'))\n", 1732 | "def display_predictions(page):\n", 1733 | " axes = image_grid(20, 5, sizes=(3, 3))\n", 1734 | " \n", 1735 | " for idx, ax in enumerate(axes):\n", 1736 | " if 20 * page + idx >= image_crops.shape[0]:\n", 1737 | " break\n", 1738 | " \n", 1739 | " ax.imshow(image_crops[20 * page + idx, ...], aspect='auto')\n", 1740 | " ax.set_title(output_classes[preds[20 * page + idx]])\n", 1741 | "\n", 1742 | " plt.subplots_adjust(wspace=.02, hspace=.15)\n", 1743 | " plt.show()" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "markdown", 1748 | "metadata": {}, 1749 | "source": [ 1750 | "### Learn: play around with bounding box resizings (object \"refinements\")" 1751 | ] 1752 | }, 1753 | { 1754 | "cell_type": "markdown", 1755 | "metadata": {}, 1756 | "source": [ 1757 | "We'll now display the proposals again, but with the **bounding box resizings applied**.\n", 1758 | "\n", 1759 | "Corrections are done **per-class**. In order to understand how much these predictions vary, we take some proposals and apply the different possible resizings.\n", 1760 | "\n", 1761 | "For each region, we first display the resizing for the most probable class and then for three other random classes. If the most probable class is background, we ignore it.\n", 1762 | "\n", 1763 | "Do you notice anything in particular? Which resizing is the one that fits better to the detected object?" 1764 | ] 1765 | }, 1766 | { 1767 | "cell_type": "code", 1768 | "execution_count": null, 1769 | "metadata": {}, 1770 | "outputs": [], 1771 | "source": [ 1772 | "# Target normalization variances to adjust the output of the R-CNN so it trains better.\n", 1773 | "TARGET_VARIANCES = np.array([0.1, 0.1, 0.2, 0.2], dtype=np.float32)\n", 1774 | "\n", 1775 | "# We only consider proposals for which the most-probable class was non-background.\n", 1776 | "preds = np.argmax(cls_prob.numpy(), axis=1)\n", 1777 | "non_background = (preds != 0)\n", 1778 | "\n", 1779 | "non_bg_preds = preds[non_background]\n", 1780 | "non_bg_proposals = proposals.numpy()[non_background]\n", 1781 | "non_bg_bboxes = bbox_pred.numpy()[non_background]\n", 1782 | "non_bg_count = len(np.flatnonzero(non_background))\n", 1783 | "\n", 1784 | "\n", 1785 | "@interact(page=pager(non_bg_count, 3, 'Proposals'))\n", 1786 | "def display_resizings(page):\n", 1787 | " _, axes = plt.subplots(3, 5, figsize=(16, 10))\n", 1788 | " \n", 1789 | " for row_idx, cols in enumerate(axes):\n", 1790 | " for col in cols:\n", 1791 | " col.axis('off')\n", 1792 | "\n", 1793 | " proposal_idx = 3 * page + row_idx \n", 1794 | " if proposal_idx >= non_bg_count:\n", 1795 | " continue\n", 1796 | " \n", 1797 | " # Original region.\n", 1798 | " # (Using original region size so comparison is easier to the eye.)\n", 1799 | " x_min, y_min, x_max, y_max = clip_boxes(\n", 1800 | " non_bg_proposals[proposal_idx:proposal_idx + 1],\n", 1801 | " image.shape[1:3]\n", 1802 | " )[0].numpy().astype(np.int)\n", 1803 | " cols[0].imshow(image[0, y_min:y_max, x_min:x_max, :])\n", 1804 | " cols[0].set_title('Region')\n", 1805 | " \n", 1806 | " # Per-class region, correct class first.\n", 1807 | " class_ids = np.concatenate([\n", 1808 | " np.array([non_bg_preds[proposal_idx] - 1]),\n", 1809 | " 
np.random.randint(0, len(classes), 3)\n", 1810 | " ])\n", 1811 | " for col, class_id in zip(cols[1:], class_ids):\n", 1812 | " cls_bbox_pred = non_bg_bboxes[\n", 1813 | " proposal_idx:proposal_idx + 1,\n", 1814 | " (4 * class_id):(4 * class_id + 4)\n", 1815 | " ]\n", 1816 | "\n", 1817 | " cls_objects = decode(\n", 1818 | " non_bg_proposals[proposal_idx:proposal_idx+1],\n", 1819 | " cls_bbox_pred * TARGET_VARIANCES\n", 1820 | " ).numpy()\n", 1821 | " \n", 1822 | " x_min, y_min, x_max, y_max = clip_boxes(\n", 1823 | " cls_objects, image.shape[1:3]\n", 1824 | " )[0].numpy().astype(np.int)\n", 1825 | "\n", 1826 | " col.imshow(image[0, y_min:y_max, x_min:x_max, :])\n", 1827 | " col.set_title(classes[class_id])\n", 1828 | " \n", 1829 | " plt.subplots_adjust(wspace=.02, hspace=.15)\n", 1830 | " plt.show()" 1831 | ] 1832 | }, 1833 | { 1834 | "cell_type": "markdown", 1835 | "metadata": {}, 1836 | "source": [ 1837 | "## Filtering the object candidates\n", 1838 | "\n", 1839 | "We're finally getting there! We have one last step to do: getting the final predictions.\n", 1840 | "\n", 1841 | "What we have now is a list of `POST_NMS_TOP_N` proposals (around $2000$), each with 81 class scores (80 classes plus the background) and 80 bounding box resizings. Out of this, we'll have a total of $2000 \\times 80 = 160000$ candidate objects, each with a **score** (its class score) and **bounding box resizing**. We do this in order to consider **all** possible object classifications for a given region proposal: if the most-probable class of a bounding box has a score of $0.48$ and the second one has a score of $0.47$, it is important to consider **both** variants, and not just the highest-scored one.\n", 1842 | "\n", 1843 | "Of course, we'll not really build the $160000$ proposals at once, but instead perform NMS **on a class-by-class basis**, keeping only the top $100$ proposals per class. Then we'll order all the proposals and keep only the top $300$, which will be the output of our algorithm.\n", 1844 | "\n", 1845 | "**We've already implemented all this part for you**, as it's same as above but on a class-by-class basis. With a little work, you should be able to do it by yourself if you want." 
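Although this part is provided for you (see `rcnn_proposals` in `workshop/faster.py`), the paragraph above maps almost directly to code. The following is a compact sketch of its essence; the name `filter_object_candidates` is just illustrative, it reuses the notebook's `decode`, `change_order` and `clip_boxes` plus the `TARGET_VARIANCES` defined earlier, and it omits the minimum-probability and valid-area filtering that the full version performs.

```python
# Rough outline of the per-class filtering described above.

def filter_object_candidates(proposals, bbox_pred, cls_prob, im_shape,
                             num_classes, class_max_detections=100,
                             total_max_detections=300):
    selected_boxes, selected_probs, selected_labels = [], [], []

    for class_id in range(num_classes):
        # Class-specific score and bounding box resizing for every proposal.
        class_prob = cls_prob[:, class_id + 1]  # Index 0 is the background.
        class_deltas = bbox_pred[:, (4 * class_id):(4 * class_id + 4)]

        # Apply the resizings and clip the resulting boxes to the image.
        class_objects = clip_boxes(
            decode(proposals, class_deltas * TARGET_VARIANCES), im_shape
        )

        # Per-class NMS, keeping at most `class_max_detections` boxes.
        selected = tf.image.non_max_suppression(
            change_order(class_objects), class_prob, class_max_detections,
            iou_threshold=0.5,
        )

        selected_boxes.append(tf.gather(class_objects, selected))
        selected_probs.append(tf.gather(class_prob, selected))
        selected_labels.append(tf.tile([class_id], [tf.shape(selected)[0]]))

    objects = tf.concat(selected_boxes, axis=0)
    probs = tf.concat(selected_probs, axis=0)
    labels = tf.concat(selected_labels, axis=0)

    # Finally, keep only the overall top `total_max_detections` by score.
    k = tf.minimum(total_max_detections, tf.shape(probs)[0])
    top_k = tf.nn.top_k(probs, k=k)
    return (
        tf.gather(objects, top_k.indices),
        tf.gather(labels, top_k.indices),
        top_k.values,
    )
```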
1846 |    ]
1847 |   },
1848 |   {
1849 |    "cell_type": "code",
1850 |    "execution_count": null,
1851 |    "metadata": {
1852 |     "collapsed": true
1853 |    },
1854 |    "outputs": [],
1855 |    "source": [
1856 |     "objects, labels, probs = rcnn_proposals(\n",
1857 |     "    proposals, bbox_pred, cls_prob, image.shape[1:3], 80,\n",
1858 |     "    min_prob_threshold=0.0,\n",
1859 |     ")\n",
1860 |     "\n",
1861 |     "objects = objects.numpy()\n",
1862 |     "labels = labels.numpy()\n",
1863 |     "probs = probs.numpy()"
1864 |    ]
1865 |   },
1866 |   {
1867 |    "cell_type": "code",
1868 |    "execution_count": null,
1869 |    "metadata": {},
1870 |    "outputs": [],
1871 |    "source": [
1872 |     "print('Number of detections above 0.1 probability:', len(labels[probs > 0.1]))\n",
1873 |     "print('Number of detections above 0.5 probability:', len(labels[probs > 0.5]))\n",
1874 |     "print('Number of detections above 0.7 probability:', len(labels[probs > 0.7]))\n",
1875 |     "print()\n",
1876 |     "\n",
1877 |     "# Accumulated probability score per class.\n",
1878 |     "probs_per_class = np.bincount(labels, weights=probs, minlength=len(classes))\n",
1879 |     "top_n = probs_per_class.argsort()[::-1][:5]\n",
1880 |     "print('Top 5 predicted classes:')\n",
1881 |     "for cls_idx in top_n:\n",
1882 |     "    print('  {}: {} ({:.2f})'.format(cls_idx, classes[cls_idx], probs_per_class[cls_idx]))\n",
1883 |     " \n",
1884 |     "_, ax = plt.subplots(1, figsize=(16, 4))\n",
1885 |     "ax.bar(np.arange(80), probs_per_class)\n",
1886 |     "\n",
1887 |     "plt.show()"
1888 |    ]
1889 |   },
1890 |   {
1891 |    "cell_type": "markdown",
1892 |    "metadata": {},
1893 |    "source": [
1894 |     "At the end of the day, however, we don't want $300$ detections regardless of quality; we only want the good ones. What we do in practice is to filter detections by their probability score (which is the candidate object's class score).\n",
1895 |     "\n",
1896 |     "Let's take a look at the $300$ candidate objects and see how much the predictions change when filtering by score. In the real world, you'll probably use a threshold above $0.5$, depending on your desired precision vs. recall trade-off (too high a threshold will miss detections, while too low will add noise)."
1897 | ] 1898 | }, 1899 | { 1900 | "cell_type": "code", 1901 | "execution_count": null, 1902 | "metadata": {}, 1903 | "outputs": [], 1904 | "source": [ 1905 | "slider = FloatSlider(\n", 1906 | " min=0.0, max=1.0, step=0.01, value=0.7,\n", 1907 | " description='Probability threshold',\n", 1908 | " layout=Layout(width='600px'),\n", 1909 | " style={'description_width': 'initial'},\n", 1910 | " continuous_update=False\n", 1911 | ")\n", 1912 | "\n", 1913 | "@interact(prob=slider)\n", 1914 | "def display_objects(prob):\n", 1915 | " MAX_TO_DRAW = 50\n", 1916 | "\n", 1917 | " mask = probs > prob\n", 1918 | "\n", 1919 | " return draw_bboxes_with_labels(\n", 1920 | " image, classes,\n", 1921 | " objects[mask][:MAX_TO_DRAW],\n", 1922 | " labels[mask][:MAX_TO_DRAW],\n", 1923 | " probs[mask][:MAX_TO_DRAW],\n", 1924 | " )" 1925 | ] 1926 | }, 1927 | { 1928 | "cell_type": "markdown", 1929 | "metadata": {}, 1930 | "source": [ 1931 | "---\n", 1932 | "# Summing up\n", 1933 | "\n", 1934 | "**Congratulations!** You finished your own implementation of Faster R-CNN, one of the state-of-the-art object detection algorithms.\n", 1935 | "\n", 1936 | "Throughout this notebook you should have learned quite a few things:\n", 1937 | "* **How modern object detectors work**: what inputs they take, what kinds of operations and logic they do, and how much control we have in their workings.\n", 1938 | "* In particular, **how Faster R-CNN works**, very much in depth.\n", 1939 | "* **How to use Tensorflow and numpy in the context of computer vision and object detection**. Going a bit more than the usual \"stack three layers and call it a day\": we've worked with arbitrarily-sized inputs, used conditionals, filtering and other non-standard functions, all within the Tensorflow graph (meaning it can run entirely within a GPU).\n", 1940 | "* **How to visualize the inner workings of an object detection pipeline**. By leveraging an already-trained network, we could see and corroborate each step of the pipeline to understand what goes behind the scenes and whether we made any errors. This process may have also given you some clues in how to improve the algorithm itself.\n", 1941 | "\n", 1942 | "This, of course, is just the beginning. Some things you could try, going forward, are:\n", 1943 | "* **Implement the training of a Faster R-CNN model**. We barely touched on this part, using a pre-trained checkpoint provided by us. The training, apart from using the autograd of your favorite deep learning library, requires some extra steps:\n", 1944 | "\n", 1945 | " * Implement the **targets**. We're using supervised learning to train this, so how does training data fit into this? We need to train both the RPN and the R-CNN. In order to do this, we need to build mini-batches of training data for both components, by matching ground-truth boxes to our proposals. You can learn how this is done in our implementation of [Luminoth](https://github.com/tryolabs/luminoth).\n", 1946 | "\n", 1947 | " * Implement the **loss functions**. Once the targets are in place, we need to select good losses and balance our dataset in order to train correctly. There may be some difficulties when training, as we are optimizing four losses in total (two for the RPN, two for the R-CNN).\n", 1948 | "\n", 1949 | "* Improve the algorithm itself. Faster-RCNN has been out for a while now, and while it's still very competitive, there are some known improvements to do. 
For instance, the RPN can be replaced entirely with the **Feature Pyramid Network** (FPN) [1] and the loss exchanged with the **Focal Loss** [2], to obtain the algorithm called **RetinaNet**.\n", 1950 | "\n", 1951 | "\n", 1952 | "* [1] https://arxiv.org/pdf/1612.03144.pdf\n", 1953 | "* [2] https://arxiv.org/pdf/1708.02002.pdf" 1954 | ] 1955 | } 1956 | ], 1957 | "metadata": { 1958 | "kernelspec": { 1959 | "display_name": "Python 3", 1960 | "language": "python", 1961 | "name": "python3" 1962 | }, 1963 | "language_info": { 1964 | "codemirror_mode": { 1965 | "name": "ipython", 1966 | "version": 3 1967 | }, 1968 | "file_extension": ".py", 1969 | "mimetype": "text/x-python", 1970 | "name": "python", 1971 | "nbconvert_exporter": "python", 1972 | "pygments_lexer": "ipython3", 1973 | "version": "3.6.7" 1974 | } 1975 | }, 1976 | "nbformat": 4, 1977 | "nbformat_minor": 2 1978 | } 1979 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Workshop: Object detection with Deep Learning 2 | 3 | ## Hands-on 1: Implementing Faster R-CNN (almost) from scratch 4 | 5 | Consists of a Jupyter Notebook that will guide you to complete the implementation of a 6 | Faster R-CNN model for object detection. 7 | 8 | ### Instructions 9 | 10 | 1. Clone the repository. 11 | 2. Execute the `download_checkpoint.sh` script from the base directory in order to 12 | download the pre-trained checkpoint: 13 | 14 | ```bash 15 | cd object-detection-workshop 16 | ./download_checkpoint.sh 17 | ``` 18 | 19 | 3. Install the auxiliary library `workshop`: 20 | 21 | ```bash 22 | pip install -e . 23 | ``` 24 | 25 | 4. Run Jupyter Notebook from the base directory: 26 | 27 | ```bash 28 | jupyter notebook 29 | ``` 30 | 31 | 5. Open the notebook **Implementing Faster R-CNN.ipynb**. 32 | 33 | 6. Read the instructions carefully and complete everything :) 34 | 35 | ## Hands on 2: using Luminoth for real world object detection 36 | 37 | Guided demo of [Luminoth](http://luminoth.ai/) toolkit, which will teach you the most 38 | important functionalities and how to train your own models. 39 | 40 | [Access the Luminoth tutorial](https://luminoth.readthedocs.io/en/latest/tutorial/index.html). 41 | 42 | --- 43 | 44 | Copyright © 2018, [Tryolabs](https://tryolabs.com/). 
-------------------------------------------------------------------------------- /download_checkpoint.sh: -------------------------------------------------------------------------------- 1 | wget 'http://object-detection-workshop.s3-website-us-east-1.amazonaws.com/checkpoint.tar.gz' 2 | tar xzvf checkpoint.tar.gz 3 | rm checkpoint.tar.gz 4 | -------------------------------------------------------------------------------- /images/bicycles.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/bicycles.jpg -------------------------------------------------------------------------------- /images/cats.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/cats.jpg -------------------------------------------------------------------------------- /images/horse.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/horse.jpg -------------------------------------------------------------------------------- /images/kids.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/kids.jpg -------------------------------------------------------------------------------- /images/kittens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/kittens.png -------------------------------------------------------------------------------- /images/woman.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/images/woman.jpg -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='workshop', 5 | version='0.0.1', 6 | packages=find_packages(), 7 | install_requires=[ 8 | 'numpy', 9 | 'tensorflow', 10 | 'click', 11 | 'ipython', 12 | 'ipdb', 13 | 'jupyter', 14 | 'matplotlib', 15 | 'Pillow', 16 | ], 17 | ) 18 | -------------------------------------------------------------------------------- /workshop/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/workshop/__init__.py -------------------------------------------------------------------------------- /workshop/faster.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | from workshop.resnet import resnet_v1_101, resnet_v1_101_tail 5 | 6 | 7 | _R_MEAN = 123.68 8 | _G_MEAN = 116.78 9 | _B_MEAN = 103.94 10 | 11 | 12 | OUTPUT_STRIDE = 16 13 | 14 | CLASS_NMS_THRESHOLD = 0.5 15 | TOTAL_MAX_DETECTIONS = 300 16 | 17 | 18 | def sort_anchors(anchors): 19 | """Sort the anchor references aspect 
ratio first, then area.""" 20 | widths = anchors[:, 2] - anchors[:, 0] 21 | heights = anchors[:, 3] - anchors[:, 1] 22 | 23 | aspect_ratios = np.round(heights / widths, 1) 24 | areas = widths * heights 25 | 26 | return anchors[np.lexsort((areas, aspect_ratios)), :] 27 | 28 | 29 | def generate_anchors_reference(base_size, aspect_ratios, scales): 30 | """Generate base set of anchors to be used as reference for all anchors. 31 | 32 | Anchors vary only in width and height. Using the base_size and the 33 | different ratios we can calculate the desired widths and heights. 34 | 35 | Aspect ratios maintain the area of the anchors, while scales apply to the 36 | length of it (and thus affect it squared). 37 | 38 | Arguments: 39 | base_size (int): Base size of the base anchor (square). 40 | aspect_ratios: Ratios to use to generate different anchors. The ratio 41 | is the value of height / width. 42 | scales: Scaling ratios applied to length. 43 | 44 | Returns: 45 | anchors: Numpy array with shape (total_aspect_ratios * total_scales, 4) 46 | with the corner points of the reference base anchors using the 47 | convention (x_min, y_min, x_max, y_max). 48 | """ 49 | scales_grid, aspect_ratios_grid = np.meshgrid(scales, aspect_ratios) 50 | base_scales = scales_grid.reshape(-1) 51 | base_aspect_ratios = aspect_ratios_grid.reshape(-1) 52 | 53 | aspect_ratio_sqrts = np.sqrt(base_aspect_ratios) 54 | heights = base_scales * aspect_ratio_sqrts * base_size 55 | widths = base_scales / aspect_ratio_sqrts * base_size 56 | 57 | # Center point has the same X, Y value. 58 | center_xy = 0 59 | 60 | # Create anchor reference. 61 | anchors = np.column_stack([ 62 | center_xy - widths / 2, 63 | center_xy - heights / 2, 64 | center_xy + widths / 2, 65 | center_xy + heights / 2, 66 | ]) 67 | 68 | # references = generate_anchors_reference( 69 | # 256, # Base size. 70 | # [0.5, 1, 2], # Aspect ratios. 71 | # [0.125, 0.25, 0.5, 1, 2], # Scales. 72 | # ) 73 | 74 | # print('Anchor references (real image size):') 75 | # print() 76 | # print(references) 77 | 78 | # # We should have obtained 5 areas and 3 different aspect ratios in our 79 | # # anchor references. 80 | # widths = references[:, 2] - references[:, 0] 81 | # heights = references[:, 3] - references[:, 1] 82 | 83 | # aspect_ratios = np.round(heights / widths, 1) 84 | # areas = widths * heights 85 | 86 | # assert len(np.unique(areas)) == 5 87 | # assert len(np.unique(aspect_ratios)) == 3 88 | 89 | # print('Areas:', len(np.unique(areas))) 90 | # print('Aspect ratios:', len(np.unique(aspect_ratios))) 91 | 92 | # We sort the anchors to the value expected by our pre-trained network. 93 | return sort_anchors(anchors) 94 | 95 | 96 | def change_order(bboxes): 97 | first_min, second_min, first_max, second_max = tf.unstack( 98 | bboxes, axis=1 99 | ) 100 | bboxes = tf.stack( 101 | [second_min, first_min, second_max, first_max], axis=1 102 | ) 103 | return bboxes 104 | 105 | 106 | def get_width_upright(bboxes): 107 | bboxes = tf.cast(bboxes, tf.float32) 108 | x1, y1, x2, y2 = tf.split(bboxes, 4, axis=1) 109 | width = x2 - x1 110 | height = y2 - y1 111 | 112 | # Calculate up right point of bbox (urx = up right x) 113 | urx = x1 + .5 * width 114 | ury = y1 + .5 * height 115 | 116 | return width, height, urx, ury 117 | 118 | 119 | def clip_boxes(bboxes, imshape): 120 | """ 121 | Clips bounding boxes to image boundaries based on image shape. 122 | 123 | Args: 124 | bboxes: Tensor with shape (num_bboxes, 4) 125 | where point order is x1, y1, x2, y2. 
126 | 127 | imshape: Tensor with shape (2, ) 128 | where the first value is height and the next is width. 129 | 130 | Returns 131 | Tensor with same shape as bboxes but making sure that none 132 | of the bboxes are outside the image. 133 | """ 134 | bboxes = tf.cast(bboxes, dtype=tf.float32) 135 | imshape = tf.cast(imshape, dtype=tf.float32) 136 | 137 | x1, y1, x2, y2 = tf.split(bboxes, 4, axis=1) 138 | width = imshape[1] 139 | height = imshape[0] 140 | x1 = tf.maximum(tf.minimum(x1, width - 1.0), 0.0) 141 | x2 = tf.maximum(tf.minimum(x2, width - 1.0), 0.0) 142 | 143 | y1 = tf.maximum(tf.minimum(y1, height - 1.0), 0.0) 144 | y2 = tf.maximum(tf.minimum(y2, height - 1.0), 0.0) 145 | 146 | bboxes = tf.concat([x1, y1, x2, y2], axis=1) 147 | 148 | return bboxes 149 | 150 | 151 | def run_base_network(inputs): 152 | """Obtain the feature map for an input image.""" 153 | # Pre-process inputs as required by the Resnet (just substracting means). 154 | means = tf.constant([_R_MEAN, _G_MEAN, _B_MEAN], dtype=tf.float32) 155 | processed_inputs = inputs - means 156 | 157 | _, endpoints = resnet_v1_101( 158 | processed_inputs, 159 | training=False, 160 | global_pool=False, 161 | output_stride=OUTPUT_STRIDE, 162 | ) 163 | 164 | feature_map = endpoints['resnet_v1_101/block3'] 165 | 166 | return feature_map 167 | 168 | 169 | def run_resnet_tail(inputs): 170 | """Pass `inputs` through the last block of the Resnet. 171 | 172 | Arguments: 173 | inputs: Tensor of shape (total_proposals, pool_size, pool_size, 1024), 174 | the result of the RoI pooling layer. 175 | 176 | Returns: 177 | Tensor of shape (total_proposals, pool_size, pool_size, 2048), with the 178 | output of the final block. 179 | """ 180 | return resnet_v1_101_tail(inputs)[0] 181 | 182 | 183 | def decode(roi, deltas): 184 | ( 185 | roi_width, roi_height, roi_urx, roi_ury 186 | ) = get_width_upright(roi) 187 | 188 | dx, dy, dw, dh = tf.split(deltas, 4, axis=1) 189 | 190 | pred_ur_x = dx * roi_width + roi_urx 191 | pred_ur_y = dy * roi_height + roi_ury 192 | pred_w = tf.exp(dw) * roi_width 193 | pred_h = tf.exp(dh) * roi_height 194 | 195 | bbox_x1 = pred_ur_x - 0.5 * pred_w 196 | bbox_y1 = pred_ur_y - 0.5 * pred_h 197 | 198 | bbox_x2 = pred_ur_x + 0.5 * pred_w 199 | bbox_y2 = pred_ur_y + 0.5 * pred_h 200 | 201 | bboxes = tf.concat([ 202 | bbox_x1, bbox_y1, bbox_x2, bbox_y2 203 | ], axis=1) 204 | 205 | return bboxes 206 | 207 | 208 | def rcnn_proposals(proposals, bbox_pred, cls_prob, im_shape, num_classes, 209 | min_prob_threshold=0.0, class_max_detections=100): 210 | """ 211 | Args: 212 | proposals: Tensor with the RPN proposals bounding boxes. 213 | Shape (num_proposals, 4). Where num_proposals is less than 214 | POST_NMS_TOP_N (We don't know exactly beforehand) 215 | bbox_pred: Tensor with the RCNN delta predictions for each proposal 216 | for each class. Shape (num_proposals, 4 * num_classes) 217 | cls_prob: A softmax probability for each proposal where the idx = 0 218 | is the background class (which we should ignore). 219 | Shape (num_proposals, num_classes + 1) 220 | 221 | Returns: 222 | objects: 223 | Shape (final_num_proposals, 4) 224 | Where final_num_proposals is unknown before-hand (it depends on 225 | NMS). The 4-length Tensor for each corresponds to: 226 | (x_min, y_min, x_max, y_max). 
227 | objects_label: 228 | Shape (final_num_proposals,) 229 | objects_label_prob: 230 | Shape (final_num_proposals,) 231 | 232 | """ 233 | selected_boxes = [] 234 | selected_probs = [] 235 | selected_labels = [] 236 | 237 | TARGET_VARIANCES = np.array([0.1, 0.1, 0.2, 0.2]) 238 | 239 | # For each class, take the proposals with the class-specific 240 | # predictions (class scores and bbox regression) and filter accordingly 241 | # (valid area, min probability score and NMS). 242 | for class_id in range(num_classes): 243 | # Apply the class-specific transformations to the proposals to 244 | # obtain the current class' prediction. 245 | class_prob = cls_prob[:, class_id + 1] # 0 is background class. 246 | class_bboxes = bbox_pred[:, (4 * class_id):(4 * class_id + 4)] 247 | raw_class_objects = decode( 248 | proposals, 249 | class_bboxes * TARGET_VARIANCES, 250 | ) 251 | 252 | # Clip bboxes so they don't go out of the image. 253 | class_objects = clip_boxes(raw_class_objects, im_shape) 254 | 255 | # Filter objects based on the min probability threshold and on them 256 | # having a valid area. 257 | prob_filter = tf.greater_equal(class_prob, min_prob_threshold) 258 | 259 | (x_min, y_min, x_max, y_max) = tf.unstack(class_objects, axis=1) 260 | area_filter = tf.greater( 261 | tf.maximum(x_max - x_min, 0.0) 262 | * tf.maximum(y_max - y_min, 0.0), 263 | 0.0 264 | ) 265 | 266 | object_filter = tf.logical_and(area_filter, prob_filter) 267 | 268 | class_objects = tf.boolean_mask(class_objects, object_filter) 269 | class_prob = tf.boolean_mask(class_prob, object_filter) 270 | 271 | # We have to use the TensorFlow's bounding box convention to use 272 | # the included function for NMS. 273 | class_objects_tf = change_order(class_objects) 274 | 275 | # Apply class NMS. 276 | class_selected_idx = tf.image.non_max_suppression( 277 | class_objects_tf, class_prob, class_max_detections, 278 | iou_threshold=CLASS_NMS_THRESHOLD 279 | ) 280 | 281 | # Using NMS resulting indices, gather values from Tensors. 282 | class_objects_tf = tf.gather(class_objects_tf, class_selected_idx) 283 | class_prob = tf.gather(class_prob, class_selected_idx) 284 | 285 | # Revert to our bbox convention. 286 | class_objects = change_order(class_objects_tf) 287 | 288 | # We append values to a regular list which will later be 289 | # transformed to a proper Tensor. 290 | selected_boxes.append(class_objects) 291 | selected_probs.append(class_prob) 292 | # In the case of the class_id, since it is a loop on classes, we 293 | # already have a fixed class_id. We use `tf.tile` to create that 294 | # Tensor with the total number of indices returned by the NMS. 295 | selected_labels.append( 296 | tf.tile([class_id], [tf.shape(class_selected_idx)[0]]) 297 | ) 298 | 299 | # We use concat (axis=0) to generate a Tensor where the rows are 300 | # stacked on top of each other 301 | objects = tf.concat(selected_boxes, axis=0) 302 | proposal_label = tf.concat(selected_labels, axis=0) 303 | proposal_label_prob = tf.concat(selected_probs, axis=0) 304 | 305 | # Get top-k detections of all classes. 
306 | k = tf.minimum( 307 | TOTAL_MAX_DETECTIONS, 308 | tf.shape(proposal_label_prob)[0] 309 | ) 310 | 311 | top_k = tf.nn.top_k(proposal_label_prob, k=k) 312 | top_k_proposal_label_prob = top_k.values 313 | top_k_objects = tf.gather(objects, top_k.indices) 314 | top_k_proposal_label = tf.gather(proposal_label, top_k.indices) 315 | 316 | return ( 317 | top_k_objects, 318 | top_k_proposal_label, 319 | top_k_proposal_label_prob, 320 | ) 321 | -------------------------------------------------------------------------------- /workshop/image.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | 4 | from PIL import Image 5 | 6 | 7 | def open_all_images(path): 8 | """Opens all images located in `path`. 9 | 10 | Returns: 11 | Dictionary that maps the base name (with no extension) to the 12 | `numpy.ndarray` corresponding to the image. 13 | """ 14 | images = {} 15 | for filename in os.listdir(path): 16 | curr_path = os.path.join(path, filename) 17 | if not os.path.isfile(curr_path): 18 | continue 19 | 20 | name, _ = os.path.splitext(os.path.basename(curr_path)) 21 | images[name] = open_image(curr_path) 22 | 23 | return images 24 | 25 | 26 | def open_image(path): 27 | path = os.path.expanduser(path) 28 | raw_image = Image.open(path) 29 | image = np.expand_dims(raw_image.convert('RGB'), axis=0) 30 | return image 31 | 32 | 33 | def to_image(image_array): 34 | return Image.fromarray(np.squeeze(image_array, axis=0)) 35 | -------------------------------------------------------------------------------- /workshop/io.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tryolabs/object-detection-workshop/7e002c649daa673dd17e72826921bd0323d1eecf/workshop/io.py -------------------------------------------------------------------------------- /workshop/resnet.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | from collections import namedtuple 4 | 5 | 6 | class Block(namedtuple('Block', ['scope', 'unit_fn', 'args'])): 7 | """A named tuple describing a ResNet block. 8 | 9 | Its parts are: 10 | scope: The scope of the `Block`. 11 | unit_fn: The ResNet unit function which takes as input a `Tensor` and 12 | returns another `Tensor` with the output of the ResNet unit. 13 | args: A list of length equal to the number of units in the `Block`. The 14 | list contains one (depth, depth_bottleneck, stride) tuple for each 15 | unit in the block to serve as argument to unit_fn. 16 | """ 17 | 18 | 19 | def subsample(inputs, factor, scope=None): 20 | """Subsamples the input along the spatial dimensions. 21 | 22 | Args: 23 | inputs: A `Tensor` of size [batch, height_in, width_in, channels]. 24 | factor: The subsampling factor. 25 | scope: Optional variable_scope. 26 | 27 | Returns: 28 | output: A `Tensor` of size [batch, height_out, width_out, channels] with 29 | the input, either intact (if factor == 1) or subsampled (if factor > 30 | 1). 31 | """ 32 | if factor == 1: 33 | return inputs 34 | 35 | with tf.variable_scope(scope): 36 | return tf.layers.max_pooling2d( 37 | inputs, [1, 1], strides=factor, padding='same' 38 | ) 39 | 40 | 41 | def bottleneck(inputs, depth, depth_bottleneck, stride, rate=1, 42 | outputs_collections=None, scope=None): 43 | """Bottleneck residual unit variant with BN after convolutions. 44 | 45 | This is the original residual unit proposed in [1]. See Fig. 1(a) of [2] 46 | for its definition. 
Note that we use here the bottleneck variant which has 47 | an extra bottleneck layer. 48 | 49 | When putting together two consecutive ResNet blocks that use this unit, one 50 | should use stride = 2 in the last unit of the first block. 51 | 52 | Args: 53 | inputs: A tensor of size [batch, height, width, channels]. 54 | depth: The depth of the ResNet unit output. 55 | depth_bottleneck: The depth of the bottleneck layers. 56 | stride: The ResNet unit's stride. Determines the amount of downsampling 57 | of the units output compared to its input. 58 | rate: An integer, rate for atrous convolution. 59 | outputs_collections: Collection to add the ResNet unit output. 60 | scope: Optional variable_scope. 61 | 62 | Returns: 63 | The ResNet unit's output. 64 | """ 65 | with tf.variable_scope(scope, 'bottleneck_v1', [inputs]) as scope: 66 | depth_in = inputs.get_shape()[-1].value 67 | if depth == depth_in: 68 | shortcut = subsample(inputs, stride, 'shortcut') 69 | else: 70 | with tf.variable_scope('shortcut'): 71 | pre_shortcut = tf.layers.conv2d( 72 | inputs, depth, [1, 1], strides=stride, use_bias=False, 73 | padding='same', 74 | ) 75 | shortcut = tf.layers.batch_normalization( 76 | pre_shortcut, momentum=0.997, epsilon=1e-5, training=False, 77 | fused=False 78 | ) 79 | 80 | with tf.variable_scope('conv1'): 81 | residual = tf.layers.conv2d( 82 | inputs, depth_bottleneck, [1, 1], strides=1, use_bias=False, 83 | padding='same', 84 | ) 85 | residual = tf.layers.batch_normalization( 86 | residual, momentum=0.997, epsilon=1e-5, training=False, 87 | fused=False 88 | ) 89 | residual = tf.nn.relu(residual) 90 | 91 | with tf.variable_scope('conv2'): 92 | residual = conv2d_same( 93 | residual, depth_bottleneck, 3, strides=stride, 94 | dilation_rate=rate, 95 | ) 96 | residual = tf.layers.batch_normalization( 97 | residual, momentum=0.997, epsilon=1e-5, training=False, 98 | fused=False 99 | ) 100 | residual = tf.nn.relu(residual) 101 | 102 | with tf.variable_scope('conv3'): 103 | residual = tf.layers.conv2d( 104 | residual, depth, [1, 1], strides=1, use_bias=False, 105 | padding='same', 106 | ) 107 | residual = tf.layers.batch_normalization( 108 | residual, momentum=0.997, epsilon=1e-5, training=False, 109 | fused=False 110 | ) 111 | 112 | output = tf.nn.relu(shortcut + residual) 113 | 114 | return output 115 | 116 | 117 | def resnet_v1_block(scope, base_depth, num_units, stride): 118 | """Helper function for creating a resnet_v1 bottleneck block. 119 | 120 | Args: 121 | scope: The scope of the block. 122 | base_depth: The depth of the bottleneck layer for each unit. 123 | num_units: The number of units in the block. 124 | stride: The stride of the block, implemented as a stride in the last 125 | unit. All other units have stride=1. 126 | 127 | Returns: 128 | A resnet_v1 bottleneck block. 129 | """ 130 | return Block(scope, bottleneck, [{ 131 | 'depth': base_depth * 4, 132 | 'depth_bottleneck': base_depth, 133 | 'stride': 1 134 | }] * (num_units - 1) + [{ 135 | 'depth': base_depth * 4, 136 | 'depth_bottleneck': base_depth, 137 | 'stride': stride 138 | }]) 139 | 140 | 141 | def conv2d_same(inputs, filters, kernel_size, strides, dilation_rate=1): 142 | """Strided 2-D convolution with 'SAME' padding. 143 | 144 | When stride > 1, then we do explicit zero-padding, followed by conv2d with 145 | 'VALID' padding. 
146 | 147 | Note that 148 | 149 | net = conv2d_same(inputs, num_outputs, 3, stride=stride) 150 | 151 | is equivalent to 152 | 153 | net = tf.contrib.layers.conv2d(inputs, num_outputs, 3, stride=1, 154 | padding='SAME') 155 | net = subsample(net, factor=stride) 156 | 157 | whereas 158 | 159 | net = tf.contrib.layers.conv2d(inputs, num_outputs, 3, stride=stride, 160 | padding='SAME') 161 | 162 | is different when the input's height or width is even, which is why we add 163 | the current function. For more details, see 164 | ResnetUtilsTest.testConv2DSameEven(). 165 | 166 | Args: 167 | inputs: A 4-D tensor of size [batch, height_in, width_in, channels]. 168 | filters: An integer, the number of output filters. 169 | kernel_size: An int with the kernel_size of the filters. 170 | strides: An integer, the output strides. 171 | dilation_rate: An integer, rate for atrous convolution. 172 | 173 | Returns: 174 | output: A 4-D tensor of size [batch, height_out, width_out, channels] 175 | with the convolution output. 176 | """ 177 | if strides == 1: 178 | return tf.layers.conv2d( 179 | inputs, filters, kernel_size, strides=1, use_bias=False, 180 | dilation_rate=dilation_rate, padding='same', 181 | ) 182 | else: 183 | kernel_size_effective = ( 184 | kernel_size + (kernel_size - 1) * (dilation_rate - 1) 185 | ) 186 | pad_total = kernel_size_effective - 1 187 | pad_beg = pad_total // 2 188 | pad_end = pad_total - pad_beg 189 | inputs = tf.pad( 190 | inputs, [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]] 191 | ) 192 | return tf.layers.conv2d( 193 | inputs, filters, kernel_size, strides=strides, use_bias=False, 194 | dilation_rate=dilation_rate, padding='valid', 195 | ) 196 | 197 | 198 | def stack_blocks_dense(net, blocks, output_stride=None): 199 | """Stacks ResNet `Blocks` and controls output feature density. 200 | 201 | First, this function creates scopes for the ResNet in the form of 202 | 'block_name/unit_1', 'block_name/unit_2', etc. 203 | 204 | Second, this function allows the user to explicitly control the ResNet 205 | output_stride, which is the ratio of the input to output spatial 206 | resolution. This is useful for dense prediction tasks such as semantic 207 | segmentation or object detection. 208 | 209 | Most ResNets consist of 4 ResNet blocks and subsample the activations by a 210 | factor of 2 when transitioning between consecutive ResNet blocks. This 211 | results to a nominal ResNet output_stride equal to 8. If we set the 212 | output_stride to half the nominal network stride (e.g., output_stride=4), 213 | then we compute responses twice. 214 | 215 | Control of the output feature density is implemented by atrous convolution. 216 | 217 | Args: 218 | net: A `Tensor` of size [batch, height, width, channels]. 219 | blocks: A list of length equal to the number of ResNet `Blocks`. Each 220 | element is a ResNet `Block` object describing the units in the `Block`. 221 | output_stride: If `None`, then the output will be computed at the nominal 222 | network stride. If output_stride is not `None`, it specifies the 223 | requested ratio of input to output spatial resolution, which needs to 224 | be equal to the product of unit strides from the start up to some level 225 | of the ResNet. For example, if the ResNet employs units with strides 226 | 1, 2, 1, 3, 4, 1, then valid values for the output_stride are 1, 2, 6, 227 | 24 or None (which is equivalent to output_stride=24). 228 | 229 | Returns: 230 | net: Output tensor with stride equal to the specified output_stride. 
231 | endpoints: A dictionary from components of the network to the 232 | corresponding activation. 233 | 234 | Raises: 235 | ValueError: If the target output_stride is not valid. 236 | """ 237 | # The current_stride variable keeps track of the effective stride of the 238 | # activations. This allows us to invoke atrous convolution whenever 239 | # applying the next residual unit would result in the activations having 240 | # stride larger than the target output_stride. 241 | current_stride = 1 242 | 243 | # The atrous convolution rate parameter. 244 | rate = 1 245 | 246 | endpoints_collection = {} 247 | 248 | for block in blocks: 249 | with tf.variable_scope(block.scope, 'block', [net]) as scope: 250 | for i, unit in enumerate(block.args): 251 | if (output_stride is not None 252 | and current_stride > output_stride): 253 | raise ValueError( 254 | 'The target output_stride cannot be reached.' 255 | ) 256 | 257 | with tf.variable_scope('unit_%d' % (i + 1), values=[net]): 258 | # If we have reached the target output_stride, then we need 259 | # to employ atrous convolution with stride=1 and multiply 260 | # the atrous rate by the current unit's stride for use in 261 | # subsequent layers. 262 | if (output_stride is not None 263 | and current_stride == output_stride): 264 | net = block.unit_fn( 265 | net, rate=rate, **dict(unit, stride=1) 266 | ) 267 | rate *= unit.get('stride', 1) 268 | 269 | else: 270 | net = block.unit_fn(net, rate=1, **unit) 271 | current_stride *= unit.get('stride', 1) 272 | 273 | # Add output of each block to the endpoints collection. 274 | endpoints_collection[scope.name] = net 275 | 276 | if output_stride is not None and current_stride != output_stride: 277 | raise ValueError('The target output_stride cannot be reached.') 278 | 279 | return net, endpoints_collection 280 | 281 | 282 | def resnet_v1(inputs, blocks, training=True, global_pool=True, 283 | output_stride=None, include_root_block=True, reuse=None, 284 | scope=None): 285 | """Generator for v1 ResNet models. 286 | 287 | This function generates a family of ResNet v1 models. See the resnet_v1_*() 288 | methods for specific model instantiations, obtained by selecting different 289 | block instantiations that produce ResNets of various depths. 290 | 291 | Training for image classification on Imagenet is usually done with [224, 292 | 224] inputs, resulting in [7, 7] feature maps at the output of the last 293 | ResNet block for the ResNets defined in [1] that have nominal stride equal 294 | to 32. However, for dense prediction tasks we advise that one uses inputs 295 | with spatial dimensions that are multiples of 32 plus 1, e.g., [321, 296 | 321]. In this case the feature maps at the ResNet output will have spatial 297 | shape [(height - 1) / output_stride + 1, (width - 1) / output_stride + 1] 298 | and corners exactly aligned with the input image corners, which greatly 299 | facilitates alignment of the features to the image. Using as input [225, 300 | 225] images results in [8, 8] feature maps at the output of the last ResNet 301 | block. 302 | 303 | For dense prediction tasks, the ResNet needs to run in fully-convolutional 304 | (FCN) mode and global_pool needs to be set to False. The ResNets in [1, 2] 305 | all have nominal stride equal to 32 and a good choice in FCN mode is to use 306 | output_stride=16 in order to increase the density of the computed features 307 | at small computational and memory overhead, 308 | cf. http://arxiv.org/abs/1606.00915. 
309 | 310 | Args: 311 | inputs: A tensor of size [batch, height_in, width_in, channels]. 312 | blocks: A list of length equal to the number of ResNet blocks. Each 313 | element is a resnet_utils.Block object describing the units in the 314 | block. 315 | training: whether batch_norm layers are in training mode. 316 | global_pool: If True, we perform global average pooling before computing 317 | the logits. Set to True for image classification, False for dense 318 | prediction. 319 | output_stride: If None, then the output will be computed at the nominal 320 | network stride. If output_stride is not None, it specifies the 321 | requested ratio of input to output spatial resolution. 322 | include_root_block: If True, include the initial convolution followed by 323 | max-pooling, if False excludes it. 324 | reuse: whether or not the network and its variables should be reused. To 325 | be able to reuse 'scope' must be given. 326 | scope: Optional variable_scope. 327 | 328 | Returns: 329 | net: A rank-4 tensor of size [batch, height_out, width_out, 330 | channels_out]. If global_pool is False, then height_out and width_out 331 | are reduced by a factor of output_stride compared to the respective 332 | height_in and width_in, else both height_out and width_out equal 333 | one. `net` is the output of the last ResNet block, potentially after 334 | global average pooling. 335 | endpoints: A dictionary from components of the network to the 336 | corresponding activation. 337 | 338 | Raises: 339 | ValueError: If the target output_stride is not valid. 340 | """ 341 | with tf.variable_scope(scope, 'resnet_v1', [inputs], reuse=reuse): 342 | net = inputs 343 | 344 | if include_root_block: 345 | if output_stride is not None: 346 | if output_stride % 4 != 0: 347 | raise ValueError( 348 | 'The output_stride needs to be a multiple of 4.' 349 | ) 350 | output_stride /= 4 351 | 352 | with tf.variable_scope('conv1'): 353 | net = conv2d_same(net, 64, 7, strides=2) 354 | net = tf.layers.batch_normalization( 355 | net, momentum=0.997, epsilon=1e-5, training=False, 356 | fused=False 357 | ) 358 | net = tf.nn.relu(net) 359 | 360 | with tf.variable_scope('pool1'): 361 | net = tf.layers.max_pooling2d(net, [3, 3], strides=2) 362 | 363 | net, endpoints = stack_blocks_dense(net, blocks, output_stride) 364 | 365 | if global_pool: 366 | # Global average pooling. 
367 | net = tf.reduce_mean( 368 | net, [1, 2], name='pool5', keepdims=True 369 | ) 370 | 371 | return net, endpoints 372 | 373 | 374 | resnet_v1.default_image_size = 224 375 | 376 | 377 | def resnet_v1_101(inputs, training=True, global_pool=True, 378 | output_stride=None, reuse=None, scope='resnet_v1_101'): 379 | 380 | blocks = [ 381 | resnet_v1_block('block1', base_depth=64, num_units=3, stride=2), 382 | resnet_v1_block('block2', base_depth=128, num_units=4, stride=2), 383 | resnet_v1_block('block3', base_depth=256, num_units=23, stride=2), 384 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 385 | ] 386 | 387 | return resnet_v1( 388 | inputs, 389 | blocks, 390 | training, 391 | global_pool, 392 | output_stride, 393 | include_root_block=True, 394 | reuse=reuse, 395 | scope=scope, 396 | ) 397 | 398 | 399 | def resnet_v1_101_tail(inputs, scope='resnet_v1_101'): 400 | blocks = [ 401 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 402 | ] 403 | 404 | return resnet_v1( 405 | inputs, blocks, global_pool=False, training=False, 406 | include_root_block=False, scope=scope, 407 | ) 408 | -------------------------------------------------------------------------------- /workshop/vis.py: -------------------------------------------------------------------------------- 1 | import math 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import sys 5 | 6 | from ipywidgets import IntSlider, Layout 7 | from matplotlib.patches import Rectangle 8 | from PIL import Image, ImageDraw, ImageFont 9 | 10 | from workshop.image import to_image 11 | 12 | 13 | # 14 | # Jupyter notebook related functions. 15 | # 16 | 17 | def pager(count, per_page, description='', **kwargs): 18 | 19 | slider_kwargs = { 20 | 'min': 0, 21 | 'max': (count - 1) // per_page, 22 | 'value': 0, 23 | 'description': description, 24 | 'layout': Layout(width='600px'), 25 | 'continuous_update': False, 26 | 'style': {'description_width': 'initial'}, 27 | } 28 | slider_kwargs.update(**kwargs) 29 | 30 | slider = IntSlider(**slider_kwargs) 31 | 32 | return slider 33 | 34 | 35 | def image_grid(count, columns=4, sizes=(5, 3)): 36 | rows = int(math.ceil(count / columns)) 37 | 38 | width, height = sizes 39 | 40 | figsize = (columns * width, rows * height) 41 | fig, axes = plt.subplots(rows, columns, figsize=figsize) 42 | 43 | # Default configuration for each axis. 44 | for ax in axes.ravel(): 45 | ax.axis('off') 46 | 47 | return axes.ravel() 48 | 49 | 50 | def vis_anchors(anchors): 51 | _, ax = plt.subplots(1, figsize=(10, 10)) 52 | 53 | ax.set_xlim([-500, 500]) 54 | ax.set_ylim([-500, 500]) 55 | 56 | for idx in range(anchors.shape[0]): 57 | add_rectangle(ax, anchors[idx, :]) 58 | 59 | plt.show() 60 | 61 | 62 | def add_rectangle(ax, coords, **kwargs): 63 | x_min, y_min, x_max, y_max = coords 64 | ax.add_patch( 65 | Rectangle( 66 | (x_min, y_min), x_max - x_min, y_max - y_min, 67 | linewidth=1, edgecolor='#dc3912', facecolor='none', 68 | **kwargs 69 | ) 70 | ) 71 | return ax 72 | 73 | 74 | # 75 | # Bounding box related functions. 76 | # 77 | 78 | def get_font(): 79 | """Attempts to retrieve a reasonably-looking TTF font from the system. 80 | 81 | We don't make much of an effort, but it's what we can reasonably do without 82 | incorporating additional dependencies for this task. 
 83 |     """
 84 |     if sys.platform == 'win32':
 85 |         font_names = ['Arial']
 86 |     elif sys.platform in ['linux', 'linux2']:
 87 |         font_names = ['DejaVuSans-Bold', 'DroidSans-Bold']
 88 |     else:  # darwin, plus a fallback so `font_names` is always defined.
 89 |         font_names = ['Menlo', 'Helvetica']
 90 | 
 91 |     font = None
 92 |     for font_name in font_names:
 93 |         try:
 94 |             font = ImageFont.truetype(font_name)
 95 |             break
 96 |         except IOError:
 97 |             continue
 98 | 
 99 |     return font
100 | 
101 | 
102 | SYSTEM_FONT = get_font()
103 | 
104 | 
105 | def hex_to_rgb(x):
106 |     """Turns a color hex representation into a tuple representation."""
107 |     return tuple([int(x[i:i + 2], 16) for i in (0, 2, 4)])
108 | 
109 | 
110 | def build_colormap():
111 |     """Builds a colormap function that maps labels to colors.
112 | 
113 |     Returns:
114 |         Function that receives a label and returns a color tuple `(R, G, B)`
115 |         for said label.
116 |     """
117 |     # Build the 10-color palette to be used for all classes. The following
118 |     # are the hex codes for said colors (taken from d3's default 10-color
119 |     # categorical palette).
120 |     palette = (
121 |         '1f77b4ff7f0e2ca02cd627289467bd8c564be377c27f7f7fbcbd2217becf'
122 |     )
123 |     colors = [hex_to_rgb(palette[i:i + 6]) for i in range(0, len(palette), 6)]
124 | 
125 |     seen_labels = {}
126 | 
127 |     def colormap(label):
128 |         # If label not yet seen, get the next value in the palette sequence.
129 |         if label not in seen_labels:
130 |             seen_labels[label] = colors[len(seen_labels) % len(colors)]
131 | 
132 |         return seen_labels[label]
133 | 
134 |     return colormap
135 | 
136 | 
137 | def draw_rectangle(draw, coordinates, color, width=1, fill=30):
138 |     """Draw a rectangle with an optional width."""
139 |     # Add alphas to the color so we have a small overlay over the object.
140 |     fill = color + (fill,)
141 |     outline = color + (255,)
142 | 
143 |     # Pillow doesn't support width in rectangles, so we must emulate it with a
144 |     # loop.
145 |     for i in range(width):
146 |         coords = [
147 |             coordinates[0] - i,
148 |             coordinates[1] - i,
149 |             coordinates[2] + i,
150 |             coordinates[3] + i,
151 |         ]
152 | 
153 |         # Fill must be drawn only for the first rectangle, or the alphas will
154 |         # add up.
155 |         if i == 0:
156 |             draw.rectangle(coords, fill=fill, outline=outline)
157 |         else:
158 |             draw.rectangle(coords, outline=outline)
159 | 
160 | 
161 | def draw_label(draw, coords, label, prob, color, scale=1):
162 |     """Draw a box with the label and probability."""
163 |     # Attempt to get a native TTF font. If not, use the default bitmap font.
164 |     global SYSTEM_FONT
165 |     if SYSTEM_FONT:
166 |         label_font = SYSTEM_FONT.font_variant(size=int(round(16 * scale)))
167 |         prob_font = SYSTEM_FONT.font_variant(size=int(round(12 * scale)))
168 |     else:
169 |         label_font = ImageFont.load_default()
170 |         prob_font = ImageFont.load_default()
171 | 
172 |     label = str(label)  # `label` may not be a string.
173 |     prob = '({:.2f})'.format(prob)  # Turn `prob` into a string.
174 | 
175 |     # We want the probability font to be smaller, so we'll write the label in
176 |     # two steps.
177 |     label_w, label_h = label_font.getsize(label)
178 |     prob_w, prob_h = prob_font.getsize(prob)
179 | 
180 |     # Get margins to manually adjust the spacing. The margin goes between each
181 |     # segment (i.e. margin, label, margin, prob, margin).
182 |     margin_w, margin_h = label_font.getsize('M')
183 |     margin_w *= 0.2
184 |     _, full_line_height = label_font.getsize('Mq')
185 | 
186 |     # Draw the background first, considering all margins and the full line
187 |     # height.
188 |     background_coords = [
189 |         coords[0],
190 |         coords[1],
191 |         coords[0] + label_w + prob_w + 3 * margin_w,
192 |         coords[1] + full_line_height * 1.15,
193 |     ]
194 |     draw.rectangle(background_coords, fill=color + (255,))
195 | 
196 |     # Then write the two pieces of text.
197 |     draw.text([
198 |         coords[0] + margin_w,
199 |         coords[1],
200 |     ], label, font=label_font)
201 | 
202 |     draw.text([
203 |         coords[0] + label_w + 2 * margin_w,
204 |         coords[1] + (margin_h - prob_h),
205 |     ], prob, font=prob_font)
206 | 
207 | 
208 | def draw_bboxes(image_array, objects):
209 |     # Receives a numpy array; translate it into a PIL image.
210 |     image = to_image(image_array)
211 | 
212 |     # Open as 'RGBA' in order to draw translucent boxes.
213 |     draw = ImageDraw.Draw(image, 'RGBA')
214 |     for obj in objects:
215 |         color = (220, 57, 18)
216 |         draw_rectangle(draw, obj, color, width=2, fill=0)
217 | 
218 |     return image
219 | 
220 | 
221 | def vis_objects(image, objects, colormap=None, labels=True, scale=1, fill=30):
222 |     """Visualize objects as returned by `Detector`.
223 | 
224 |     Arguments:
225 |         image (numpy.ndarray): Image to draw the bounding boxes on.
226 |         objects (list of dicts or dict): List of objects as returned by a
227 |             `Detector` instance.
228 |         colormap (function): Colormap function to use for the objects.
229 |         labels (boolean): Whether to draw labels.
230 |         scale (float): Scale factor for the box sizes, which will enlarge or
231 |             shrink the width of the boxes and the fonts.
232 |         fill (int): Integer between 0..255 to use as fill for the bounding
233 |             boxes.
234 | 
235 |     Returns:
236 |         A PIL image with the detected objects' bounding boxes and labels drawn.
237 |         Can be cast to a `numpy.ndarray` by calling `numpy.array` on the
238 |         returned object.
239 |     """
240 |     if not isinstance(objects, list):
241 |         objects = [objects]
242 | 
243 |     if colormap is None:
244 |         colormap = build_colormap()
245 | 
246 |     image = Image.fromarray(image.astype(np.uint8))
247 | 
248 |     draw = ImageDraw.Draw(image, 'RGBA')
249 |     for obj in objects:
250 |         color = colormap(obj['label'])
251 |         draw_rectangle(
252 |             draw, obj['bbox'], color, width=round(3 * scale), fill=fill
253 |         )
254 |         if labels:
255 |             draw_label(
256 |                 draw, obj['bbox'][:2], obj['label'], obj['prob'], color,
257 |                 scale=scale
258 |             )
259 | 
260 |     return image
261 | 
262 | 
263 | def draw_bboxes_with_labels(image_array, classes, objects, preds, probs):
264 |     # Receives a numpy array; translate it into a PIL image.
265 |     image = to_image(image_array)
266 | 
267 |     colormap = build_colormap()
268 | 
269 |     # Open as 'RGBA' in order to draw translucent boxes.
270 |     draw = ImageDraw.Draw(image, 'RGBA')
271 |     for obj, pred, prob in zip(objects, preds, probs):
272 |         label = classes[pred]
273 |         color = colormap(label)
274 |         draw_rectangle(draw, obj, color, width=3)
275 |         draw_label(draw, obj[:2], label, prob, color)
276 | 
277 |     return image
278 | 
--------------------------------------------------------------------------------
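To make the intended call pattern of the two drawing entry points above concrete, here is a minimal usage sketch. It is not part of the repository: it only assumes the `workshop` package is importable and that its helpers accept a plain `uint8` RGB array; the blank image, the box coordinates, the class names, the probabilities, and the `detections_preview.png` output name are all dummy values chosen for illustration.

```python
# Minimal usage sketch for the drawing helpers above (not part of the
# repository). Every value below (blank image, boxes, class names,
# probabilities) is made up purely for illustration.
import numpy as np

from workshop.vis import draw_bboxes, draw_bboxes_with_labels

# A blank white 400x600 RGB "image" as a uint8 numpy array.
image = np.full((400, 600, 3), 255, dtype=np.uint8)

# Two hypothetical detections in (x_min, y_min, x_max, y_max) format.
boxes = [
    [50, 60, 200, 220],
    [300, 100, 480, 350],
]
classes = ['cat', 'dog']   # Index-to-label lookup.
preds = [0, 1]             # Predicted class index for each box.
probs = [0.92, 0.87]       # Predicted probability for each box.

# Plain red boxes, no labels.
plain = draw_bboxes(image, boxes)

# Colored boxes, each tagged with "<label> (<prob>)".
labeled = draw_bboxes_with_labels(image, classes, boxes, preds, probs)
labeled.save('detections_preview.png')  # Arbitrary output name for this sketch.
```

Both helpers return a PIL image, so the result can be displayed inline in the notebook or converted back to a `numpy.ndarray` with `numpy.array` if further processing is needed.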