├── LICENSE
├── README.md
├── README_zh.md
├── create_data_lists.py
├── datasets.py
├── detect.py
├── eval.py
├── img
├── 000001.jpg
├── 000022.jpg
├── 000029.jpg
├── 000045.jpg
├── 000062.jpg
├── 000069.jpg
├── 000075.jpg
├── 000082.jpg
├── 000085.jpg
├── 000092.jpg
├── 000098.jpg
├── 000100.jpg
├── 000116.jpg
├── 000124.jpg
├── 000127.jpg
├── 000128.jpg
├── 000139.jpg
├── 000144.jpg
├── 000145.jpg
├── auxconv.jpg
├── baseball.gif
├── bc1.PNG
├── bc2.PNG
├── confloss.jpg
├── cs.PNG
├── ecs1.PNG
├── ecs2.PNG
├── fcconv1.jpg
├── fcconv2.jpg
├── fcconv3.jpg
├── fcconv4.jpg
├── incomplete.jpg
├── jaccard.jpg
├── locloss.jpg
├── matching1.PNG
├── matching2.jpg
├── modifiedvgg.PNG
├── nms1.PNG
├── nms2.PNG
├── nms3.jpg
├── nms4.PNG
├── predconv1.jpg
├── predconv2.jpg
├── predconv3.jpg
├── predconv4.jpg
├── priors1.jpg
├── priors2.jpg
├── reshaping1.jpg
├── reshaping2.jpg
├── totalloss.jpg
├── vgg16.PNG
├── wh1.jpg
└── wh2.jpg
├── model.py
├── train.py
└── utils.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Sagar Vinodababu
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This is a **[PyTorch](https://pytorch.org) Tutorial to Object Detection**.
2 |
3 | This is the third in [a series of tutorials](https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch) I'm writing about _implementing_ cool models on your own with the amazing PyTorch library.
4 |
5 | Basic knowledge of PyTorch and convolutional neural networks is assumed.
6 |
7 | If you're new to PyTorch, first read [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) and [Learning PyTorch with Examples](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html).
8 |
9 | Questions, suggestions, or corrections can be posted as issues.
10 |
11 | I'm using `PyTorch 0.4` in `Python 3.6`.
12 |
13 | ---
14 |
15 | **04 Nov 2023**: 中文翻译 – a Chinese translation of this tutorial has been kindly made available by [@zigerZZZ](https://github.com/zigerZZZ) – see [README_zh.md](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/README_zh.md).
16 |
17 | ---
18 |
19 | # Contents
20 |
21 | [***Objective***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#objective)
22 |
23 | [***Concepts***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#concepts)
24 |
25 | [***Overview***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#overview)
26 |
27 | [***Implementation***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#implementation)
28 |
29 | [***Training***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#training)
30 |
31 | [***Evaluation***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#evaluation)
32 |
33 | [***Inference***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#inference)
34 |
35 | [***Frequently Asked Questions***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#faqs)
36 |
37 | # Objective
38 |
39 | **To build a model that can detect and localize specific objects in images.**
40 |
41 |
42 |
43 |
44 |
45 | We will be implementing the [Single Shot Multibox Detector (SSD)](https://arxiv.org/abs/1512.02325), a popular, powerful, and especially nimble network for this task. The authors' original implementation can be found [here](https://github.com/weiliu89/caffe/tree/ssd).
46 |
47 | Here are some examples of object detection in images not seen during training –
48 |
49 | ---
    |
    | _(Example detection images are embedded here in the rendered README.)_
    |
    | ---
99 | There are more examples at the [end of the tutorial](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#some-more-examples).
100 |
101 | ---
102 |
103 | # Concepts
104 |
105 | * **Object Detection**. duh.
106 |
107 | * **Single-Shot Detection**. Earlier architectures for object detection consisted of two distinct stages – a region proposal network that performs object localization and a classifier for detecting the types of objects in the proposed regions. Computationally, these can be very expensive and therefore ill-suited for real-world, real-time applications. Single-shot models encapsulate both localization and detection tasks in a single forward sweep of the network, resulting in significantly faster detections while deployable on lighter hardware.
108 |
109 | * **Multiscale Feature Maps**. In image classification tasks, we base our predictions on the final convolutional feature map – the smallest but deepest representation of the original image. In object detection, feature maps from intermediate convolutional layers can also be _directly_ useful because they represent the original image at different scales. Therefore, a fixed-size filter operating on different feature maps will be able to detect objects of various sizes.
110 |
111 | * **Priors**. These are pre-computed boxes defined at specific positions on specific feature maps, with specific aspect ratios and scales. They are carefully chosen to match the characteristics of objects' bounding boxes (i.e. the ground truths) in the dataset.
112 |
113 | * **Multibox**. This is [a technique](https://arxiv.org/abs/1312.2249) that formulates predicting an object's bounding box as a _regression_ problem, wherein a detected object's coordinates are regressed to its ground truth's coordinates. In addition, for each predicted box, scores are generated for various object types. Priors serve as feasible starting points for predictions because they are modeled on the ground truths. Therefore, there will be as many predicted boxes as there are priors, most of which will contain no object.
114 |
115 | * **Hard Negative Mining**. This refers to explicitly choosing the most egregious false positives predicted by a model and forcing it to learn from these examples. In other words, we are mining only those negatives that the model found _hardest_ to identify correctly. In the context of object detection, where the vast majority of predicted boxes do not contain an object, this also serves to reduce the negative-positive imbalance.
116 |
117 | * **Non-Maximum Suppression**. At any given location, multiple priors can overlap significantly. Therefore, predictions arising out of these priors could actually be duplicates of the same object. Non-Maximum Suppression (NMS) is a means to remove redundant predictions by suppressing all but the one with the maximum score.
118 |
119 | # Overview
120 |
121 | In this section, I will present an overview of this model. If you're already familiar with it, you can skip straight to the [Implementation](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#implementation) section or the commented code.
122 |
123 | As we proceed, you will notice that there's a fair bit of engineering that's resulted in the SSD's very specific structure and formulation. Don't worry if some aspects of it seem contrived or unspontaneous at first. Remember, it's built upon _years_ of (often empirical) research in this field.
124 |
125 | ### Some definitions
126 |
127 | A box is a box. A _bounding_ box is a box that wraps around an object i.e. represents its bounds.
128 |
129 | In this tutorial, we will encounter both types – just boxes and bounding boxes. But all boxes are represented on images and we need to be able to measure their positions, shapes, sizes, and other properties.
130 |
131 | #### Boundary coordinates
132 |
133 | The most obvious way to represent a box is by the pixel coordinates of the `x` and `y` lines that constitute its boundaries.
134 |
135 | 
136 |
137 | The boundary coordinates of a box are simply **`(x_min, y_min, x_max, y_max)`**.
138 |
139 | But pixel values are next to useless if we don't know the actual dimensions of the image.
140 | A better way would be to represent all coordinates in their _fractional_ form.
141 |
142 | 
143 |
144 | Now the coordinates are size-invariant and all boxes across all images are measured on the same scale.
145 |
146 | #### Center-Size coordinates
147 |
148 | This is a more explicit way of representing a box's position and dimensions.
149 |
150 | 
151 |
152 | The center-size coordinates of a box are **`(c_x, c_y, w, h)`**.
153 |
154 | In the code, you will find that we routinely use both coordinate systems depending upon their suitability for the task, and _always_ in their fractional forms.
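    |
    | As a minimal sketch, converting between the two systems is just a couple of lines of vectorized tensor arithmetic (`utils.py` contains helpers for exactly this; the versions below are illustrative and assume boxes stacked as `n_boxes, 4` tensors) –
    |
    | ```python
    | import torch
    |
    | def xy_to_cxcy(xy):
    |     """Boundary coordinates (x_min, y_min, x_max, y_max) -> center-size (c_x, c_y, w, h)."""
    |     return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,          # c_x, c_y
    |                       xy[:, 2:] - xy[:, :2]], dim=1)        # w, h
    |
    | def cxcy_to_xy(cxcy):
    |     """Center-size (c_x, c_y, w, h) -> boundary coordinates (x_min, y_min, x_max, y_max)."""
    |     return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,        # x_min, y_min
    |                       cxcy[:, :2] + cxcy[:, 2:] / 2], dim=1)  # x_max, y_max
    | ```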
155 |
156 | #### Jaccard Index
157 |
158 | The Jaccard Index or Jaccard Overlap or Intersection-over-Union (IoU) measures the **degree or extent to which two boxes overlap**.
159 |
160 | 
161 |
162 | An IoU of `1` implies they are the _same_ box, while a value of `0` indicates they're mutually exclusive spaces.
163 |
164 | It's a simple metric, but also one that finds many applications in our model.
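    |
    | Here's a minimal, vectorized sketch of the Jaccard overlap between every pair of boxes from two sets, both in fractional boundary coordinates (a helper of this kind is used repeatedly in the code) –
    |
    | ```python
    | import torch
    |
    | def find_intersection(set_1, set_2):
    |     """Intersection area of every box in set_1 (n1, 4) with every box in set_2 (n2, 4)."""
    |     lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # (n1, n2, 2)
    |     upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # (n1, n2, 2)
    |     intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0)             # (n1, n2, 2)
    |     return intersection_dims[:, :, 0] * intersection_dims[:, :, 1]                  # (n1, n2)
    |
    | def find_jaccard_overlap(set_1, set_2):
    |     """IoU of every box in set_1 with every box in set_2, returned as an (n1, n2) tensor."""
    |     intersection = find_intersection(set_1, set_2)
    |     areas_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    |     areas_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)
    |     union = areas_1.unsqueeze(1) + areas_2.unsqueeze(0) - intersection   # (n1, n2)
    |     return intersection / union
    | ```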
165 |
166 | ### Multibox
167 |
168 | Multibox is a technique for detecting objects where a prediction consists of two components –
169 |
170 | - **Coordinates of a box that may or may not contain an object**. This is a _regression_ task.
171 |
172 | - **Scores for various object types for this box**, including a _background_ class which implies there is no object in the box. This is a _classification_ task.
173 |
174 | ### Single Shot Detector (SSD)
175 |
176 | The SSD is a purely convolutional neural network (CNN) that we can organize into three parts –
177 |
178 | - __Base convolutions__ derived from an existing image classification architecture that will provide lower-level feature maps.
179 |
180 | - __Auxiliary convolutions__ added on top of the base network that will provide higher-level feature maps.
181 |
182 | - __Prediction convolutions__ that will locate and identify objects in these feature maps.
183 |
184 | The paper demonstrates two variants of the model called the SSD300 and the SSD512. The suffixes represent the size of the input image. Although the two networks differ slightly in the way they are constructed, they are in principle the same. The SSD512 is just a larger network and results in marginally better performance.
185 |
186 | For convenience, we will deal with the SSD300.
187 |
188 | ### Base Convolutions – part 1
189 |
190 | First of all, why use convolutions from an existing network architecture?
191 |
192 | Because models proven to work well with image classification are already pretty good at capturing the basic essence of an image. The same convolutional features are useful for object detection, albeit in a more _local_ sense – we're less interested in the image as a whole than specific regions of it where objects are present.
193 |
194 | There's also the added advantage of being able to use layers pretrained on a reliable classification dataset. As you may know, this is called **Transfer Learning**. By borrowing knowledge from a different but closely related task, we've made progress before we've even begun.
195 |
196 | The authors of the paper employ the **VGG-16 architecture** as their base network. It's rather simple in its original form.
197 |
198 | 
199 |
200 | They recommend using one that's pretrained on the _ImageNet Large Scale Visual Recognition Competition (ILSVRC)_ classification task. Luckily, there's one already available in PyTorch, as are other popular architectures. If you wish, you could opt for something larger like the ResNet. Just be mindful of the computational requirements.
201 |
202 | As per the paper, **we have to make some changes to this pretrained network** to adapt it to our own challenge of object detection. Some are logical and necessary, while others are mostly a matter of convenience or preference.
203 |
204 | - The **input image size** will be `300, 300`, as stated earlier.
205 |
206 | - The **3rd pooling layer**, which halves dimensions, will use the mathematical `ceiling` function instead of the default `floor` function in determining output size. This is significant only if the dimensions of the preceding feature map are odd and not even. By looking at the image above, you could calculate that for our input image size of `300, 300`, the `conv3_3` feature map will be of cross-section `75, 75`, which is halved to `38, 38` instead of an inconvenient `37, 37`.
207 |
208 | - We modify the **5th pooling layer** from a `2, 2` kernel and `2` stride to a `3, 3` kernel and `1` stride. The effect this has is it no longer halves the dimensions of the feature map from the preceding convolutional layer.
209 |
210 | - We don't need the fully connected (i.e. classification) layers because they serve no purpose here. We will toss `fc8` away completely, but choose to **_rework_ `fc6` and `fc7` into convolutional layers `conv6` and `conv7`**.
211 |
212 | The first three modifications are straightforward enough, but that last one probably needs some explaining.
213 |
214 | ### FC → Convolutional Layer
215 |
216 | How do we reparameterize a fully connected layer into a convolutional layer?
217 |
218 | Consider the following scenario.
219 |
220 | In the typical image classification setting, the first fully connected layer cannot operate on the preceding feature map or image _directly_. We'd need to flatten it into a 1D structure.
221 |
222 | 
223 |
224 | In this example, there's an image of dimensions `2, 2, 3`, flattened to a 1D vector of size `12`. For an output of size `2`, the fully connected layer computes two dot-products of this flattened image with two vectors of the same size `12`. **These two vectors, shown in gray, are the parameters of the fully connected layer.**
225 |
226 | Now, consider a different scenario where we use a convolutional layer to produce `2` output values.
227 |
228 | 
229 |
230 | Here, the image of dimensions `2, 2, 3` need not be flattened, obviously. The convolutional layer uses two filters with `12` elements in the same shape as the image to perform two dot products. **These two filters, shown in gray, are the parameters of the convolutional layer.**
231 |
232 | But here's the key part – **in both scenarios, the outputs `Y_0` and `Y_1` are the same!**
233 |
234 | 
235 |
236 | The two scenarios are equivalent.
237 |
238 | What does this tell us?
239 |
240 | That **on an image of size `H, W` with `I` input channels, a fully connected layer of output size `N` is equivalent to a convolutional layer with kernel size equal to the image size `H, W` and `N` output channels**, provided that the parameters of the fully connected network `N, H * W * I` are the same as the parameters of the convolutional layer `N, H, W, I`.
241 |
242 | 
243 |
244 | Therefore, any fully connected layer can be converted to an equivalent convolutional layer simply **by reshaping its parameters**.
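    |
    | If you'd like to convince yourself of this equivalence, here's a small, self-contained check using the toy shapes from the example above –
    |
    | ```python
    | import torch
    | import torch.nn as nn
    |
    | H, W, I, N = 2, 2, 3, 2                         # image size, input channels, output size
    | x = torch.randn(1, I, H, W)                     # a random "image"
    |
    | fc = nn.Linear(I * H * W, N)                    # fully connected layer
    | conv = nn.Conv2d(I, N, kernel_size=(H, W))      # convolution with a kernel the size of the image
    |
    | # copy the FC parameters into the convolutional layer, reshaped to (N, I, H, W)
    | conv.weight.data.copy_(fc.weight.data.view(N, I, H, W))
    | conv.bias.data.copy_(fc.bias.data)
    |
    | y_fc = fc(x.view(1, -1))                        # (1, N)
    | y_conv = conv(x).view(1, -1)                    # (1, N, 1, 1) -> (1, N)
    | print(torch.allclose(y_fc, y_conv, atol=1e-6))  # True
    | ```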
245 |
246 | ### Base Convolutions – part 2
247 |
248 | We now know how to convert `fc6` and `fc7` in the original VGG-16 architecture into `conv6` and `conv7` respectively.
249 |
250 | In the ImageNet VGG-16 [shown previously](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#base-convolutions--part-1), which operates on images of size `224, 224, 3`, you can see that the output of `conv5_3` will be of size `7, 7, 512`. Therefore –
251 |
252 | - `fc6` with a flattened input size of `7 * 7 * 512` and an output size of `4096` has parameters of dimensions `4096, 7 * 7 * 512`. **The equivalent convolutional layer `conv6` has a `7, 7` kernel size and `4096` output channels, with reshaped parameters of dimensions `4096, 7, 7, 512`.**
253 |
254 | - `fc7` with an input size of `4096` (i.e. the output size of `fc6`) and an output size `4096` has parameters of dimensions `4096, 4096`. The input could be considered as a `1, 1` image with `4096` input channels. **The equivalent convolutional layer `conv7` has a `1, 1` kernel size and `4096` output channels, with reshaped parameters of dimensions `4096, 1, 1, 4096`.**
255 |
256 | We can see that `conv6` has `4096` filters, each with dimensions `7, 7, 512`, and `conv7` has `4096` filters, each with dimensions `1, 1, 4096`.
257 |
258 | These filters are numerous and large – and computationally expensive.
259 |
260 | To remedy this, the authors opt to **reduce both their number and the size of each filter by subsampling parameters** from the converted convolutional layers.
261 |
262 | - `conv6` will use `1024` filters, each with dimensions `3, 3, 512`. Therefore, the parameters are subsampled from `4096, 7, 7, 512` to `1024, 3, 3, 512`.
263 |
264 | - `conv7` will use `1024` filters, each with dimensions `1, 1, 1024`. Therefore, the parameters are subsampled from `4096, 1, 1, 4096` to `1024, 1, 1, 1024`.
265 |
266 | Based on the references in the paper, we will **subsample by picking every `m`th parameter along a particular dimension**, in a process known as [_decimation_](https://en.wikipedia.org/wiki/Downsampling_(signal_processing)).
267 |
268 | Since the kernel of `conv6` is decimated from `7, 7` to `3, 3` by keeping only every 3rd value, there are now _holes_ in the kernel. Therefore, we would need to **make the kernel dilated or _atrous_**.
269 |
270 | This corresponds to a dilation of `3` (same as the decimation factor `m = 3`). However, the authors actually use a dilation of `6`, possibly because the 5th pooling layer no longer halves the dimensions of the preceding feature map.
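    |
    | Here's a minimal sketch of such decimation, following the dimension ordering used in the text above – the output channels are decimated by a factor of `4` and the kernel dimensions by a factor of `3` (in the code, this subsampling happens while loading the pretrained VGG weights in `model.py`) –
    |
    | ```python
    | import torch
    |
    | def decimate(tensor, m):
    |     """Keep every m[d]-th value along dimension d; None means keep that dimension whole."""
    |     for d in range(tensor.dim()):
    |         if m[d] is not None:
    |             tensor = tensor.index_select(dim=d,
    |                                          index=torch.arange(0, tensor.size(d), m[d]))
    |     return tensor
    |
    | # e.g. fc6's reshaped parameters (4096, 7, 7, 512) -> conv6's parameters (1024, 3, 3, 512)
    | fc6_params = torch.randn(4096, 7, 7, 512)
    | conv6_params = decimate(fc6_params, m=[4, 3, 3, None])
    | print(conv6_params.shape)  # torch.Size([1024, 3, 3, 512])
    | ```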
271 |
272 | We are now in a position to present our base network, **the modified VGG-16**.
273 |
274 | 
275 |
276 | In the above figure, pay special attention to the outputs of `conv4_3` and `conv7`. You will see why soon enough.
277 |
278 | ### Auxiliary Convolutions
279 |
280 | We will now **stack some more convolutional layers on top of our base network**. These convolutions provide additional feature maps, each progressively smaller than the last.
281 |
282 | 
283 |
284 | We introduce four convolutional blocks, each with two layers. While size reduction happened through pooling in the base network, here it is facilitated by a stride of `2` in every second layer.
285 |
286 | Again, take note of the feature maps from `conv8_2`, `conv9_2`, `conv10_2`, and `conv11_2`.
287 |
288 | ### A detour
289 |
290 | Before we move on to the prediction convolutions, we must first understand what it is we are predicting. Sure, it's objects and their positions, _but in what form?_
291 |
292 | It is here that we must learn about _priors_ and the crucial role they play in the SSD.
293 |
294 | #### Priors
295 |
296 | Object predictions can be quite diverse, and I don't just mean their type. They can occur at any position, with any size and shape. Mind you, we shouldn't go as far as to say there are _infinite_ possibilities for where and how an object can occur. While this may be true mathematically, many options are simply improbable or uninteresting. Furthermore, we needn't insist that boxes are pixel-perfect.
297 |
298 | In effect, we can discretize the mathematical space of potential predictions into just _thousands_ of possibilities.
299 |
300 | **Priors are precalculated, fixed boxes which collectively represent this universe of probable and approximate box predictions**.
301 |
302 | Priors are manually but carefully chosen based on the shapes and sizes of ground truth objects in our dataset. By placing these priors at every possible location in a feature map, we also account for variety in position.
303 |
304 | In defining the priors, the authors specify that –
305 |
306 | - **they will be applied to various low-level and high-level feature maps**, viz. those from `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2`, and `conv11_2`. These are the same feature maps indicated on the figures before.
307 |
308 | - **if a prior has a scale `s`, then its area is equal to that of a square with side `s`**. The largest feature map, `conv4_3`, will have priors with a scale of `0.1`, i.e. `10%` of the image's dimensions, while the rest have priors with scales linearly increasing from `0.2` to `0.9`. As you can see, larger feature maps have priors with smaller scales and are therefore ideal for detecting smaller objects.
309 |
310 | - **At _each_ position on a feature map, there will be priors of various aspect ratios**. All feature maps will have priors with ratios `1:1, 2:1, 1:2`. The intermediate feature maps of `conv7`, `conv8_2`, and `conv9_2` will _also_ have priors with ratios `3:1, 1:3`. Moreover, all feature maps will have *one extra prior* with an aspect ratio of `1:1` and at a scale that is the geometric mean of the scales of the current and subsequent feature map.
311 |
312 | | Feature Map From | Feature Map Dimensions | Prior Scale | Aspect Ratios | Number of Priors per Position | Total Number of Priors on this Feature Map |
313 | | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
314 | | `conv4_3` | 38, 38 | 0.1 | 1:1, 2:1, 1:2 + an extra prior | 4 | 5776 |
315 | | `conv7` | 19, 19 | 0.2 | 1:1, 2:1, 1:2, 3:1, 1:3 + an extra prior | 6 | 2166 |
316 | | `conv8_2` | 10, 10 | 0.375 | 1:1, 2:1, 1:2, 3:1, 1:3 + an extra prior | 6 | 600 |
317 | | `conv9_2` | 5, 5 | 0.55 | 1:1, 2:1, 1:2, 3:1, 1:3 + an extra prior | 6 | 150 |
318 | | `conv10_2` | 3, 3 | 0.725 | 1:1, 2:1, 1:2 + an extra prior | 4 | 36 |
319 | | `conv11_2` | 1, 1 | 0.9 | 1:1, 2:1, 1:2 + an extra prior | 4 | 4 |
320 | | **Grand Total** | – | – | – | – | **8732 priors** |
321 |
322 | There are a total of 8732 priors defined for the SSD300!
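    |
    | You can verify this count with a quick back-of-the-envelope calculation –
    |
    | ```python
    | feature_map_dims = {'conv4_3': 38, 'conv7': 19, 'conv8_2': 10, 'conv9_2': 5, 'conv10_2': 3, 'conv11_2': 1}
    | priors_per_position = {'conv4_3': 4, 'conv7': 6, 'conv8_2': 6, 'conv9_2': 6, 'conv10_2': 4, 'conv11_2': 4}
    |
    | total = sum(feature_map_dims[f] ** 2 * priors_per_position[f] for f in feature_map_dims)
    | print(total)  # 8732
    | ```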
323 |
324 | #### Visualizing Priors
325 |
326 | We defined the priors in terms of their _scales_ and _aspect ratios_.
327 |
328 | 
329 |
330 | Solving these equations yields a prior's dimensions `w` and `h`.
331 |
332 | 
333 |
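    | Concretely, for a prior of scale `s` and aspect ratio `a`, these two conditions give `w = s * sqrt(a)` and `h = s / sqrt(a)`, so that `w * h = s^2` and `w / h = a`. A tiny sketch –
    |
    | ```python
    | from math import sqrt
    |
    | def prior_dimensions(scale, aspect_ratio):
    |     """Width and height of a prior such that w * h = scale^2 and w / h = aspect_ratio."""
    |     return scale * sqrt(aspect_ratio), scale / sqrt(aspect_ratio)
    |
    | # e.g. the priors at any one position on conv9_2 (scale 0.55)
    | for ratio in [1., 2., 0.5, 3., 1. / 3.]:
    |     print(ratio, prior_dimensions(0.55, ratio))
    | ```
    |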
334 | We're now in a position to draw them on their respective feature maps.
335 |
336 | For example, let's try to visualize what the priors will look like at the central tile of the feature map from `conv9_2`.
337 |
338 | 
339 |
340 | The same priors also exist for each of the other tiles.
341 |
342 | 
343 |
344 | #### Predictions vis-à-vis Priors
345 |
346 | [Earlier](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#multibox), we said we would use regression to find the coordinates of an object's bounding box. But then, surely, the priors can't represent our final predicted boxes?
347 |
348 | They don't.
349 |
350 | Again, I would like to reiterate that the priors represent, _approximately_, the possibilities for prediction.
351 |
352 | This means that **we use each prior as an approximate starting point and then find out how much it needs to be adjusted to obtain a more exact prediction for a bounding box**.
353 |
354 | So if each predicted bounding box is a slight deviation from a prior, and our goal is to calculate this deviation, we need a way to measure or quantify it.
355 |
356 | Consider a cat, its predicted bounding box, and the prior with which the prediction was made.
357 |
358 | 
359 |
360 | Assume they are represented in center-size coordinates, which we are familiar with.
361 |
362 | Then –
363 |
364 | 
365 |
366 | This answers the question we posed at the [beginning of this section](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#a-detour). Considering that each prior is adjusted to obtain a more precise prediction, **these four offsets `(g_c_x, g_c_y, g_w, g_h)` are the form in which we will regress bounding boxes' coordinates**.
367 |
368 | As you can see, each offset is normalized by the corresponding dimension of the prior. This makes sense because a certain offset would be less significant for a larger prior than it would be for a smaller prior.
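    |
    | Here's a minimal sketch of this encoding and its inverse, with both boxes and priors in center-size coordinates. (The actual code additionally rescales these offsets by empirical "variance" factors, which I've omitted here for clarity.)
    |
    | ```python
    | import torch
    |
    | def cxcy_to_gcxgcy(cxcy, priors_cxcy):
    |     """Encode center-size boxes (c_x, c_y, w, h) as offsets (g_c_x, g_c_y, g_w, g_h) w.r.t. priors."""
    |     return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / priors_cxcy[:, 2:],  # g_c_x, g_c_y
    |                       torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:])], dim=1)      # g_w, g_h
    |
    | def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    |     """Decode offsets back into center-size boxes – the exact inverse of the above."""
    |     return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] + priors_cxcy[:, :2],  # c_x, c_y
    |                       torch.exp(gcxgcy[:, 2:]) * priors_cxcy[:, 2:]], dim=1)    # w, h
    | ```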
369 |
370 | ### Prediction convolutions
371 |
372 | Earlier, we earmarked and defined priors for six feature maps of various scales and granularity, viz. those from `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2`, and `conv11_2`.
373 |
374 | Then, **for _each_ prior at _each_ location on _each_ feature map**, we want to predict –
375 |
376 | - the **offsets `(g_c_x, g_c_y, g_w, g_h)`** for a bounding box.
377 |
378 | - a set of **`n_classes` scores** for the bounding box, where `n_classes` represents the total number of object types (including a _background_ class).
379 |
380 | To do this in the simplest manner possible, **we need two convolutional layers for each feature map** –
381 |
382 | - a **_localization_ prediction** convolutional layer with a `3, 3` kernel evaluating at each location (i.e. with padding and stride of `1`) with `4` filters for _each_ prior present at the location.
383 |
384 | The `4` filters for a prior calculate the four encoded offsets `(g_c_x, g_c_y, g_w, g_h)` for the bounding box predicted from that prior.
385 |
386 | - a **_class_ prediction** convolutional layer with a `3, 3` kernel evaluating at each location (i.e. with padding and stride of `1`) with `n_classes` filters for _each_ prior present at the location.
387 |
388 | The `n_classes` filters for a prior calculate a set of `n_classes` scores for that prior.
389 |
390 | 
391 |
392 | All our filters are applied with a kernel size of `3, 3`.
393 |
394 | We don't really need kernels (or filters) in the same shapes as the priors because the different filters will _learn_ to make predictions with respect to the different prior shapes.
395 |
396 | Let's take a look at the **outputs of these convolutions**. Consider again the feature map from `conv9_2`.
397 |
398 | 
399 |
400 | The outputs of the localization and class prediction layers are shown in blue and yellow respectively. You can see that the cross-section (`5, 5`) remains unchanged.
401 |
402 | What we're really interested in is the _third_ dimension, i.e. the channels. These contain the actual predictions.
403 |
404 | If you **choose a tile, _any_ tile, in the localization predictions and expand it**, what will you see?
405 |
406 | 
407 |
408 | Voilà! The channel values at each position of the localization predictions represent the encoded offsets with respect to the priors at that position.
409 |
410 | Now, **do the same with the class predictions.** Assume `n_classes = 3`.
411 |
412 | 
413 |
414 | Similar to before, these channels represent the class scores for the priors at that position.
415 |
416 | Now that we understand what the predictions for the feature map from `conv9_2` look like, we can **reshape them into a more amenable form.**
417 |
418 | 
419 |
420 | We have arranged the `150` predictions serially. To the human mind, this should appear more intuitive.
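    |
    | In PyTorch terms, this rearrangement is a permute followed by a view. A minimal sketch for the localization output of `conv9_2` (the class scores are handled identically, with `n_classes` in place of `4`) –
    |
    | ```python
    | import torch
    |
    | N, priors_per_position = 8, 6
    | loc_output = torch.randn(N, priors_per_position * 4, 5, 5)    # conv9_2's localization output
    |
    | # channels last, then flatten the positions and priors: (N, 5, 5, 24) -> (N, 150, 4)
    | loc_output = loc_output.permute(0, 2, 3, 1).contiguous()
    | loc_output = loc_output.view(N, -1, 4)
    | print(loc_output.shape)  # torch.Size([8, 150, 4])
    | ```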
421 |
422 | But let's not stop here. We could do the same for the predictions for _all_ layers and stack them together.
423 |
424 | We calculated earlier that there are a total of 8732 priors defined for our model. Therefore, there will be **8732 predicted boxes in encoded-offset form, and 8732 sets of class scores**.
425 |
426 | 
427 |
428 | **This is the final output of the prediction stage.** A stack of boxes, if you will, and estimates for what's in them.
429 |
430 | It's all coming together, isn't it? If this is your first rodeo in object detection, I should think there's now a faint light at the end of the tunnel.
431 |
432 | ### Multibox loss
433 |
434 | Based on the nature of our predictions, it's easy to see why we might need a unique loss function. Many of us have calculated losses in regression or classification settings before, but rarely, if ever, _together_.
435 |
436 | Obviously, our total loss must be an **aggregate of losses from both types of predictions** – bounding box localizations and class scores.
437 |
438 | Then, there are a few questions to be answered –
439 |
440 | >_What loss function will be used for the regressed bounding boxes?_
441 |
442 | >_Will we use multiclass cross-entropy for the class scores?_
443 |
444 | >_In what ratio will we combine them?_
445 |
446 | >_How do we match predicted boxes to their ground truths?_
447 |
448 | >_We have 8732 predictions! Won't most of these contain no object? Do we even consider them?_
449 |
450 | Phew. Let's get to work.
451 |
452 | #### Matching predictions to ground truths
453 |
454 | Remember, the nub of any supervised learning algorithm is that **we need to be able to match predictions to their ground truths**. This is tricky since object detection is more open-ended than the average learning task.
455 |
456 | For the model to learn _anything_, we'd need to structure the problem in a way that allows for comparisons between our predictions and the objects actually present in the image.
457 |
458 | Priors enable us to do exactly this!
459 |
460 | - **Find the Jaccard overlaps** between the 8732 priors and `N` ground truth objects. This will be a tensor of size `8732, N`.
461 |
462 | - **Match** each of the 8732 priors to the object with which it has the greatest overlap.
463 |
464 | - If a prior is matched with an object with a **Jaccard overlap of less than `0.5`**, then it cannot be said to "contain" the object, and is therefore a **_negative_ match**. Considering we have thousands of priors, most priors will test negative for an object.
465 |
466 | - On the other hand, a handful of priors will actually **overlap significantly (greater than `0.5`)** with an object, and can be said to "contain" that object. These are **_positive_ matches**.
467 |
468 | - Now that we have **matched each of the 8732 priors to a ground truth**, we have, in effect, also **matched the corresponding 8732 predictions to a ground truth**.
469 |
470 | Let's reproduce this logic with an example.
471 |
472 | 
473 |
474 | For convenience, we will assume there are just seven priors, shown in red. The ground truths are in yellow – there are three actual objects in this image.
475 |
476 | Following the steps outlined earlier will yield the following matches –
477 |
478 | 
479 |
480 | Now, **each prior has a match**, positive or negative. By extension, **each prediction has a match**, positive or negative.
481 |
482 | Predictions that are positively matched with an object now have ground truth coordinates that will serve as **targets for localization**, i.e. in the _regression_ task. Naturally, there will be no target coordinates for negative matches.
483 |
484 | All predictions have a ground truth label, which is either the type of object if it is a positive match or a _background_ class if it is a negative match. These are used as **targets for class prediction**, i.e. the _classification_ task.
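    |
    | Here's a condensed sketch of this matching logic for a single image, reusing `find_jaccard_overlap()` from the IoU sketch earlier (the names are illustrative; the full version lives inside `MultiBoxLoss` in `model.py`) –
    |
    | ```python
    | import torch
    |
    | def match_priors(priors_xy, boxes_xy, labels, threshold=0.5):
    |     """priors_xy: (8732, 4) priors and boxes_xy: (n_objects, 4) ground truths, both in fractional
    |     boundary coordinates; labels: (n_objects,) object classes, with 0 reserved for background."""
    |     overlap = find_jaccard_overlap(boxes_xy, priors_xy)                 # (n_objects, 8732)
    |     overlap_for_each_prior, object_for_each_prior = overlap.max(dim=0)  # both (8732)
    |
    |     label_for_each_prior = labels[object_for_each_prior]                # (8732)
    |     label_for_each_prior[overlap_for_each_prior < threshold] = 0        # negative matches -> background
    |     return object_for_each_prior, label_for_each_prior
    | ```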
485 |
486 | #### Localization loss
487 |
488 | We have **no ground truth coordinates for the negative matches**. This makes perfect sense. Why train the model to draw boxes around empty space?
489 |
490 | Therefore, the localization loss is computed only on how accurately we regress positively matched predicted boxes to the corresponding ground truth coordinates.
491 |
492 | Since we predicted localization boxes in the form of offsets `(g_c_x, g_c_y, g_w, g_h)`, we would also need to encode the ground truth coordinates accordingly before we calculate the loss.
493 |
494 | The localization loss is the averaged **Smooth L1** loss between the encoded offsets of positively matched localization boxes and their ground truths.
495 |
496 | 
497 |
498 | #### Confidence loss
499 |
500 | Every prediction, no matter positive or negative, has a ground truth label associated with it. It is important that the model recognizes both objects and a lack of them.
501 |
502 | However, considering that there are usually only a handful of objects in an image, **the vast majority of the thousands of predictions we made do not actually contain an object**. As Walter White would say, _tread lightly_. If the negative matches overwhelm the positive ones, we will end up with a model that is less likely to detect objects because, more often than not, it is taught to detect the _background_ class.
503 |
504 | The solution may be obvious – limit the number of negative matches that will be evaluated in the loss function. But how do we choose?
505 |
506 | Well, why not use the ones that the model was most _wrong_ about? In other words, only use those predictions where the model found it hardest to recognize that there are no objects. This is called **Hard Negative Mining**.
507 |
508 | The number of hard negatives we will use, say `N_hn`, is usually a fixed multiple of the number of positive matches for this image. In this particular case, the authors have decided to use three times as many hard negatives, i.e. `N_hn = 3 * N_p`. The hardest negatives are discovered by finding the Cross Entropy loss for each negatively matched prediction and choosing those with the top `N_hn` losses.
509 |
510 | Then, the confidence loss is simply the sum of the **Cross Entropy** losses among the positive and hard negative matches.
511 |
512 | 
513 |
514 | You will notice that it is averaged by the number of positive matches.
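    |
    | A minimal sketch of selecting the hardest negatives, assuming we have already computed per-prior cross-entropy losses and a boolean mask of positive matches (the confidence loss would then be this sum plus the positives' losses, divided by the number of positives) –
    |
    | ```python
    | import torch
    |
    | def hard_negative_conf_loss(conf_loss_all, positive_priors, neg_pos_ratio=3):
    |     """conf_loss_all: per-prior cross-entropy losses, (N, 8732)
    |     positive_priors: boolean mask of positively matched priors, (N, 8732)
    |     Returns the summed loss over the hardest N_hn = neg_pos_ratio * N_p negatives per image."""
    |     n_hard_negatives = neg_pos_ratio * positive_priors.sum(dim=1)      # (N)
    |
    |     conf_loss_neg = conf_loss_all.clone()
    |     conf_loss_neg[positive_priors] = 0.                                 # positives can't be hard negatives
    |     conf_loss_neg, _ = conf_loss_neg.sort(dim=1, descending=True)       # hardest negatives first
    |
    |     hardness_ranks = torch.arange(conf_loss_all.size(1)).unsqueeze(0)   # (1, 8732)
    |     keep = hardness_ranks < n_hard_negatives.unsqueeze(1)               # (N, 8732)
    |     return conf_loss_neg[keep].sum()
    | ```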
515 |
516 | #### Total loss
517 |
518 | The **Multibox loss is the aggregate of the two losses**, combined in a ratio `α`.
519 |
520 | 
521 |
522 | In general, we needn't decide on a value for `α`. It could be a learnable parameter.
523 |
524 | For the SSD, however, the authors simply use `α = 1`, i.e. add the two losses. We'll take it!
525 |
526 | ### Processing predictions
527 |
528 | After the model is trained, we can apply it to images. However, the predictions are still in their raw form – two tensors containing the offsets and class scores for 8732 priors. These would need to be processed to **obtain final, human-interpretable bounding boxes with labels.**
529 |
530 | This entails the following –
531 |
532 | - We have 8732 predicted boxes represented as offsets `(g_c_x, g_c_y, g_w, g_h)` from their respective priors. Decode them to boundary coordinates, which are actually directly interpretable.
533 |
534 | - Then, for each _non-background_ class,
535 |
536 | - Extract the scores for this class for each of the 8732 boxes.
537 |
538 | - Eliminate boxes that do not meet a certain threshold for this score.
539 |
540 | - The remaining (uneliminated) boxes are candidates for this particular class of object.
541 |
542 | At this point, if you were to draw these candidate boxes on the original image, you'd see **many highly overlapping boxes that are obviously redundant**. This is because it's extremely likely that, from the thousands of priors at our disposal, more than one prediction corresponds to the same object.
543 |
544 | For instance, consider the image below.
545 |
546 | 
547 |
548 | There are clearly only three objects in it – two dogs and a cat. But according to the model, there are _three_ dogs and _two_ cats.
549 |
550 | Mind you, this is just a mild example. It could really be much, much worse.
551 |
552 | Now, to you, it may be obvious which boxes are referring to the same object. This is because your mind can process that certain boxes coincide significantly with each other and a specific object.
553 |
554 | In practice, how would this be done?
555 |
556 | First, **line up the candidates for each class in terms of how _likely_ they are**.
557 |
558 | 
559 |
560 | We've sorted them by their scores.
561 |
562 | The next step is to find which candidates are redundant. We already have a tool at our disposal to judge how much two boxes have in common with each other – the Jaccard overlap.
563 |
564 | So, if we were to **draw up the Jaccard similarities between all the candidates in a given class**, we could evaluate each pair and **if found to overlap significantly, keep only the _more likely_ candidate**.
565 |
566 | 
567 |
568 | Thus, we've eliminated the rogue candidates – one of each animal.
569 |
570 | This process is called __Non-Maximum Suppression (NMS)__ because when multiple candidates are found to overlap significantly with each other such that they could be referencing the same object, **we suppress all but the one with the maximum score**.
571 |
572 | Algorithmically, it is carried out as follows –
573 |
574 | - Upon selecting candidates for each _non-background_ class,
575 |
576 | - Arrange candidates for this class in order of decreasing likelihood.
577 |
578 | - Consider the candidate with the highest score. Eliminate all candidates with lesser scores that have a Jaccard overlap of more than, say, `0.5` with this candidate.
579 |
580 | - Consider the next highest-scoring candidate still remaining in the pool. Eliminate all candidates with lesser scores that have a Jaccard overlap of more than `0.5` with this candidate.
581 |
582 | - Repeat until you run through the entire sequence of candidates.
583 |
584 | The end result is that you will have just a single box – the very best one – for each object in the image.
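    |
    | Here's a compact sketch of this procedure for one class, reusing `find_jaccard_overlap()` from earlier – `boxes` and `scores` are the surviving candidates for that class –
    |
    | ```python
    | import torch
    |
    | def non_max_suppression(boxes, scores, max_overlap=0.5):
    |     """boxes: (n, 4) candidates in boundary coordinates; scores: (n,). Returns indices to keep."""
    |     scores, sort_ind = scores.sort(descending=True)        # most likely candidates first
    |     boxes = boxes[sort_ind]
    |     overlap = find_jaccard_overlap(boxes, boxes)            # (n, n)
    |
    |     suppress = torch.zeros(boxes.size(0), dtype=torch.bool)
    |     for i in range(boxes.size(0)):
    |         if suppress[i]:
    |             continue
    |         suppress = suppress | (overlap[i] > max_overlap)    # suppress lower-scoring overlapping boxes
    |         suppress[i] = False                                 # never suppress the box under consideration
    |     return sort_ind[~suppress]
    | ```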
585 |
586 | 
587 |
588 | Non-Maximum Suppression is quite crucial for obtaining quality detections.
589 |
590 | Happily, it's also the final step.
591 |
592 | # Implementation
593 |
594 | The sections below briefly describe the implementation.
595 |
596 | They are meant to provide some context, but **details are best understood directly from the code**, which is quite heavily commented.
597 |
598 | ### Dataset
599 |
600 | We will use Pascal Visual Object Classes (VOC) data from the years 2007 and 2012.
601 |
602 | #### Description
603 |
604 | This data contains images with twenty different types of objects.
605 |
606 | ```python
607 | {'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'}
608 | ```
609 |
610 | Each image can contain one or more ground truth objects.
611 |
612 | Each object is represented by –
613 |
614 | - a bounding box in absolute boundary coordinates
615 |
616 | - a label (one of the object types mentioned above)
617 |
618 | - a perceived detection difficulty (either `0`, meaning _not difficult_, or `1`, meaning _difficult_)
619 |
620 | #### Download
621 |
622 | Specifically, you will need to download the following VOC datasets –
623 |
624 | - [2007 _trainval_](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar) (460MB)
625 |
626 | - [2012 _trainval_](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2GB)
627 |
628 | - [2007 _test_](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar) (451MB)
629 |
630 | Consistent with the paper, the two _trainval_ datasets are to be used for training, while the VOC 2007 _test_ will serve as our test data.
631 |
632 | Make sure you extract both the VOC 2007 _trainval_ and 2007 _test_ data to the same location, i.e. merge them.
633 |
634 | ### Inputs to model
635 |
636 | We will need three inputs.
637 |
638 | #### Images
639 |
640 | Since we're using the SSD300 variant, the images would need to be sized at `300, 300` pixels and in the RGB format.
641 |
642 | Remember, we're using a VGG-16 base pretrained on ImageNet that is already available in PyTorch's `torchvision` module. [This page](https://pytorch.org/docs/master/torchvision/models.html) details the preprocessing or transformation we would need to perform in order to use this model – pixel values must be in the range [0,1] and we must then normalize the image by the mean and standard deviation of the ImageNet images' RGB channels.
643 |
644 | ```python
645 | mean = [0.485, 0.456, 0.406]
646 | std = [0.229, 0.224, 0.225]
647 | ```
648 |
649 | Also, PyTorch follows the NCHW convention, which means the channels dimension (C) must precede the size dimensions.
650 |
651 | Therefore, **images fed to the model must be a `Float` tensor of dimensions `N, 3, 300, 300`**, and must be normalized by the aforesaid mean and standard deviation. `N` is the batch size.
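    |
    | As a sketch, this preprocessing could be expressed with standard `torchvision` transforms (the code performs the equivalent operations in its own transformation functions) –
    |
    | ```python
    | import torchvision.transforms as T
    |
    | # assumes `original_image` is a PIL image in RGB mode
    | preprocess = T.Compose([
    |     T.Resize((300, 300)),
    |     T.ToTensor(),                                  # FloatTensor in [0, 1], shape (3, 300, 300)
    |     T.Normalize(mean=[0.485, 0.456, 0.406],
    |                 std=[0.229, 0.224, 0.225]),
    | ])
    | image = preprocess(original_image).unsqueeze(0)    # add the batch dimension: (1, 3, 300, 300)
    | ```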
652 |
653 | #### Objects' Bounding Boxes
654 |
655 | We would need to supply, for each image, the bounding boxes of the ground truth objects present in it in fractional boundary coordinates `(x_min, y_min, x_max, y_max)`.
656 |
657 | Since the number of objects in any given image can vary, we can't use a fixed size tensor for storing the bounding boxes for the entire batch of `N` images.
658 |
659 | Therefore, **ground truth bounding boxes fed to the model must be a list of length `N`, where each element of the list is a `Float` tensor of dimensions `N_o, 4`**, where `N_o` is the number of objects present in that particular image.
660 |
661 | #### Objects' Labels
662 |
663 | We would need to supply, for each image, the labels of the ground truth objects present in it.
664 |
665 | Each label would need to be encoded as an integer from `1` to `20` representing the twenty different object types. In addition, we will add a _background_ class with index `0`, which indicates the absence of an object in a bounding box. (But naturally, this label will not actually be used for any of the ground truth objects in the dataset.)
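    |
    | One way to build such a mapping (a `label_map` of this form is defined in `utils.py`) –
    |
    | ```python
    | voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair',
    |               'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant',
    |               'sheep', 'sofa', 'train', 'tvmonitor')
    | label_map = {k: v + 1 for v, k in enumerate(voc_labels)}   # 'aeroplane' -> 1, ..., 'tvmonitor' -> 20
    | label_map['background'] = 0
    | rev_label_map = {v: k for k, v in label_map.items()}       # for decoding predictions later
    | ```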
666 |
667 | Again, since the number of objects in any given image can vary, we can't use a fixed size tensor for storing the labels for the entire batch of `N` images.
668 |
669 | Therefore, **ground truth labels fed to the model must be a list of length `N`, where each element of the list is a `Long` tensor of dimensions `N_o`**, where `N_o` is the number of objects present in that particular image.
670 |
671 | ### Data pipeline
672 |
673 | As you know, our data is divided into _training_ and _test_ splits.
674 |
675 | #### Parse raw data
676 |
677 | See `create_data_lists()` in [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py).
678 |
679 | This parses the data downloaded and saves the following files –
680 |
681 | - A **JSON file for each split with a list of the absolute filepaths of `I` images**, where `I` is the total number of images in the split.
682 |
683 | - A **JSON file for each split with a list of `I` dictionaries containing ground truth objects, i.e. bounding boxes in absolute boundary coordinates, their encoded labels, and perceived detection difficulties**. The `i`th dictionary in this list will contain the objects present in the `i`th image in the previous JSON file.
684 |
685 | - A **JSON file which contains the `label_map`**, the label-to-index dictionary with which the labels are encoded in the previous JSON file. This dictionary is also available in [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) and directly importable.
686 |
687 | #### PyTorch Dataset
688 |
689 | See `PascalVOCDataset` in [`datasets.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/datasets.py).
690 |
691 | This is a subclass of PyTorch [`Dataset`](https://pytorch.org/docs/master/data.html#torch.utils.data.Dataset), used to **define our training and test datasets.** It needs a `__len__` method defined, which returns the size of the dataset, and a `__getitem__` method which returns the `i`th image, bounding boxes of the objects in this image, and labels for the objects in this image, using the JSON files we saved earlier.
692 |
693 | You will notice that it also returns the perceived detection difficulties of each of these objects, but these are not actually used in training the model. They are required only in the [Evaluation](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#evaluation) stage for computing the Mean Average Precision (mAP) metric. We also have the option of filtering out _difficult_ objects entirely from our data to speed up training at the cost of some accuracy.
694 |
695 | Additionally, inside this class, **each image and the objects in them are subject to a slew of transformations** as described in the paper and outlined below.
696 |
697 | #### Data Transforms
698 |
699 | See `transform()` in [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py).
700 |
701 | This function applies the following transformations to the images and the objects in them –
702 |
703 | - Randomly **adjust brightness, contrast, saturation, and hue**, each with a 50% chance and in random order.
704 |
705 | - With a 50% chance, **perform a _zoom out_ operation** on the image. This helps with learning to detect small objects. The zoomed out image must be between `1` and `4` times as large as the original. The surrounding space could be filled with the mean of the ImageNet data.
706 |
707 | - Randomly crop image, i.e. **perform a _zoom in_ operation.** This helps with learning to detect large or partial objects. Some objects may even be cut out entirely. Crop dimensions are to be between `0.3` and `1` times the original dimensions. The aspect ratio is to be between `0.5` and `2`. Each crop is made such that there is at least one bounding box remaining that has a Jaccard overlap of either `0`, `0.1`, `0.3`, `0.5`, `0.7`, or `0.9`, randomly chosen, with the cropped image. In addition, any bounding boxes remaining whose centers are no longer in the image as a result of the crop are discarded. There is also a chance that the image is not cropped at all.
708 |
709 | - With a 50% chance, **horizontally flip** the image.
710 |
711 | - **Resize** the image to `300, 300` pixels. This is a requirement of the SSD300.
712 |
713 | - Convert all boxes from **absolute to fractional boundary coordinates.** At all stages in our model, all boundary and center-size coordinates will be in their fractional forms.
714 |
715 | - **Normalize** the image with the mean and standard deviation of the ImageNet data that was used to pretrain our VGG base.
716 |
717 | As mentioned in the paper, these transformations play a crucial role in obtaining the stated results.
718 |
719 | #### PyTorch DataLoader
720 |
721 | The `Dataset` described above, `PascalVOCDataset`, will be used by a PyTorch [`DataLoader`](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) in `train.py` to **create and feed batches of data to the model** for training or evaluation.
722 |
723 | Since the number of objects varies across different images, their bounding boxes, labels, and difficulties cannot simply be stacked together in the batch. There would be no way of knowing which objects belong to which image.
724 |
725 | Instead, we need to **pass a collating function to the `collate_fn` argument**, which instructs the `DataLoader` about how it should combine these varying size tensors. The simplest option would be to use Python lists.
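    |
    | A minimal sketch of such a collating function (the code attaches an equivalent `collate_fn` to `PascalVOCDataset`) –
    |
    | ```python
    | import torch
    |
    | def collate_fn(batch):
    |     """batch: list of (image, boxes, labels, difficulties) tuples from the dataset.
    |     Images can be stacked into one tensor; the variable-size targets stay as plain lists."""
    |     images, boxes, labels, difficulties = zip(*batch)
    |     return torch.stack(images, dim=0), list(boxes), list(labels), list(difficulties)
    |
    | # e.g. DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
    | ```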
726 |
727 | ### Base Convolutions
728 |
729 | See `VGGBase` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
730 |
731 | Here, we **create and apply base convolutions.**
732 |
733 | The layers are initialized with parameters from a pretrained VGG-16 with the `load_pretrained_layers()` method.
734 |
735 | We're especially interested in the lower-level feature maps that result from `conv4_3` and `conv7`, which we return for use in subsequent stages.
736 |
737 | ### Auxiliary Convolutions
738 |
739 | See `AuxiliaryConvolutions` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
740 |
741 | Here, we **create and apply auxiliary convolutions.**
742 |
743 | We use a [uniform Xavier initialization](https://pytorch.org/docs/stable/nn.html#torch.nn.init.xavier_uniform_) for the parameters of these layers.
744 |
745 | We're especially interested in the higher-level feature maps that result from `conv8_2`, `conv9_2`, `conv10_2` and `conv11_2`, which we return for use in subsequent stages.
746 |
747 | ### Prediction Convolutions
748 |
749 | See `PredictionConvolutions` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
750 |
751 | Here, we **create and apply localization and class prediction convolutions** to the feature maps from `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2` and `conv11_2`.
752 |
753 | These layers are initialized in a manner similar to the auxiliary convolutions.
754 |
755 | We also **reshape the resulting prediction maps and stack them** as discussed. Note that reshaping in PyTorch is only possible if the original tensor is stored in a [contiguous](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.contiguous) chunk of memory.
756 |
757 | As expected, the stacked localization and class predictions will be of dimensions `8732, 4` and `8732, 21` respectively.
758 |
759 | ### Putting it all together
760 |
761 | See `SSD300` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
762 |
763 | Here, the **base, auxiliary, and prediction convolutions are combined** to form the SSD.
764 |
765 | There is a small detail here – the lowest level features, i.e. those from `conv4_3`, are expected to be on a significantly different numerical scale compared to their higher-level counterparts. Therefore, the authors recommend L2-normalizing and then rescaling _each_ of its channels by a learnable value.
766 |
767 | ### Priors
768 |
769 | See `create_prior_boxes()` under `SSD300` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
770 |
771 | This function **creates the priors in center-size coordinates** as defined for the feature maps from `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2` and `conv11_2`, _in that order_. Furthermore, for each feature map, we create the priors at each tile by traversing it row-wise.
772 |
773 | This ordering of the 8732 priors thus obtained is very important because it needs to match the order of the stacked predictions.
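    |
    | A compressed sketch of how the priors for a single feature map (here `conv9_2`) can be generated in center-size coordinates – the real `create_prior_boxes()` does this for all six feature maps, in order –
    |
    | ```python
    | import torch
    | from math import sqrt
    |
    | fmap_dim, scale = 5, 0.55                          # conv9_2
    | extra_scale = sqrt(0.55 * 0.725)                   # geometric mean with the next feature map's scale
    | aspect_ratios = [1., 2., 0.5, 3., 1. / 3.]
    |
    | priors = []
    | for i in range(fmap_dim):                          # rows
    |     for j in range(fmap_dim):                      # columns
    |         cx, cy = (j + 0.5) / fmap_dim, (i + 0.5) / fmap_dim
    |         for a in aspect_ratios:
    |             priors.append([cx, cy, scale * sqrt(a), scale / sqrt(a)])
    |         priors.append([cx, cy, extra_scale, extra_scale])   # the extra 1:1 prior
    | priors = torch.FloatTensor(priors).clamp_(0, 1)    # (150, 4)
    | ```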
774 |
775 | ### Multibox Loss
776 |
777 | See `MultiBoxLoss` in [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py).
778 |
779 | Two empty tensors are created to store localization and class prediction targets, i.e. _ground truths_, for the 8732 predicted boxes in each image.
780 |
781 | We **find the ground truth object with the maximum Jaccard overlap for each prior**, which is stored in `object_for_each_prior`.
782 |
783 | We want to avoid the rare situation where not all of the ground truth objects have been matched. Therefore, we also **find the prior with the maximum overlap for each ground truth object**, stored in `prior_for_each_object`. We explicitly add these matches to `object_for_each_prior` and artificially set their overlaps to a value above the threshold so they are not eliminated.
784 |
785 | Based on the matches in `object_for_each_prior`, we set the corresponding labels, i.e. **targets for class prediction**, to each of the 8732 priors. For those priors that don't overlap significantly with their matched objects, the label is set to _background_.
786 |
787 | Also, we encode the coordinates of the 8732 matched objects in `object_for_each_prior` in offset form `(g_c_x, g_c_y, g_w, g_h)` with respect to these priors, to form the **targets for localization**. Not all of these 8732 localization targets are meaningful. As we discussed earlier, only the predictions arising from the non-background priors will be regressed to their targets.
788 |
789 | The **localization loss** is the [Smooth L1 loss](https://pytorch.org/docs/stable/nn.html#torch.nn.SmoothL1Loss) over the positive matches.
790 |
791 | Perform Hard Negative Mining – rank class predictions matched to _background_, i.e. negative matches, by their individual [Cross Entropy losses](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss). The **confidence loss** is the Cross Entropy loss over the positive matches and the hardest negative matches. Nevertheless, it is averaged only by the number of positive matches.
792 |
793 | The **Multibox Loss is the aggregate of these two losses**, combined in the ratio `α`. In our case, they are simply being added because `α = 1`.
794 |
795 | # Training
796 |
797 | Before you begin, make sure to save the required data files for training and evaluation. To do this, run the contents of [`create_data_lists.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/create_data_lists.py) after pointing it to the `VOC2007` and `VOC2012` folders in your [downloaded data](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#download).
798 |
799 | See [`train.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/train.py).
800 |
801 | The parameters for the model (and training it) are at the beginning of the file, so you can easily check or modify them should you need to.
802 |
803 | To **train your model from scratch**, run this file –
804 |
805 | `python train.py`
806 |
807 | To **resume training at a checkpoint**, point to the corresponding file with the `checkpoint` parameter at the beginning of the code.
808 |
809 | ### Remarks
810 |
811 | In the paper, they recommend using **Stochastic Gradient Descent** in batches of `32` images, with an initial learning rate of `1e-3`, momentum of `0.9`, and `5e-4` weight decay.
812 |
813 | I ended up using a batch size of `8` images for increased stability. If you find that your gradients are exploding, you could reduce the batch size, like I did, or clip gradients.
814 |
815 | The authors also doubled the learning rate for bias parameters. As you can see in the code, this is easy to do in PyTorch, by passing [separate groups of parameters](https://pytorch.org/docs/stable/optim.html#per-parameter-options) to the `params` argument of its [SGD optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD).
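    |
    | A sketch of how those parameter groups might be set up, where `model` stands for the SSD300 –
    |
    | ```python
    | import torch
    |
    | biases, not_biases = [], []
    | for param_name, param in model.named_parameters():
    |     if param.requires_grad:
    |         (biases if param_name.endswith('.bias') else not_biases).append(param)
    |
    | optimizer = torch.optim.SGD([{'params': biases, 'lr': 2 * 1e-3},   # biases learn at twice the rate
    |                              {'params': not_biases}],
    |                             lr=1e-3, momentum=0.9, weight_decay=5e-4)
    | ```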
816 |
817 | The paper recommends training for 80000 iterations at the initial learning rate. Then, it is decayed by 90% (i.e. to a tenth) for an additional 20000 iterations, _twice_. With the paper's batch size of `32`, this means that the learning rate is decayed by 90% once after the 154th epoch and once more after the 193rd epoch, and training is stopped after 232 epochs. I followed this schedule.
818 |
819 | On a TitanX (Pascal), each epoch of training required about 6 minutes.
820 |
821 | I should note here that two unintended differences from the paper were brought to my attention by readers of this tutorial:
822 |
823 | - My priors that overshoot the edges of the image are not being clipped, as pointed out in issue [#94](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/94) by _@AakiraOtok_. This does not appear to have a negative effect on performance, however, as discussed in that issue and also verified in issue [#95](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/95) by the same reader. It is even possible that there is a slight improvement in performance, but this may be too small to be conclusive.
824 |
825 | - I mistakenly used L1 loss instead of *smooth* L1 loss as the localization loss, as pointed out in issue [#60](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/60) by _jonathan016_. This also appears to have no negative effect on performance as pointed out in that issue, but _smooth_ L1 loss may offer better training stability with larger batch sizes as mentioned in [this comment](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/94#issuecomment-1590217018).
826 |
827 | ### Model checkpoint
828 |
829 | You can download this pretrained model [here](https://drive.google.com/open?id=1bvJfF6r_zYl2xZEpYXxgb7jLQHFZ01Qe).
830 |
831 | Note that this checkpoint should be [loaded directly with PyTorch](https://pytorch.org/docs/stable/torch.html?#torch.load) for evaluation or inference – see below.
832 |
833 | # Evaluation
834 |
835 | See [`eval.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/eval.py).
836 |
837 | The data-loading and checkpoint parameters for evaluating the model are at the beginning of the file, so you can easily check or modify them should you wish to.
838 |
839 | To begin evaluation, simply run the `evaluate()` function with the data-loader and model checkpoint. **Raw predictions for each image in the test set are obtained and parsed** with the checkpoint's `detect_objects()` method, which implements [this process](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#processing-predictions). Evaluation has to be done at a `min_score` of `0.01`, an NMS `max_overlap` of `0.45`, and `top_k` of `200` to allow fair comparison of results with the paper and other implementations.
840 |
841 | **Parsed predictions are evaluated against the ground truth objects.** The evaluation metric is the _Mean Average Precision (mAP)_. If you're not familiar with this metric, [here's a great explanation](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173).
842 |
843 | We will use `calculate_mAP()` in [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) for this purpose. As is the norm, we will ignore _difficult_ detections in the mAP calculation. Nevertheless, it is important to include them in the evaluation dataset, because if the model does detect an object that is considered _difficult_, it must not be counted as a false positive.
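
For reference, the call in [`eval.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/eval.py) looks like this, after detections and ground truths have been accumulated over the entire test set –

```python
# As in eval.py – detections and ground truths are gathered over all test batches first
APs, mAP = calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties)
```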
844 |
845 | The model scores **77.2 mAP**, same as the result reported in the paper.
846 |
847 | Class-wise average precisions (not scaled to 100) are listed below.
848 |
849 | | Class | Average Precision |
850 | | :-----: | :------: |
851 | | _aeroplane_ | 0.7887580990791321 |
852 | | _bicycle_ | 0.8351995348930359 |
853 | | _bird_ | 0.7623348236083984 |
854 | | _boat_ | 0.7218425273895264 |
855 | | _bottle_ | 0.45978495478630066 |
856 | | _bus_ | 0.8705356121063232 |
857 | | _car_ | 0.8655831217765808 |
858 | | _cat_ | 0.8828985095024109 |
859 | | _chair_ | 0.5917483568191528 |
860 | | _cow_ | 0.8255912661552429 |
861 | | _diningtable_ | 0.756867527961731 |
862 | | _dog_ | 0.856262743473053 |
863 | | _horse_ | 0.8778411149978638 |
864 | | _motorbike_ | 0.8316892385482788 |
865 | | _person_ | 0.7884440422058105 |
866 | | _pottedplant_ | 0.5071538090705872 |
867 | | _sheep_ | 0.7936667799949646 |
868 | | _sofa_ | 0.7998116612434387 |
869 | | _train_ | 0.8655905723571777 |
870 | | _tvmonitor_ | 0.7492395043373108 |
871 |
872 | You can see that some objects, like bottles and potted plants, are considerably harder to detect than others.
873 |
874 | # Inference
875 |
876 | See [`detect.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/detect.py).
877 |
878 | Point to the model you want to use for inference with the `checkpoint` parameter at the beginning of the code.
879 |
880 | Then, you can use the `detect()` function to identify and visualize objects in an RGB image.
881 |
882 | ```python
883 | img_path = '/path/to/ima.ge'
884 | original_image = PIL.Image.open(img_path, mode='r')
885 | original_image = original_image.convert('RGB')
886 |
887 | detect(original_image, min_score=0.2, max_overlap=0.5, top_k=200).show()
888 | ```
889 |
890 | This function first **preprocesses the image by resizing and normalizing its RGB channels** as required by the model. It then **obtains raw predictions from the model, which are parsed** by the `detect_objects()` method in the model. The parsed results are converted from fractional to absolute boundary coordinates, their labels are decoded with the `label_map`, and they are **visualized on the image**.
891 |
892 | There are no one-size-fits-all values for `min_score`, `max_overlap`, and `top_k`. You may need to experiment a little to find what works best for your target data.
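
For instance, continuing from the snippet above, you could raise `min_score` for fewer but more confident boxes, and use the `suppress` argument of `detect()` in [`detect.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/detect.py) to hide classes you know cannot be in the image – the values here are purely illustrative –

```python
# Illustrative values only – tune thresholds for your data, and suppress classes you don't want drawn
detect(original_image, min_score=0.3, max_overlap=0.45, top_k=100, suppress=['person']).show()
```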
893 |
894 | ### Some more examples
895 |
896 | ---
897 |
898 |
899 |
900 |
901 |
902 | ---
903 |
904 |
905 |
906 |
907 |
908 | ---
909 |
910 |
911 |
912 |
913 |
914 | ---
915 |
916 |
917 |
918 |
919 |
920 | ---
921 |
922 |
923 |
924 |
925 |
926 | ---
927 |
928 |
929 |
930 |
931 |
932 | ---
933 |
934 |
935 |
936 |
937 |
938 | ---
939 |
940 |
941 |
942 |
943 |
944 | ---
945 |
946 |
947 |
948 |
949 |
950 | ---
951 |
952 |
953 |
954 |
955 |
956 | ---
957 |
958 |
959 |
960 |
961 |
962 | ---
963 |
964 | # FAQs
965 |
966 | __I noticed that priors often overshoot the `3, 3` kernel employed in the prediction convolutions. How can the kernel detect a bound (of an object) outside it?__
967 |
968 | Don't confuse the kernel with its _receptive field_, which is the area of the original image that falls within the kernel's field-of-view.
969 |
970 | For example, on the `38, 38` feature map from `conv4_3`, a `3, 3` kernel covers an area of `0.08, 0.08` in fractional coordinates. The priors are `0.1, 0.1`, `0.14, 0.07`, `0.07, 0.14`, and `0.14, 0.14`.
971 |
972 | But its receptive field, which [you can calculate](https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807), is a whopping `0.36, 0.36`! Therefore, all priors (and objects contained therein) are present well inside it.
973 |
974 | Keep in mind that the receptive field grows with every successive convolution. For `conv_7` and the higher-level feature maps, a `3, 3` kernel's receptive field will cover the _entire_ `300, 300` image. But, as always, the pixels in the original image that are closer to the center of the kernel have greater representation, so it is still _local_ in a sense.
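
If you'd like to verify such numbers yourself, here is a small sketch of the standard receptive-field recursion from the linked article (the layer list is made up for illustration, not the actual VGG/SSD stack): each layer grows the receptive field by `(kernel - 1) * jump` and multiplies the jump by its stride –

```python
# Receptive-field arithmetic: r_out = r_in + (k - 1) * j_in ; j_out = j_in * s
def receptive_field(layers):
    r, j = 1, 1  # receptive field and jump (effective stride) at the input
    for k, s in layers:  # (kernel_size, stride) of each layer
        r = r + (k - 1) * j
        j = j * s
    return r

example = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]  # two conv-conv-pool blocks
print(receptive_field(example))  # -> 16 pixels on the input
```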
975 |
976 | ---
977 |
978 | __While training, why can't we match predicted boxes directly to their ground truths?__
979 |
980 | We cannot directly check for overlap or coincidence between predicted boxes and ground truth objects to match them because predicted boxes are not to be considered reliable, _especially_ during the training process. This is the very reason we are trying to evaluate them in the first place!
981 |
982 | And this is why priors are especially useful. We can match a predicted box to a ground truth box by means of the prior it is supposed to be approximating. It no longer matters how correct or wildly wrong the prediction is.
983 |
984 | ---
985 |
986 | __Why do we even have a _background_ class if we're only checking which _non-background_ classes meet the threshold?__
987 |
988 | When there is no object in the approximate field of the prior, a high score for _background_ will dilute the scores of the other classes such that they will not meet the detection threshold.
989 |
990 | ---
991 |
992 | __Why not simply choose the class with the highest score instead of using a threshold?__
993 |
994 | I think that's a valid strategy. After all, we implicitly conditioned the model to choose _one_ class when we trained it with the Cross Entropy loss. But you will find that you won't achieve the same performance as you would with a threshold.
995 |
996 | I suspect this is because object detection is open-ended enough that there's room for doubt in the trained model as to what's really in the field of the prior. For example, the score for _background_ may be high if there is an appreciable amount of backdrop visible in an object's bounding box. There may even be multiple objects present in the same approximate region. A simple threshold will yield all possibilities for our consideration, and it just works better.
997 |
998 | Redundant detections aren't really a problem since we're NMS-ing the hell out of 'em.
999 |
1000 |
1001 | ---
1002 |
1003 | __Sorry, but I gotta ask... _[what's in the boooox?!](https://cnet4.cbsistatic.com/img/cLD5YVGT9pFqx61TuMtcSBtDPyY=/570x0/2017/01/14/6d8103f7-a52d-46de-98d0-56d0e9d79804/se7en.png)___
1004 |
1005 | Ha.
1006 |
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
1 | 这是一个**[PyTorch](https://pytorch.org)的目标检测教程**。
2 |
3 | 这是我正在写的[一系列教程](https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch)中的第三个,这些教程是关于你自己用牛逼的PyTorch库*实现*一些很酷的模型的。
4 |
5 | 假设你已经掌握了这些基础:Pytorch,卷积神经网络。
6 |
7 | 如果你是PyTorch新手,先看看[PyTorch深度学习:60分钟闪电战](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)和[通过例子学习PyTorch](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)。
8 |
9 | 问题,建议或者勘误可以发送到issues。
10 |
11 | 我的环境是`Python 3.6`下的`PyTorch 0.4`。
12 |
13 | ---
14 |
15 | **2020年2月27日**:两个新教程的工作代码已经添加——[Super-Resolution](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Super-Resolution)和[Machine Translation](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Machine-Translation)
16 |
17 | # 目录
18 |
19 | [***目标***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#objective)
20 |
21 | [***概念***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#concepts)
22 |
23 | [***概览***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#overview)
24 |
25 | [***实现***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#implementation)
26 |
27 | [***训练***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#training)
28 |
29 | [***评估***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#evaluation)
30 |
31 | [***推理***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#inference)
32 |
33 | [***FAQs***](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#faqs)
34 |
35 | # 目标
36 |
37 | **建立一个模型来检测和定位图片中的某个目标。**
38 |
39 |
40 |
41 |
42 |
43 | 我们将会实现 [Single Shot Multibox Detector (SSD)](https://arxiv.org/abs/1512.02325),一款该任务下流行、强大并且十分灵活的网络。作者原始的实现能在[这里](https://github.com/weiliu89/caffe/tree/ssd)找到
44 |
45 | 以下是一些在训练中未见过的图片上的目标检测示例——
46 |
47 | ---
48 |
49 |
50 |
51 |
52 |
53 | ---
54 |
55 |
56 |
57 |
58 |
59 | ---
60 |
61 |
62 |
63 |
64 |
65 | ---
66 |
67 |
68 |
69 |
70 |
71 | ---
72 |
73 |
74 |
75 |
76 |
77 | ---
78 |
79 |
80 |
81 |
82 |
83 | ---
84 |
85 |
86 |
87 |
88 |
89 | ---
90 |
91 |
92 |
93 |
94 |
95 | ---
96 |
97 | [教程末尾](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#some-more-examples)有更多的例子。
98 |
99 | ---
100 |
101 | # 概念
102 |
103 | - **目标检测**(Object Detection):不解释。
104 | - **SSD**(Single-Shot Detection):早期目标检测分为两个部分——一个是找出目标位置的网络(原文强调该网络负责提出那些存在目标的区域),和一个检测目标区域中实体的分类器。从计算角度来说,这两部分可能会非常贵并且对于实时、实地应用都不合适。SSD模型把精定位和检测任务压缩到一个网络的单次前向传播过程,在能部署在更轻量级的硬件上的同时带来了显著更快的检测。
105 | - **多尺寸特征图**(Multiscale Feature Maps):在图像分类的任务中,其预测结果是建立在最后一层卷积特征图上的,这一层特征图是最小但同时也是原图最深层次的代表。在目标检测中,来自中间卷积层的特征图也会_直接_起作用,因为它们代表了原图的不同尺寸。因此,一个固定尺寸的过滤器(卷积核)作用于不同的特征图能检测出不同尺寸的目标。
106 | - **预定位**(Priors):在一张特征图上的具体位置上会有一些预定位框(原文这里指提前定位好),这些定位框有着特定的大小。预定位框是仔细选择后与数据集中目标的定位框(也就是数据集中实实在在的定位框)特征最相似的
107 | - **多定位框**(Multibox):一种把预测定位框表述为回归问题的[技术](https://arxiv.org/abs/1312.2249),检测目标的坐标回归到它真实的坐标。此外,对于每一个将要被预测的定位框,对于不同的目标类别会有不同的得分。预定位将作为预测的可行起始点,因为这些预定位框是根据事实(数据集)建模的。因此将会出现与预定位一样多的预测框,尽管大多数可能不含有目标。
108 | - **硬性负样本挖掘**(Hard Negative Mining):这指的是选择那些预测结果中令人震惊的假正例(FP:False Positive),并加强在这些样本上的学习。换句话说,我们仅在模型中_最难_正确识别的负样本中挖掘信息。如上文所说绝大多数预测框不含目标,这可以平衡正负样本。
109 | - **非最大抑制**(Non-Maximum Suppression):对于任意给定位置,显然多个预定位框会重叠。因此,由这些预定位框产生的预测实际上可能是同一个重复的目标。非最大抑制(NMS)是通过抑制除了最高得分以外的所有预测来消除冗余的手段。
110 |
111 | # 概览
112 |
113 | 这一部分,我会讲一下这个模型的概述。如果你已经熟悉了,你可以跳过直接到[实现](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#implementation)部分或者去看看代码注释。
114 |
115 | 随着我们的深入,你会注意到SSD的结构和思想包含了相当多的工程设计。如果一些概念看起来相当做作或是不讲道理,别担心!记住,它建立在这个领域多年的研究之上(通常是经验主义的)
116 |
117 | ### 定义
118 |
119 | 边框就是一个盒子。定位框就是将目标围起来的盒子,也就代表了目标的边界
120 |
121 | 在这个教程中,我们只会遇到两种类型——边框和定位框。但是所有的边框都只会呈现在图片上并且我们需要能够计算他们的位置、形状、大小还有其他属性
122 |
123 | #### 边界坐标
124 |
125 | 最明显的方式来描述一个边框是通过边界线`x`和`y`的像素坐标
126 |
127 | 
128 |
129 | 这张图中的定位框的边界坐标就是**`(x_min, y_min, x_max, y_max)`**
130 |
131 | 但如果我们不知道实际图像的尺寸,像素值几乎是没用的。一个更好的办法是将所有坐标替换为它们的分数形式
132 |
133 | 
134 |
135 | 现在坐标就与图像大小无关并且所有图片上的所有边框都在同一尺度下。
136 |
137 | #### 中心——大小坐标
138 |
139 | 这是一个对边框位置和大小更直接的表示
140 |
141 | 
142 |
143 | 这张图的中心——大小坐标就是**`(c_x, c_y, w, h)`**。
144 |
145 | 在代码中,你会发现我们通常两种坐标都会使用,这取决于它们对任务的适应性,并且我们_总是_使用它们的分数形式
146 |
147 | #### 交并比(Jaccard Index)
148 |
149 | Jaccard Index也叫Jaccard Overlap或者说交并比( Intersection-over-Union loU)用于测量两个边框的重叠程度
150 |
151 | 
152 |
153 | 交并比等于1意味着两个边框相同,0表示两个边框互斥
154 |
155 | 这是一个数学度量,但同样能在我们的模型中找到许多应用
156 |
157 | ### 多边框
158 |
159 | 多边框是一个目标检测的技术,其由两个部分组成——
160 |
161 | - **可能含也可能不含的目标的边框的坐标**。这是一个回归问题。
162 | - **特定边框中不同目标类型的得分**,包括一个*背景*类来表示边框中没有目标,这是一个*分类*任务
163 |
164 | ### Single Shot Detector (SSD)
165 |
166 | SSD是一个纯粹的卷积神经网络(CNN),我们可以把它归结为三类——
167 |
168 | - **基础卷积** 借鉴自现有的图片分类结构,这个结构将提供低维特征
169 | - **辅助卷积** 添加在基础网络之上,这个结构将提供高维特征
170 | - **预测卷积** 这个结构将在特征图中定位并识别目标
171 |
172 | 论文中给出了这个模型的两种变体:SSD300与SSD512。
173 |
174 | 其中后缀代表输入图片的大小。尽管两种网络在构建的时候稍有不同,但它们在原理上是一致的。SSD512仅仅只是更大、效果稍好一点的网络。
175 |
176 | ### 基础卷积——part1
177 |
178 | 首先,为什么在现存网络中使用卷积结构?
179 |
180 | 因为经过论证在图片分类表现良好的模型,已经有相当好的图片本质捕捉能力。同样的卷积特征在目标检测上十分有用,尤其是局部感知上——我们更想关注目标所在的部分而不是把图像当做一个整体
181 |
182 | 此外,优势还有能够使用在可靠分类数据集上的预训练层。正如你所知道的,这叫做*迁移学习*。通过借鉴一个不同但是密切相关的任务,我们甚至在开始前就有了进展。
183 |
184 | 论文的作者将**VGG-16结构**作为其基础网络。它的原始形式相当简单。
185 |
186 | 
187 |
188 | 他们建议使用在ImageNet*大规模视觉识别竞赛*(ILSVRC)分类任务中预先训练过的模型。幸运的是,PyTorch中已经有一个可用的模型,其他流行的AI引擎也是如此。如果你愿意,你可以选择像ResNet这样更大的东西。只需注意计算要求。
189 |
190 | 根据这篇论文,**我们必须对这个预先训练的网络进行一些更改**,以使其适应我们自己在目标检测方面的挑战。有些是合乎逻辑和必要的,而另一些则主要是出于方便或偏好。
191 |
192 | - **输入大小**:如上所说`300, 300`
193 | - **第三个池化层**:在将维度减半时,将使用数学中的`ceiling`函数(向上取整)而不是默认的`floor`函数来确定输出大小。只有当前面的特征图维度是奇数而非偶数时,这才有影响。通过观察上面的图片,你能够计算出:当我们的输入图片大小是`300, 300`时,`conv3_3`特征图的大小将是`75, 75`,减半后为`38, 38`,而不是麻烦的`37, 37`
194 | - 我们将第**五个池化层**从`2, 2`内核,步长`2`修改为`3, 3`内核,步长`1`。这样做的效果是,它不再将先前卷积层的特征图的维度减半。
195 | - 我们不需要全连接层(事实上是分类),他们在这里毫无用处。我们将完全砍掉`fc8`,并且选择***将*`fc6`和`fc7`重做为卷积层`conv6`和`conv7`**
196 |
197 | 第一步的三个修改已经足够直接了,但最后一个可能需要一些解释
198 |
199 | ### 全连接→ 卷积层(FC → Convolutional Layer)
200 |
201 | 我们如何将全连接层重新参数化为卷积层?
202 |
203 | 注意到下面的方案。
204 |
205 | 在典型的图像分类设置中,第一个全连接层不能对前面的特征图或图像_直接_进行操作。我们需要把它压扁成一维结构。
206 |
207 | 
208 |
209 | 在这个例子中,有一张`2, 2, 3`维度的图片,展开为一个大小`12`的一维向量,对于输出大小`2`,全连接层计算两次点积,分别是这个展开的一维向量和两个相同大小`12`的向量的点积。**这两个向量,在图中用灰色表示,就是全连接层的参数**。
210 |
211 | 现在考虑一个不同的方案,在这个方案中,我们使用卷积层来产生两个输出值
212 |
213 | 
214 |
215 | 这里,图片的维度是`2, 2, 3`定死了要保证不被展开。卷积层使用两个过滤器来执行两个点积, 每个过滤器包含`12`个元素并与图像形状相同。**这两个过滤器,在图中用灰色表示,就是卷积层的参数**。
216 |
217 | 这就是关键点——**在两种方案中,输出`Y_0`和`Y_1`是一样的!**
218 |
219 | 
220 |
221 | 这两种方案是恒等的。
222 |
223 | 这告诉我们什么?
224 |
225 | **在一张大小为`H, W`、`I`通道的图片上,输出大小为`N`的全连接层,等价于一个卷积核大小与图片相同(`H, W`)、输入为`I`通道的卷积层**,前提是全连接层的参数(形状为`N, H * W * I`)与卷积层的参数(形状为`N, H, W, I`)在数值上相同
226 |
227 | 
228 |
229 | 因此,**通过改变参数的形状**,任何全连接层都能被转换为一个等价卷积层。
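
下面是基于上文例子的一个最小示意(并非仓库中的实际代码),假设展平顺序为PyTorch默认的`(C, H, W)`,可以验证两种方案的输出确实相同——

```python
import torch
import torch.nn as nn

# 示意:一个在 2x2、3 通道输入上、输出大小为 2 的全连接层
fc = nn.Linear(2 * 2 * 3, 2)

# 等价的卷积层:卷积核大小与输入相同(2x2),输入通道 3,输出通道 2
conv = nn.Conv2d(3, 2, kernel_size=2)
# 把全连接层的参数重塑为 (N, C, H, W) 后直接复用
conv.weight.data = fc.weight.data.view(2, 3, 2, 2)
conv.bias.data = fc.bias.data

x = torch.randn(1, 3, 2, 2)
y_fc = fc(x.view(1, -1))        # 方案一:展平后过全连接层
y_conv = conv(x).view(1, -1)    # 方案二:直接过卷积层
print(torch.allclose(y_fc, y_conv, atol=1e-6))  # True
```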
230 |
231 | ### 基础卷积——part2
232 |
233 | 我们现在知道如何将原来VGG-16结构中的`fc6`与`fc7`分别地转换为`conv6`与`conv7`
234 |
235 | 在[之前展示]((https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#base-convolutions--part-1))的ImageNet VGG-16中,在对图片大小为`224, 224, 3`的操作中,你能发现`conv5_3`的输出大小是`7, 7, 512`。因此——
236 |
237 | - `fc6`的输入是展开的`7 * 7 * 512`,输出大小是`4096`包含`4096, 7 * 7 * 512`维度的参数。**其等价卷积层`conv6`的卷积核大小是`7, 7`,输出通道数是`4096`,全连接层参数的形状将被改变为`4096, 7, 7, 512`**。
238 | - `fc7`的输入大小是`4096`(事实上就是`fc6`的输出),输出大小是`4096`包含`4096, 4096`维度的参数。**其等价卷积层`conv6`的卷积核大小是`1, 1`,输出通道数是`4096`,全连接层参数的形状将被改变为`4096, 1, 1, 4096`**。
239 |
240 | 我们发现`conv6`有`4096`个过滤器,每一个的大小是`7, 7, 512`,`conv7`有`4096`个过滤器,每一个大小是`1, 1, 4096`。
241 |
242 | 这些过滤器很繁杂,也很大——并且算力消耗大
243 |
244 | 为了改进这一点,作者选择**通过二次采样来减少过滤器的数量和每个过滤器的大小**,对于转换后的卷积层。
245 |
246 | - `conv6`将使用`1024`个过滤器,每个大小`3, 3, 512`。因此,参数从 `4096, 7, 7, 512`二次采样到 `1024, 3, 3, 512`。
247 | - `conv7`将使用`1024`个过滤器,每个大小`1, 1, 1024`。因此,参数从 `4096, 1, 1, 4096`二次采样到 `1024, 1, 1, 1024`。
248 |
249 | 基于论文中的引用,我们将**沿特定维度选择第`m`个参数来二次采样**,在处理中被称为[降采样](https://en.wikipedia.org/wiki/Downsampling_(signal_processing))。
250 |
251 | 由于`conv6`的卷积核是通过每隔`3`个值保留一个,从`7, 7`降采样到`3, 3`的,卷积核中现在有一些洞。因此我们需要让**卷积核膨胀(dilated / atrous)**。
252 |
253 | 这相当于一个`3`的膨胀(与降采样因子`m = 3`相同)。尽管如此,作者实际上采用的是一个`6`的膨胀,大概是因为第5个池化层并不减半之前特征图的维度
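
用PyTorch写出来大致如下(一个示意,具体以 `model.py` 中的实际代码为准)——

```python
import torch.nn as nn

# conv6:二次采样后的 3x3 卷积核,dilation 取 6;padding 设为 6 可保持特征图大小不变
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
# conv7:1x1 卷积
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
```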
254 |
255 | 我们现在处于搭建基础网络的位置,**VGG-16 修改版**。
256 |
257 | 
258 |
259 | ### 辅助卷积
260 |
261 | 现在我们将在**基础网络上叠加一些额外的卷积层**。这些卷积提供了额外的特征图,每个特征图都会逐渐变小
262 |
263 | 
264 |
265 | 图中展示了4个卷积块,每个块都有两层。在基础网络中的池化层也有减少维度的作用,而这里通过把每个块中第二层的步长设置为`2`促进这个过程。
266 |
267 | 同样地,请留意这些来自`conv8_2`, `conv9_2`, `conv10_2`, 和`conv11_2`的特征图。
268 |
269 | ### 思考
270 |
271 | 在我们进入预测卷积之前,我们首先需要了解我们在预测的是什么。很明显,是目标和目标所在的位置,*但它们是以什么形式给出的?*
272 |
273 | 现在我们得了解一些关于**预定位**和它在SSD中的关键作用
274 |
275 | #### 预定位
276 |
277 | 目标预测可以是十分多样的,并不只是指目标的种类。目标可以出现在任何位置,大小和形状都是任意的。当然,我们也不至于说目标出现在哪里、以何种方式出现有**无限**多种可能——尽管在数学上这是对的,但其中许多选择根本不太可能出现,也没有意义。更进一步来讲,我们不必要求边框在像素级别上是完美的。
278 |
279 | 事实上,我们能把潜在的预测空间从数学上减少到仅几千几万种可能。
280 |
281 | **框预定位是提前计算好的,也是固定的,它代表其中的所有可能和近似的边框预测**
282 |
283 | 在数据集中实实在在的目标的形状和大小上,预定位必须精挑细选。同样考虑到位置的多样性,我们把预定位放在特征图中的每一个可能的位置。
284 |
285 | 在预定位框的定义中,作者特别指出——
286 |
287 | - **这将会应用与各种各样的低维和高维特征图**,也就是那些来自 `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2` 和`conv11_2`的特征图。这些都是在之前图上表明的特征图
288 | - **如果预定位框有一个缩放量`s`,那么它的面积等于一个边长为`s`的正方形**,最大的特征图,`conv4_3`,其预定位的缩放量为`0.1`,也就是`10%`的图片维度,同样的,其余预定位的缩放量从`0.2`到`0.9`线性递增。正如你所想,最大的特征图的预定位缩放量更小,并且因此能够检测更小的物体
289 | - **在特征图的*每一个*位置,都会有各种各样的预定位框,这些预定位框有着不同的横纵比**。所有的特征图都会有如下横纵比的预定位框`1:1, 2:1, 1:2`。`conv7`, `conv8_2`, 和`conv9_2`中间层的特征图的预定位框将有*更多*的横纵比`3:1, 1:3` 。更进一步,所有特征图将有一个额外预定位框,其横纵比为`1:1`,其缩放量为当前特征图与后继特征图缩放的几何平均数。
290 |
291 | | 特征图来源 | 特征图大小 | 预定位框缩放量 | 横纵比 | 每个位置的预定位框数量 | 预定位框总数 |
292 | | :--------: | :--------: | :------------: | :----------------------------: | :--------------------: | :---------------: |
293 | | `conv4_3` | 38, 38 | 0.1 | 1:1, 2:1, 1:2 + 额外 | 4 | 5776 |
294 | | `conv7` | 19, 19 | 0.2 | 1:1, 2:1, 1:2, 3:1, 1:3 + 额外 | 6 | 2166 |
295 | | `conv8_2` | 10, 10 | 0.375 | 1:1, 2:1, 1:2, 3:1, 1:3 + 额外 | 6 | 600 |
296 | | `conv9_2` | 5, 5 | 0.55 | 1:1, 2:1, 1:2, 3:1, 1:3 + 额外 | 6 | 150 |
297 | | `conv10_2` | 3, 3 | 0.725 | 1:1, 2:1, 1:2 + 额外 | 4 | 36 |
298 | | `conv11_2` | 1, 1 | 0.9 | 1:1, 2:1, 1:2 + 额外 | 4 | 4 |
299 | | **总计** | – | – | – | – | **8732 预定位框** |
300 |
301 | SSD300中一共有8732个预定位框!
302 |
303 | #### 可视化预定位框
304 |
305 | 我们根据预定位框的*缩放量*和*横纵比*来定义预定位框
306 |
307 | *(译者注:w即width宽度,h即hight高度,s即scales被译为缩放量,a即aspect ratios被译为横纵比)*
308 |
309 | 
310 |
311 | 变换这些方程可以得到预定位框的维度`w`和`h`
312 |
313 | 
314 |
315 | 我们现在可以分别地在特征图上画出预定位框
316 |
317 | 例如,我们想要可视化`conv9_2`中心方块上的预定位框是什么样的
318 |
319 | 
320 |
321 | *(译者注:图中文字内容如下,在每个位置,有五个预定位框,其横纵比分别为1,2,3,1/2,1/3并且面积等于边长为0.55的正方形。另外,第六个预定位框的横纵比是1,边长为0.63)*
322 |
323 | 同样地可以看到其他方块上的预定位框
324 |
325 | 
326 |
327 | *(译者注:图中文字内容如下,当预定位框超过特征图的边界时,超出的部分会被剪掉)*
328 |
329 | #### 预测与预定位框的异同
330 |
331 | [之前](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#multibox),我们说我们将会使用回归去找到目标定位框的坐标。但是,预定位框不能作为最终的预测框吗?
332 |
333 | 显然它们不能。
334 |
335 | 我再次重申,预定位框*近似地*代表预测的可能性。
336 |
337 | 也就是说**我们把每个预定位框作为一个近似的起始点,然后找出需要调整多少才能获得更精确的定位框预测**
338 |
339 | 因此,每一个定位框与预定位有轻微的偏差,我们的目标就是去计算这个偏差,我们需要一个办法去测量或者说评估这个偏差
340 |
341 | 
342 |
343 | *(译者注:图中文字大致内容,预定位框的坐标和大小`c_x_hat, c_y_hat, w_hat, h_hat`;定位框坐标和大小`c_x, c_y, w, h`)*
344 |
345 | 假设我们用熟悉的中心坐标+大小的坐标表示
346 |
347 | 那么——
348 |
349 | 
350 |
351 | *(译者注:图中文字内容如下,定位框的位置和大小同样能够用与预定位框的偏移量来表示,这些被偏移量所表示的偏差代表预定位框接近定位框需要调整的量)*
352 |
353 | 这回答了我们在这部分[开始提出的问题](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#a-detour)。为了调整每一个预定位框去得到一个更精确的预测结果,**这四个偏移量`(g_c_x, g_c_y, g_w, g_h)`就是回归定位框坐标的形式**。
354 |
355 | 如你所想,每一个偏移量都由相应的预定位框的维度来归一化。这说得通,因为比起小的预定位框,对于更大的预定位框来说,某些偏移量可能不是那么重要。
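
写成代码大致如下(一个示意,忽略实际实现中可能使用的经验缩放因子)——

```python
import math

# 示意:由预定位框 (c_x_hat, c_y_hat, w_hat, h_hat) 与定位框 (c_x, c_y, w, h) 计算偏移量
def encode_offsets(c_x, c_y, w, h, c_x_hat, c_y_hat, w_hat, h_hat):
    g_c_x = (c_x - c_x_hat) / w_hat   # 中心偏移按预定位框的宽归一化
    g_c_y = (c_y - c_y_hat) / h_hat   # 中心偏移按预定位框的高归一化
    g_w = math.log(w / w_hat)         # 宽的偏移取对数比值
    g_h = math.log(h / h_hat)         # 高的偏移取对数比值
    return g_c_x, g_c_y, g_w, g_h
```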
356 |
357 | ### 预测卷积
358 |
359 | 在前面的部分中,我们定义了6个特征图的预定位框,其有着不同的缩放量和大小。也就是`conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2`, 和`conv11_2`中的预定位框
360 |
361 | 现在,**对于*每个*特征图上的*每个*位置,其中的*每一个*预定位框**,我们需要预测——
362 |
363 | - 定位框的**偏移量`(g_c_x, g_c_y, g_w, g_h)`**。
364 | - 在定位框中的一系列**`n_classes`分数**,其中`n_classes`表示目标的类别数(包含背景这个类别)
365 |
366 | 为了以最简单的方式做到这一点,**在每个特征图上我们需要两个卷积层**——
367 |
368 | - 一个**位置预测**卷积层,其含有`3, 3`的卷积核,在每个位置进行评估(也就是padding和stride参数为`1`),对于在该位置上的*每一个*预定位框都设有`4`个过滤器。
369 |
370 | 这`4`个过滤器负责计算偏移量 `(g_c_x, g_c_y, g_w, g_h)` ,该偏移量来自定位框与预定位框
371 |
372 | - 一个**类别预测**卷积层,其含有`3, 3`的卷积核,在每个位置进行评估(也就是padding和stride参数为1),对于该位置上的每一个预定位框都设有`n_classes`个过滤器。
373 |
374 | 这`n_classes`个过滤器负责计算该预定位框上的一系列`n_classes`分数
375 |
376 | 
377 |
378 |
379 |
380 |
381 |
382 | *(译者注:图片大意:*
383 |
384 | *左:位置卷积,对于特征图中每个位置上的每一个预定位框,预测出的定位框以偏移量的形式给出`g_c_x, g_c_y, g_w, g_h`。该层输出通道为 每个位置上的预定位框数 \* 4*
385 |
386 | *右:预测卷积,对于特征图上每个位置上的每一个预定位框,给出目标类别的得分,这表示的是框内是什么类别,如果预测结果为“background”,就说明框内无目标。该层输出通道为 每个位置的预定位框数 \* 目标类别数)*
387 |
388 | 我们将过滤器的卷积核大小设置为`3, 3`。
389 |
390 | 我们不太需要卷积核(或者说过滤器)与定位框的形状相同,因为不同的过滤器将会进行关于不同形状的预定位框的预测。
391 |
392 | 接下来我们看看**这些卷积的输出**。再次以`conv9_2`的特征图为例。
393 |
394 | 
395 |
396 | 定位卷积和预测卷积的在图中分别以蓝色和黄色表示。可以看到横截面没有改变(图中指灰色、蓝色、黄色三个部分前两维度相同)。
397 |
398 | 我们真正需要关心的是*第三个*维度,也就是通道数。其中包含了实际预测。
399 |
400 | 图中,如果你**选择任意一个定位预测方块并展开它**,你会看到什么?
401 |
402 | 
403 |
404 | *(译者注:图中文字大意:在每个位置的24个通道代表了6个定位框预测,也就是来自6个预定位框的6组的偏移量,每个偏移量包含四个结果`g_c_x, g_c_y, g_w, g_h`)*
405 |
406 | 同样的,我们对类别预测做同样的操作。假设`n_classes = 3`
407 |
408 | 
409 |
410 | *(译者注:文字大意:假设三种类别`(cat, dog, backgroud)`,在每个位置上的18个通道代表了6个预定位框的6组预测结果,每组预测结果包含3种得分`(cat, dog, bgd)`)*
411 |
412 | 与之前相同,这些通道代表在这个位置上定位框的得分
413 |
414 | 现在我们明白了来自`conv9_2`的特征图的预测结果是什么样的,我们可以把它**重塑为更方便处理的形状**
415 |
416 | 
417 |
418 | *(译者注:文字大意:对FM9_2预测结果重塑,让其表示150个定位框的偏移量和得分)*
419 |
420 | 但我们不仅仅停留在这一步,我们可以对*所有*层的预测结果做相同的处理,然后把他们叠在一起。
421 |
422 | 之前我们计算模型中出一共有8732个预定位框。因此,**一共有8732个定位框预测以偏移量的形式表示,并有8732组类别得分**
423 |
424 | 
425 |
426 | *(译者注:文字大意:重塑后的特征图,并将他们拼接在一起,一共有8732个定位框预测)*
427 |
428 | **这就是预测阶段的最终输出**,一系列定位框的叠加,如果你愿意,你可以估计一下里面是些啥。
429 |
430 | 我们已经到这里了,不是吗?如果这是你首次涉足目标检测,我想这便是星星之火。
431 |
432 | ### 多定位框损失
433 |
434 | 通过我们预测结果的本质,不难看出为什么我们需要这么一个独一无二的损失函数。许多人都在计算回归或者分类中算过损失,但是几乎没有人把这两种损失*结合*起来(如果有这种情况的话)
435 |
436 | 显然,我们的总损失必须是**两种预测损失的总和**——定位框位置和类别得分
437 |
438 | 接下来出现了这些问题
439 |
440 | > *定位框的回归问题采用什么损失函数?*
441 |
442 | > *类别得分的损失函数应该是交叉熵损失吗?*
443 |
444 | > *以何种比例将二者结合?*
445 |
446 | > *如何比较预测框的预测值与真实值*
447 |
448 | > *一共有8732个预测结果!不是大多数都不包含目标吗?我们也需要考虑这些结果吗?*
449 |
450 | 嗨,我们得继续干活了。
451 |
452 | #### 预测值与真实值的比较
453 |
454 | 记住,监督学习的要点是**我们需要比较预测值与真实值**。这非常棘手,因为目标检测比一般的机器学习任务更加不确定。
455 |
456 | 对于一个模型,学习*任何事物*我们都需要构造一个关于预测值与真实值比较的问题。
457 |
458 | 预定位框恰恰能做到这一点
459 |
460 | - **找到交并比**,这里交并比是指8732个预定位框与`N`个目标真实值的交并比,这是一个大小`8732, N`的张量(tensor)
461 | - 把8732个预定位框中的每一个,与和它重叠(交并比)最大的那个目标配对起来
462 | - 如果一个预定位框与目标配对后的交并比小于`0.5`,那么它就不“含有”目标,因此它就是一个***负*匹配项**。我们有成千上万的预定位框,对于一个目标,许多都将测出是一个负匹配项
463 | - 另一方面,少数预定位框与目标是**明显重叠(大于`0.5`)**的,可以认为其“含有”这个目标。他们就是***正*匹配项**
464 | - 现在我们有**8732个预定位框与一个真实值的配对**,事实上,我们同样有**相应的8732个预测值与1个真实值的配对**
465 |
466 | 让我们用一个例子来重新理解一个这个逻辑
467 |
468 | 
469 |
470 | 为了方便理解,假设只有7个预定位框,在图中以红色表示。目标在黄色方框内——这张图中有3个实实在在的目标。
471 |
472 | 根据之前的大概步骤,产生了下面的配对——
473 |
474 | 
475 |
476 | *(译者注:文字大意:*
477 |
478 | *左边:将每个预定位框与目标具有最大交并比的两项配对*
479 |
480 | *右边:如果一个预定位框与同其配对的目标的交并比大于0.5,他就会含有目标,并作为一个正匹配项。否则,他就是负匹配项,被分配到一个“background”标签)*
481 |
482 | 现在,每一个预定位框都有一个配对,他要么是正匹配项,要么是负匹配项。同理,每一个预测值都有一个配对,或正或负。
483 |
484 | 现在,与目标匹配为正的预测值具有实际的坐标,这些坐标将会作为定位的目标,也就是*回归*任务。自然地,负匹配项中就没有目标坐标。
485 |
486 | 所有的预测值都有一个标签,这个标签要么是目标类别(如果其为正匹配项),要么是*background*(如果其为负匹配项)。这些被当做**类别预测的目标**,也就是*分类*任务
487 |
488 | #### 定位损失
489 |
490 | 对于**负匹配项**,我们**没有真实坐标**。这很好理解,为什么要训练模型在空间中画这么多框呢?
491 |
492 | 正匹配项中预测定位框到真实坐标回归得怎么样,决定了定位损失。
493 |
494 | 我们预测的定位框是以偏移量的形式给出`(g_c_x, g_c_y, g_w, g_h)`,在进行损失计算前,或许还需要把真实坐标也这样编码。
495 |
496 | 定位损失是**Smooth L1**损失在正匹配项中编码后的定位框偏移量与其真实值之间的损失的平均值
497 |
498 | 
499 |
500 | #### 置信度损失
501 |
502 | 每一个预测值,无论正负,都有一个真实标签与其相关联。在模型识别是否有目标的时候这很重要。
503 |
504 | 无论如何,一张图片中的目标一只手都能数过来,**绝大多数我们的预测结果中不包含一个目标**。正如Walter White所说,*动作要轻*。如果正匹配项消失在茫茫负匹配项中,我们会以这个模型不太可能检测到目标结束,因为它往往学会了检测*background*类。
505 |
506 | 其解决办法很明显——限制参与损失计算的负匹配的数量。但是如何抉择?
507 |
508 | 好吧,为什么不使用模型*错*得最多那些负匹配项?换句话说,仅仅让那些模型认为难以识别这里没有目标的负匹配项。这叫**硬性负样本挖掘(Hard Negative Mining)**
509 |
510 | 假如说我们要使用的硬性负样本数量为`N_hn`,通常是这张图像正匹配数的固定倍数。在这种特定情况下,作者决定使用3倍的硬性负样本,也就是说`N_hn = 3 * N_p`。按单个负匹配项预测的交叉熵损失,找出损失最大的前`N_hn`个(top_k)负匹配项,这些就是最难的硬性负样本。
511 |
512 | 接下来,置信度损失就是简单的正匹配项,和硬性负样本匹配项**交叉熵损失**的求和
513 |
514 | 
515 |
516 | 请注意损失被正匹配项的数量平均
517 |
518 | #### 总损失
519 |
520 | **多定位框损失是两种损失的综合**,其结合因子为`α`。
521 |
522 | 
523 |
524 | 通常来说,我们不需要决定`α`的值。它是一个可学习的参数。
525 |
526 | 无论如何,对于SSD,作者仅仅只是取`α = 1`,也就是两种损失直接相加。我们也就这样做吧!
527 |
528 | ### 预测结果处理
529 |
530 | 模型在训练之后就能给它喂图片了。但是出来的预测结果是原始形式——两个张量分别表示预定位框的偏移量和类别得分。需要将这些数据处理为**人能看懂的边框和标签的最终形式**。
531 |
532 | 这就需要接下来的操作——
533 |
534 | - 我们有8732个预测定位框,其形式以相应的预定位框偏移量表示`(g_c_x, g_c_y, g_w, g_h)`。把他们解码回边界坐标,也就是能直接看懂的坐标。
535 | - 接下来,对于每个*非background*类。
536 | - 提取出8732个框中每个框的得分。
537 | - 去掉那些分数没达到某个阈值的的预测定位框。
538 | - 剩下的定位框就是该类目标的候选结果
539 |
540 | 到这里,如果你把这些预测定位框在原图上画出来,你能看到**许多高度重叠的边框,它们明显是冗余的**。这是因为在成千上万个预定位框中,极有可能不止一个预测结果指向的是同一个目标。
541 |
542 | 例如下面这张图
543 |
544 | 
545 |
546 | *(译者注:文字大意:通常有好几个预测结果指向同一个目标)*
547 |
548 | 图上清晰可见的三个目标——两只狗和一只猫。但是根据模型给出的结果,其中有*三*只狗和*两*只猫。
549 |
550 | 注意,这只是一个简单的例子。更糟糕的情况是超级多的框。
551 |
552 | 现在,对于你来说,哪些框指向的是同一个目标很明显,因为你能认出特定目标上的那些框彼此十分相似。
553 |
554 | 事实上,这是如何做到的呢?
555 |
556 | 首先,**按照*置信度*排列每个类的候选结果**
557 |
558 | 
559 |
560 | 通过其得分来排列他们。
561 |
562 | 下一步是找出哪些候选结果是冗余的。在我们的处理过程中,我们已经有了一个工具可以判断两个框的相似度——交并比。
563 |
564 | 因此,**列出所有候选框在给定类别上的交并比**,我们就能评比每一对结果,并**确定它们是否明显重叠,保留置信度更高的候选结果**
565 |
566 | 
567 |
568 | *(译者注:图片大意:*
569 |
570 | *上:如果两个“狗”的候选结果之间交并比>0.5,它们极有可能是同一只狗!抑制所有行列中置信度更小的候选结果——dog C 就被淘汰了。*
571 |
572 | *下:同样的,cat B 肯定与分数更高的 cat A 是同一只猫,并且它被抑制了。)*
573 |
574 | 这样,我们就消除了候选结果中的离群点——每种动物都有一只。
575 |
576 | 这种处理叫做**非最大抑制(NMS)**,当发现多个候选结果明显相互重叠时,他们有可能指的是同一个目标,**我们抑制了分数最高结果以外的所有候选结果**。
577 |
578 | 从算法上来讲,做如下处理——
579 |
580 | - 当选择每一个*非background*类别的候选结果时,
581 | - 按照置信度递减排列这个类别的候选结果。
582 | - 注意分数最高的候选结果。丢掉与该结果交并比高于某个值,比如说`0.5`,且得分较低的候选结果。
583 | - 注意分数第二高的候选结果还在。丢掉与该结果交并比高于某个值,比如说`0.5`,且得分较低的候选结果。
584 | - 重复直到遍历完整个候选结果队列。
585 |
586 | 最终结果就是,对于图片上的每个目标,你只会得到一个边框——最好的那个。
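
下面是上述流程的一个简化示意(纯Python伪实现,仅供理解;实际实现见 `model.py` 中的 `detect_objects()`)——

```python
def nms(candidates, iou, max_overlap=0.5):
    """candidates: 同一类别的 (box, score) 列表;iou(a, b) 用于计算两个框的交并比。"""
    # 按置信度从高到低排列候选结果
    candidates = sorted(candidates, key=lambda bs: bs[1], reverse=True)
    kept = []
    for box, score in candidates:
        # 只有当当前候选与所有已保留的(得分更高的)候选都不明显重叠时,才保留它
        if all(iou(box, k) <= max_overlap for k, _ in kept):
            kept.append((box, score))
    return kept
```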
587 |
588 | 
589 |
590 | 非最大抑制对获得高质量检测结果来说非常重要。
591 |
592 | 高兴的是,这同样是最后一步。
593 |
594 | # 实现
595 |
596 | 这一部分简要地描述了实现方法。
597 |
598 | 仅供参考,**细节最好从代码中直接理解**,代码注释很详细。
599 |
600 | ### 数据集
601 |
602 | 我们使用VOC2007和2012的数据集
603 |
604 | #### 说明
605 |
606 | 这个数据集包含了20种不同的目标。
607 |
608 | ```python
609 | {'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'}
610 | ```
611 |
612 | 每张图片包含了一个及以上的目标类别。
613 |
614 | 每个目标的表示——
615 |
616 | - 绝对边界坐标表示的定位框
617 | - 标签(上述目标类别的一个)
618 | - 一个感知检测难度(要么`0`,要么`1`,`0`表示不*困难*,`1`表示*困难*)
619 |
620 | #### 下载
621 |
622 | 显然,你需要下载这些数据集
623 |
624 | - [2007 _训练集_](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar) (460MB)
625 |
626 | - [2012 _训练集_](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2GB)
627 |
628 | - [2007 _测试集_](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar) (451MB)
629 |
630 | 与论文中一致,这两个*训练集*用于训练数据,VOC 2007 *测试集*将作为我们的测试数据
631 |
632 | 确保你把VOC 2007 *训练集*和2007 *测试集*提取到同一位置,也就是说,把他们合并起来。
633 |
634 | ### 喂数据
635 |
636 | 我们将需要三种数据。
637 |
638 | #### 图片
639 |
640 | 因为我们用的是SSD300这个变体,图片需要被转换到RGB模式下`300, 300`的尺寸。
641 |
642 | 还记得吗,我们用的是基于ImageNet预训练的VGG-16基础模型,它已经内置在了PyTorch的`torchvision`模块。[这个界面](https://pytorch.org/docs/master/torchvision/models.html)详细说明了使用这个模型我们需要做的转换和预处理——像素值必须在[0,1]之间,并且我们必须使用ImageNet图片RGB通道的平均值和标准差来标准化这些值。*(译者注:此处的标准化与正态分布的标准化操作相同)*
643 |
644 | ```python
645 | mean = [0.485, 0.456, 0.406]
646 | std = [0.229, 0.224, 0.225]
647 | # 译者注:mean表示平均值,std表示标准差
648 | ```
649 |
650 | PyTorch也遵循NCHW准则,这意味着通道维度(C)必须在尺寸维度(H, W)之前。
651 |
652 | 因此,喂给模型的图片必须是一个`N, 3, 300, 300 `维度的`Float`张量,并且必须通过上述的平均值和标准差来标准化。`N`就是批次大小。
653 |
654 | #### 目标的定位框
655 |
656 | 对于每一张图片,我们需要提供其中呈现的实实在在的目标的定位框的坐标,这些坐标以分数形式给出`(x_min, y_min, x_max, y_max)`。
657 |
658 | 由于在任意给出的图片中,目标的数量会改变,我们不能使用一个固定大小的张量来存储整个批次`N`张图的定位框。
659 |
660 | 因此,**喂给模型的实实在在的定位框应当是一个长度为`N`的列表,这个列表中的每个元素都是一个`N_o, 4`维度的`Float`张量**,其中`N_o`是某张图上所呈现的目标总数。
661 |
662 | #### 目标标签
663 |
664 | 对于每张图片,我们需要提供其中呈现的实实在在的标签。
665 |
666 | 每个标签都需要编码为一个整数,从`1`到`20`,代表着20种不同的目标类别。此外,我们将添加一个*background*类别,其索引为`0`,表示定位框中没有目标。(一般情况下,这个标签不会用来表示数据集中任何实实在在的目标。)
667 |
668 | 同样的,由于在任意给出的图片中,目标的数量会改变,我们不能使用一个固定大小的张量来存储整个批次`N`张图的标签。
669 |
670 | 因此,**喂给模型的实实在在的标签应当是一个长度为`N`的列表,这个列表中的每个元素都是一个`N_o`维度的`Long`张量**,其中`N_o`是某张图上所呈现的目标总数。
671 |
672 | ### 数据加工
673 |
674 | 你知道的,我们的数据被分为了*训练集*和*测试集*
675 |
676 | #### 分析原始数据
677 |
678 | 你能在 [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) 中的 `create_data_lists()` 看到。
679 |
680 | 分析下载下来的数据并保存下面这些文件——
681 |
682 | - 一个**JSON文件,对于每个子集,其包含了`I`张图绝对路径的列表**,其中`I`指该子集中图片的总数。
683 | - 一个**JSON文件,对于每个子集,其包含了`I`个字典的列表,字典中包含实实在在的目标,也就是定位框的绝对边界坐标形式,其编码后的标签,和感知检测难度**。列表中的第`i`个字典将包含上一个JSON文件中第`i`张图上所呈现的目标。
684 | - 一个**JSON文件,其中包含了`label_map`**,标签到索引的字典,在上一个JSON文件中用该字典来对标签编码。这个字典可以通过 [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) 得到,并可以直接导入。
685 |
686 | #### PyTorch数据集
687 |
688 | 你能在 [`datasets.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/datasets.py) 中的`PascalVOCDataset` 看到
689 |
690 | 这是一个Pytorch [`Dataset`](https://pytorch.org/docs/master/data.html#torch.utils.data.Dataset) 的子类,用来**定义我们的训练集和测试集**。他需要定义一个`__len__`方法,其返回数据集的大小,还要定义一个`__getitem__`方法来返回第`i`张图片、这张图中的定位框,和这张图中目标的标签,这个标签可以通过使用我们之前保存的JSON文件。
691 |
692 | 你会注意到它同样返回了每个目标的感知检测难度,但其并不会真正的用于训练模型。仅在[评估](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#evaluation)阶段我们才会用它来计算平均精度(mAP)。我们可以选择完全过滤掉数据集中*困难*的目标,通过牺牲精度来加速训练。
693 |
694 | 此外,在这些类别中,**每张图片和其中的目标都经过了一系列变换**,这个变换在论文中有讲,大致如下。
695 |
696 | #### 数据变换
697 |
698 | 你可以在 [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) 中的 `transform()` 看到
699 |
700 | 这个函数对图片和其中的目标做如下变换——
701 |
702 | - 随机**调整亮度,对比度,饱和度和色调**,每次都有50%的概率并且是按照随机顺序变换。
703 | - 50%的概率对图片**施加一个缩小操作**。这有助于学习检测小目标。缩小后的图片必须在原始图像的`1`到`4`倍之间。周围的空间可以用ImageNet数据的平均值填充。
704 | - 随机裁剪图片,也就是**施加一个放大操作**。这有助于学习检测大目标或只露出一部分的目标,有些目标甚至会被完全剪掉。裁剪尺寸应当在原图的`0.3`到`1`倍之间,裁剪部分的横纵比应当在`0.5`到`2`之间。每次裁剪应当满足:至少剩下一个定位框,并且该定位框与裁剪后图片的交并比不低于`0, 0.1, 0.3, 0.5, 0.7, 0.9`中随机选定的一个。此外,裁剪后中心已不在图片内的定位框会被丢弃。当然,同样也有一定概率完全不裁剪图片。
705 | - 50%的概率水平翻转图片。
706 | - 将图片**重塑(resize)**到`300, 300`像素。SSD300要求这样。
707 | - 将所有定位框**从绝对坐标转换为分数边界坐标**。在我们模型的所有阶段,所有的边界坐标和中心-大小坐标都将是他们的分数形式。
708 | - 用ImageNet数据的平均值和标准差**标准化**图片,ImageNet是用于预训练VGG base的数据。
709 |
710 | 正如论文中所提到的,这些变换在取得既定结果上有着至关重要的作用。
711 |
712 | #### Pytorch DataLoader
713 |
714 | 之前说的`Dataset`就是`PascalVOCDataset`,将在 `train.py` 通过 PyTorch [`DataLoader`](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) 来**创建多批数据来喂给模型**,这几批数据用于训练或是评估。
715 |
716 | 因为不同图片上的目标数量不同,他们的定位框,标签,和难度不能单纯地叠加进批次中。这会把哪些目标属于哪些图片混为一谈。
717 |
718 | 相反,我们需要在构建`DataLoader`时**在`collate_fn`参数中传入一个整理函数**,这个函数是关于如何把这些大小不定的张量结合起来的。最好的选择或许是Python的列表。
719 |
720 | ### 基础卷积
721 |
722 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `VGGBase` 看到。
723 |
724 | 这里,我们**创建并应用基础卷积**
725 |
726 | 这几层是通过VGG-16的参数初始化的,这个初始化在`load_pretrained_layers()` 方法中。
727 |
728 | 我们要特别注意低维特征图,这些特征图是来自`conv4_3`和`conv7`的结果,我们把它返回出来供之后的阶段使用。
729 |
730 | ### 辅助卷积
731 |
732 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `AuxiliaryConvolutions` 看到。
733 |
734 | 这里,我们**创建并应用辅助卷积**
735 |
736 | 使用[Xavier初始化(uniform Xavier initialization)](https://pytorch.org/docs/stable/nn.html#torch.nn.init.xavier_uniform_)来初始化这几层的参数。
737 |
738 | 我们要特别注意高维特征图,这些特征图是来自`conv8_2`, `conv9_2`, `conv10_2` 和`conv11_2`的结果,我们把它返回出来供之后的阶段使用。
739 |
740 | ### 预测卷积
741 |
742 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `PredictionConvolutions` 看到。
743 |
744 | 这里,我们**创建并应用定位卷积和类别预测卷积**,输入是来自 `conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2` 和`conv11_2`的特征图。
745 |
746 | 这几层与辅助卷积的初始化方式相同。
747 |
748 | 我们还会**重塑预测结果图并拼接他们**,就像我们说过的那样。注意,当且仅当原来的张量存储内存中[相邻](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.contiguous)的块时,重塑(reshaping)在PyTorch中才能使用。
749 |
750 | 不出意外的话,拼接后的位置预测和类别预测的维度分别是`8732, 4`和`8732, 21`。
751 |
752 | ### 组装模型
753 |
754 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `SSD300` 看到。
755 |
756 | 这里,基础、辅助和预测卷积都会结合起来,形成SSD。
757 |
758 | 有个小细节——最低层的特征,也就是来自`conv4_3`的特征,与更高层的特征相比,处在一个明显不同的数值尺度上。因此,作者推荐对它的*每个*通道做L2归一化,并以一个可学习的值重新缩放。
759 |
760 | ### 预定位框
761 |
762 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `SSD300` 下的`create_prior_boxes()`看到。
763 |
764 | 这个函数**以中心-大小坐标创建预定位框**,特征图上的预定位框创建顺序为`conv4_3`, `conv7`, `conv8_2`, `conv9_2`, `conv10_2` and `conv11_2`。并且,对于每一个特征图,我们按行遍历其中的每一块。
765 |
766 | 按这种顺序得到了8732个预定位框非常重要,因为他需要与叠加后的预测结果顺序相同。
767 |
768 | ### 多定位框损失
769 |
770 | 你能在 [`model.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/model.py) 中的 `MultiBoxLoss` 看到。
771 |
772 | 对于每张图片上的8732个预测定位框,创建了两个空张量来存储定位和类别预测的目标,也就是*真值*。
773 |
774 | 我们**通过每个预定位框交并比的最大值找到目标真值**,把它们存入`object_for_each_prior`。
775 |
776 | 我们希望避免一种罕见的情况,并非所有的真值都被匹配到。因此,我们还需要找到与每个真值有最大交并比的预定位框,把它存入`prior_for_each_object`。我们明确的把这些匹配项添加到`object_for_each_prior`中,并手动设置它们的重叠度高于阈值,这样它们就不会被扔掉。
777 |
778 | 对于8732个预定位框的每一个,基于 `object_for_each_prior`中的匹配项,我们为其设置相应的标签,也就是**类别预测的目标**。对于那些与其所匹配的目标重叠不明显的预定位框,将标签设置为*background*。
779 |
780 | 此外,我们还要把8732个预定位框在 `object_for_each prior`中的坐标编码为偏移量的形式 `(g_c_x, g_c_y, g_w, g_h)` 来构成**定位的目标**。并非所有的8732个定位目标都是有意义。就像我们之前讨论的那样。仅仅只有那些来自非背景的预定位框会用来拟合他们的目标。
781 |
782 | 定位损失是在正匹配项上的[平滑L1损失](https://pytorch.org/docs/stable/nn.html#torch.nn.SmoothL1Loss)。
783 |
784 | 接下来做硬性负样本挖掘——按照各自的[交叉熵损失](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss),给与*background*匹配的类别预测(也就是负匹配项)排序。**置信度损失**就是正匹配项与最难的负匹配项的交叉熵损失之和,但这个损失仅仅只被*正*匹配项的数量所平均。
785 |
786 | **多定位框损失就是两种损失的综合**,按照一个比例`α`结合,在我们这里,简单的取`α = 1`让他们相加。
787 |
788 | # 训练
789 |
790 | 开始之前,确保你保存了训练和测试需要的文件。可以运行[`create_data_lists.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/create_data_lists.py),并指定其中的 `VOC2007` 和 `VOC2012` 的数据集文件夹,[下载数据](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#download)已经说明了这些数据集。
791 |
792 | 接下来看**[`train.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/train.py)**
793 |
794 | 这个模型的参数(和训练)在文件的开头,因此你能非常轻松地根据你的需要检查和修改。
795 |
796 | 运行这个文件以**从头开始训练模型**——
797 |
798 | `python train.py`
799 |
800 | 在 `checkpoint` 参数中指定相应的文件以**将模型恢复到一个检查点(checkpoint)**
801 |
802 | ### 注意
803 |
804 | 在论文中,他们建议在32张图一个批次的情况下,使用**随机梯度下降法(Stochastic Gradient Descent,SGD)**,初始学习率`1e-3`,动量(momentum)`0.9`,和`5e-4`的权重衰减。
805 |
806 | 为了提高稳定性,我最终使用`8`张图一个批次。如果你发现梯度爆炸,你可以像我一样减小批次大小,也可以对梯度进行裁剪。
807 |
808 | 作者还将偏置项的学习率提高了一倍。你在代码中也能看见,这在PyTorch中很容易做到:把[单独的参数组](https://pytorch.org/docs/stable/optim.html#per-parameter-options)传入[SGD 优化器(optimizer)](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)的`params`参数即可。
809 |
810 | 论文推荐以初始学习率训练迭代80000次,然后将学习率减小90%(也就是降为原来的十分之一)再额外训练迭代20000次,*如此两次*。在论文`32`批次大小的情况下,这意味着学习率在第154个epoch后减小90%,并在第193个epoch后再次减小90%,最后训练停止在第232个epoch。我遵循了这个安排。
811 |
812 | 在TitanX(Pascal)上,每个epoch训练需要大约6min。*(译者注:TitanX(Pascal)是一款GPU)*
813 |
814 | 我需要指出,本教程的读者让我注意到两个与论文之间的意外差异:
815 |
816 | - 我的预定位框超出图片边缘的部分没有被裁剪,由 *@AakiraOtok* 在[#94](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/94)中指出。然而,正如该issue下的讨论以及同一位读者在[#95](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/95)中的验证所示,这似乎对模型效果没有负面影响,甚至有可能带来轻微的提升,但幅度太小,难以下定论。
817 | - 在定位误差中我错误的使用了L1损失。而不是*平滑*L1损失,由_jonathan016_在[#60](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/60) 中提出。 正如issue中所指出的那样,这同样对模型效果没有负面影响,但在更大的批次大小中,*平滑*L1损失或许能提供更好的训练稳定性,正如[这个评论](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/issues/94#issuecomment-1590217018)所说。
818 |
819 | ### 模型检查点
820 |
821 | 你可以在这里[下载](https://drive.google.com/open?id=1bvJfF6r_zYl2xZEpYXxgb7jLQHFZ01Qe)预训练模型
822 |
823 | 注意这个检查点应当[直接由Pytorch加载](https://pytorch.org/docs/stable/torch.html?#torch.load)用于评估或是推理——如下所示
824 |
825 | # 评估
826 |
827 | 对应 [`eval.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/eval.py)。
828 |
829 | 用于评估的数据加载和检查点参数在这个文件的开头,因此你能非常轻松地根据你的需要检查和修改。
830 |
831 | 要开始评估,用数据加载器(data-loader)和模型检查点运行一下`evaluate()`这个函数就行了。**测试集每张图的原始预测结果**都可以通过检查点的`detect_objects()`方法**获得并解析**,其已经在[这个程序](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#processing-predictions)中实现了。评估会以`min_score`为`0.01`,非最大抑制的`max_overlap`为`0.45`并且`top_k`为`200`来保证结果的公平比较,就像论文和其他实现那样。
832 |
833 | **根据准确值评估解析后的预测值。**其评估方法为*平均精确率(mAP)*。如果你不熟悉这个方法,[这篇文章](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173)能让你茅塞顿开。
834 |
835 | 我们使用[`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py)中的`calculate_mAP()`来完成这一点。按照惯例,在mAP的计算中,我们将忽略*困难*的检测。但把它们包含在评估数据集中很重要,因为如果模型检测出了一个被认为是*困难*的目标,这个检测一定不能被算作假正例(false positive)。
836 |
837 | 模型的得分是**77.2mAP**,与论文中说的相同。
838 |
839 | 按照类别划分的平均准确率(没有按照比例缩放到100)在下面列出。
840 |
841 | | 类别 | 平均准确率 |
842 | | :-----------: | :-----------------: |
843 | | _aeroplane_ | 0.7887580990791321 |
844 | | _bicycle_ | 0.8351995348930359 |
845 | | _bird_ | 0.7623348236083984 |
846 | | _boat_ | 0.7218425273895264 |
847 | | _bottle_ | 0.45978495478630066 |
848 | | _bus_ | 0.8705356121063232 |
849 | | _car_ | 0.8655831217765808 |
850 | | _cat_ | 0.8828985095024109 |
851 | | _chair_ | 0.5917483568191528 |
852 | | _cow_ | 0.8255912661552429 |
853 | | _diningtable_ | 0.756867527961731 |
854 | | _dog_ | 0.856262743473053 |
855 | | _horse_ | 0.8778411149978638 |
856 | | _motorbike_ | 0.8316892385482788 |
857 | | _person_ | 0.7884440422058105 |
858 | | _pottedplant_ | 0.5071538090705872 |
859 | | _sheep_ | 0.7936667799949646 |
860 | | _sofa_ | 0.7998116612434387 |
861 | | _train_ | 0.8655905723571777 |
862 | | _tvmonitor_ | 0.7492395043373108 |
863 |
864 | 你可以发现,像bottles和potted plants这样的东西比其他类别检测起来难得多。
865 |
866 | # 推理
867 |
868 | 见 [`detect.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/detect.py)。
869 |
870 | 在代码的开头,用`checkpoint`参数指定你想要用于推理的模型。
871 |
872 | 接下来,你可以使用`detect()`函数来识别并可视化一张RGB图片中的目标。
873 |
874 | ```python
875 | img_path = '/path/to/ima.ge'
876 | original_image = PIL.Image.open(img_path, mode='r')
877 | original_image = original_image.convert('RGB')
878 |
879 | detect(original_image, min_score=0.2, max_overlap=0.5, top_k=200).show()
880 | ```
881 |
882 | 这个函数首先按照模型的要求,**通过调整图片大小并标准化其RGB通道来预处理图片**。然后获取**模型的原始预测结果**,并通过模型的`detect_objects()`方法**解析**。解析后的结果将从分数坐标转换为绝对边界坐标,其标签通过`label_map`解码,最后**在图片上可视化**。
883 |
884 | `min_score`、`max_overlap`和`top_k`都没有一个通用的取值,你可能需要做一些实验,去找到最适合你的目标数据的组合。
885 |
886 | ### 更多例子
887 |
888 | ---
889 |
890 |
891 |
892 |
893 |
894 | ---
895 |
896 |
897 |
898 |
899 |
900 | ---
901 |
902 |
903 |
904 |
905 |
906 | ---
907 |
908 |
909 |
910 |
911 |
912 | ---
913 |
914 |
915 |
916 |
917 |
918 | ---
919 |
920 |
921 |
922 |
923 |
924 | ---
925 |
926 |
927 |
928 |
929 |
930 | ---
931 |
932 |
933 |
934 |
935 |
936 | ---
937 |
938 |
939 |
940 |
941 |
942 | ---
943 |
944 |
945 |
946 |
947 |
948 | ---
949 |
950 |
951 |
952 |
953 |
954 | ---
955 |
956 | # FAQs
957 |
958 | **我注意到预定位框常常超出预测卷积中`3, 3`卷积核的范围。卷积核是如何检测到它范围之外的(目标)边界的?**
959 |
960 | 不要把卷积核和它的*感受野*(receptive field)混为一谈,感受野是指原始图像中处于卷积核视野之内的区域。
961 |
962 | 例如,在来自`conv4_3`的`38, 38`的特征图上,一个`3, 3`的卷积核覆盖了`0.08, 0.08`的分数坐标区域。那么其中的预定位框就是 `0.1, 0.1`, `0.14, 0.07`, `0.07, 0.14`和 `0.14, 0.14`。
963 |
964 | 但它的感受野([是可以计算的](https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807))是惊人的`0.36, 0.36`!因此,所有预定位框(以及其中包含的目标)都完全处于它的内部。
965 |
966 | 注意,感受野会随着每一次连续的卷积而增大。对于`conv_7`以及更高层的特征图,一个`3, 3`卷积核的感受野将覆盖*整张*`300, 300`的图片。但和往常一样,原图中靠近卷积核中心的像素有更大的权重,所以从某种意义上说它仍然是*局部的*。
967 |
968 | ---
969 |
970 | **训练的时候,为什么不能直接预测定位框与目标直接配对呢?**
971 |
972 | 我们不能直接通过预测定位框和目标的重叠或者重合来直接匹配它们,因为预测定位框被认为是不可靠的,*特别*是训练过程中。这正是我们试图首先评估它们的原因!
973 |
974 | 这也是预定位框特别有用的原因。我们可以借助预测框所要逼近的那个预定位框,把预测框与目标真值配对。这样一来,预测本身是基本正确还是错得离谱,都不再重要。
975 |
976 | ---
977 |
978 | **既然我们只检查哪些*非background*类别达到了阈值,为什么还需要一个*background*类别?**
979 |
980 | 当预定位框的近似区域内没有目标时,*background*的高分会稀释其他类别的得分,使它们达不到检测阈值。
981 |
982 | ---
983 |
984 | **为什么不简单地选择最高得分的类别而不是使用一个阈值**?
985 |
986 | 我认为这是一个有效的策略,毕竟我们在训练模型时,通过交叉熵损失要求模型只选择*一个*类别。但你会发现,你不会达到和使用阈值一样的效果。
987 |
988 | 我怀疑这是因为目标检测是足够开放的,以至于在训练模型中对于预定位框中的内容存在疑问。例如,如果预定位框内有大量背景可见,*background*的分数可能很高。甚至可能有多种目标在同一个近似域呢。一个简单的阈值将得出所有我们考虑的可能,它就是这样效果更好。
989 |
990 | 冗余的检测并不真的是一个问题,因为我们用NMS送走了他们。
991 |
992 | ---
993 |
994 | **虽然但是...*[盒盒盒盒子里面是啥?!](https://cnet4.cbsistatic.com/img/cLD5YVGT9pFqx61TuMtcSBtDPyY=/570x0/2017/01/14/6d8103f7-a52d-46de-98d0-56d0e9d79804/se7en.png)***
995 |
996 | 哈
997 |
998 |
999 |
1000 |
--------------------------------------------------------------------------------
/create_data_lists.py:
--------------------------------------------------------------------------------
1 | from utils import create_data_lists
2 |
3 | if __name__ == '__main__':
4 | create_data_lists(voc07_path='/media/ssd/ssd data/VOC2007',
5 | voc12_path='/media/ssd/ssd data/VOC2012',
6 | output_folder='./')
7 |
--------------------------------------------------------------------------------
/datasets.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch.utils.data import Dataset
3 | import json
4 | import os
5 | from PIL import Image
6 | from utils import transform
7 |
8 |
9 | class PascalVOCDataset(Dataset):
10 | """
11 | A PyTorch Dataset class to be used in a PyTorch DataLoader to create batches.
12 | """
13 |
14 | def __init__(self, data_folder, split, keep_difficult=False):
15 | """
16 | :param data_folder: folder where data files are stored
17 | :param split: split, one of 'TRAIN' or 'TEST'
18 | :param keep_difficult: keep or discard objects that are considered difficult to detect?
19 | """
20 | self.split = split.upper()
21 |
22 | assert self.split in {'TRAIN', 'TEST'}
23 |
24 | self.data_folder = data_folder
25 | self.keep_difficult = keep_difficult
26 |
27 | # Read data files
28 | with open(os.path.join(data_folder, self.split + '_images.json'), 'r') as j:
29 | self.images = json.load(j)
30 | with open(os.path.join(data_folder, self.split + '_objects.json'), 'r') as j:
31 | self.objects = json.load(j)
32 |
33 | assert len(self.images) == len(self.objects)
34 |
35 | def __getitem__(self, i):
36 | # Read image
37 | image = Image.open(self.images[i], mode='r')
38 | image = image.convert('RGB')
39 |
40 | # Read objects in this image (bounding boxes, labels, difficulties)
41 | objects = self.objects[i]
42 | boxes = torch.FloatTensor(objects['boxes']) # (n_objects, 4)
43 | labels = torch.LongTensor(objects['labels']) # (n_objects)
44 | difficulties = torch.ByteTensor(objects['difficulties']) # (n_objects)
45 |
46 | # Discard difficult objects, if desired
47 | if not self.keep_difficult:
48 | boxes = boxes[1 - difficulties]
49 | labels = labels[1 - difficulties]
50 | difficulties = difficulties[1 - difficulties]
51 |
52 | # Apply transformations
53 | image, boxes, labels, difficulties = transform(image, boxes, labels, difficulties, split=self.split)
54 |
55 | return image, boxes, labels, difficulties
56 |
57 | def __len__(self):
58 | return len(self.images)
59 |
60 | def collate_fn(self, batch):
61 | """
62 | Since each image may have a different number of objects, we need a collate function (to be passed to the DataLoader).
63 |
64 | This describes how to combine these tensors of different sizes. We use lists.
65 |
66 | Note: this need not be defined in this Class, can be standalone.
67 |
68 | :param batch: an iterable of N sets from __getitem__()
69 | :return: a tensor of images, lists of varying-size tensors of bounding boxes, labels, and difficulties
70 | """
71 |
72 | images = list()
73 | boxes = list()
74 | labels = list()
75 | difficulties = list()
76 |
77 | for b in batch:
78 | images.append(b[0])
79 | boxes.append(b[1])
80 | labels.append(b[2])
81 | difficulties.append(b[3])
82 |
83 | images = torch.stack(images, dim=0)
84 |
85 | return images, boxes, labels, difficulties # tensor (N, 3, 300, 300), 3 lists of N tensors each
86 |
--------------------------------------------------------------------------------
/detect.py:
--------------------------------------------------------------------------------
1 | from torchvision import transforms
2 | from utils import *
3 | from PIL import Image, ImageDraw, ImageFont
4 |
5 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
6 |
7 | # Load model checkpoint
8 | checkpoint = 'checkpoint_ssd300.pth.tar'
9 | checkpoint = torch.load(checkpoint)
10 | start_epoch = checkpoint['epoch'] + 1
11 | print('\nLoaded checkpoint from epoch %d.\n' % start_epoch)
12 | model = checkpoint['model']
13 | model = model.to(device)
14 | model.eval()
15 |
16 | # Transforms
17 | resize = transforms.Resize((300, 300))
18 | to_tensor = transforms.ToTensor()
19 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
20 | std=[0.229, 0.224, 0.225])
21 |
22 |
23 | def detect(original_image, min_score, max_overlap, top_k, suppress=None):
24 | """
25 | Detect objects in an image with a trained SSD300, and visualize the results.
26 |
27 | :param original_image: image, a PIL Image
28 | :param min_score: minimum threshold for a detected box to be considered a match for a certain class
29 | :param max_overlap: maximum overlap two boxes can have so that the one with the lower score is not suppressed via Non-Maximum Suppression (NMS)
30 | :param top_k: if there are a lot of resulting detection across all classes, keep only the top 'k'
31 | :param suppress: classes that you know for sure cannot be in the image or you do not want in the image, a list
32 | :return: annotated image, a PIL Image
33 | """
34 |
35 | # Transform
36 | image = normalize(to_tensor(resize(original_image)))
37 |
38 | # Move to default device
39 | image = image.to(device)
40 |
41 | # Forward prop.
42 | predicted_locs, predicted_scores = model(image.unsqueeze(0))
43 |
44 | # Detect objects in SSD output
45 | det_boxes, det_labels, det_scores = model.detect_objects(predicted_locs, predicted_scores, min_score=min_score,
46 | max_overlap=max_overlap, top_k=top_k)
47 |
48 | # Move detections to the CPU
49 | det_boxes = det_boxes[0].to('cpu')
50 |
51 | # Transform to original image dimensions
52 | original_dims = torch.FloatTensor(
53 | [original_image.width, original_image.height, original_image.width, original_image.height]).unsqueeze(0)
54 | det_boxes = det_boxes * original_dims
55 |
56 | # Decode class integer labels
57 | det_labels = [rev_label_map[l] for l in det_labels[0].to('cpu').tolist()]
58 |
59 | # If no objects found, the detected labels will be set to ['0.'], i.e. ['background'] in SSD300.detect_objects() in model.py
60 | if det_labels == ['background']:
61 | # Just return original image
62 | return original_image
63 |
64 | # Annotate
65 | annotated_image = original_image
66 | draw = ImageDraw.Draw(annotated_image)
67 | font = ImageFont.truetype("./calibril.ttf", 15)
68 |
69 | # Suppress specific classes, if needed
70 | for i in range(det_boxes.size(0)):
71 | if suppress is not None:
72 | if det_labels[i] in suppress:
73 | continue
74 |
75 | # Boxes
76 | box_location = det_boxes[i].tolist()
77 | draw.rectangle(xy=box_location, outline=label_color_map[det_labels[i]])
78 | draw.rectangle(xy=[l + 1. for l in box_location], outline=label_color_map[
79 | det_labels[i]]) # a second rectangle at an offset of 1 pixel to increase line thickness
80 | # draw.rectangle(xy=[l + 2. for l in box_location], outline=label_color_map[
81 | # det_labels[i]]) # a third rectangle at an offset of 1 pixel to increase line thickness
82 | # draw.rectangle(xy=[l + 3. for l in box_location], outline=label_color_map[
83 | # det_labels[i]]) # a fourth rectangle at an offset of 1 pixel to increase line thickness
84 |
85 | # Text
86 | text_size = font.getsize(det_labels[i].upper())
87 | text_location = [box_location[0] + 2., box_location[1] - text_size[1]]
88 | textbox_location = [box_location[0], box_location[1] - text_size[1], box_location[0] + text_size[0] + 4.,
89 | box_location[1]]
90 | draw.rectangle(xy=textbox_location, fill=label_color_map[det_labels[i]])
91 | draw.text(xy=text_location, text=det_labels[i].upper(), fill='white',
92 | font=font)
93 | del draw
94 |
95 | return annotated_image
96 |
97 |
98 | if __name__ == '__main__':
99 | img_path = '/media/ssd/ssd data/VOC2007/JPEGImages/000001.jpg'
100 | original_image = Image.open(img_path, mode='r')
101 | original_image = original_image.convert('RGB')
102 | detect(original_image, min_score=0.2, max_overlap=0.5, top_k=200).show()
103 |
--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | from utils import *
2 | from datasets import PascalVOCDataset
3 | from tqdm import tqdm
4 | from pprint import PrettyPrinter
5 |
6 | # Good formatting when printing the APs for each class and mAP
7 | pp = PrettyPrinter()
8 |
9 | # Parameters
10 | data_folder = './'
11 | keep_difficult = True # difficult ground truth objects must always be considered in mAP calculation, because these objects DO exist!
12 | batch_size = 64
13 | workers = 4
14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
15 | checkpoint = './checkpoint_ssd300.pth.tar'
16 |
17 | # Load model checkpoint that is to be evaluated
18 | checkpoint = torch.load(checkpoint)
19 | model = checkpoint['model']
20 | model = model.to(device)
21 |
22 | # Switch to eval mode
23 | model.eval()
24 |
25 | # Load test data
26 | test_dataset = PascalVOCDataset(data_folder,
27 | split='test',
28 | keep_difficult=keep_difficult)
29 | test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False,
30 | collate_fn=test_dataset.collate_fn, num_workers=workers, pin_memory=True)
31 |
32 |
33 | def evaluate(test_loader, model):
34 | """
35 | Evaluate.
36 |
37 | :param test_loader: DataLoader for test data
38 | :param model: model
39 | """
40 |
41 | # Make sure it's in eval mode
42 | model.eval()
43 |
44 | # Lists to store detected and true boxes, labels, scores
45 | det_boxes = list()
46 | det_labels = list()
47 | det_scores = list()
48 | true_boxes = list()
49 | true_labels = list()
50 | true_difficulties = list() # it is necessary to know which objects are 'difficult', see 'calculate_mAP' in utils.py
51 |
52 | with torch.no_grad():
53 | # Batches
54 | for i, (images, boxes, labels, difficulties) in enumerate(tqdm(test_loader, desc='Evaluating')):
55 | images = images.to(device) # (N, 3, 300, 300)
56 |
57 | # Forward prop.
58 | predicted_locs, predicted_scores = model(images)
59 |
60 | # Detect objects in SSD output
61 | det_boxes_batch, det_labels_batch, det_scores_batch = model.detect_objects(predicted_locs, predicted_scores,
62 | min_score=0.01, max_overlap=0.45,
63 | top_k=200)
64 | # Evaluation MUST be at min_score=0.01, max_overlap=0.45, top_k=200 for fair comparision with the paper's results and other repos
65 |
66 | # Store this batch's results for mAP calculation
67 | boxes = [b.to(device) for b in boxes]
68 | labels = [l.to(device) for l in labels]
69 | difficulties = [d.to(device) for d in difficulties]
70 |
71 | det_boxes.extend(det_boxes_batch)
72 | det_labels.extend(det_labels_batch)
73 | det_scores.extend(det_scores_batch)
74 | true_boxes.extend(boxes)
75 | true_labels.extend(labels)
76 | true_difficulties.extend(difficulties)
77 |
78 | # Calculate mAP
79 | APs, mAP = calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties)
80 |
81 | # Print AP for each class
82 | pp.pprint(APs)
83 |
84 | print('\nMean Average Precision (mAP): %.3f' % mAP)
85 |
86 |
87 | if __name__ == '__main__':
88 | evaluate(test_loader, model)
89 |
--------------------------------------------------------------------------------
/img/000001.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000001.jpg
--------------------------------------------------------------------------------
/img/000022.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000022.jpg
--------------------------------------------------------------------------------
/img/000029.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000029.jpg
--------------------------------------------------------------------------------
/img/000045.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000045.jpg
--------------------------------------------------------------------------------
/img/000062.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000062.jpg
--------------------------------------------------------------------------------
/img/000069.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000069.jpg
--------------------------------------------------------------------------------
/img/000075.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000075.jpg
--------------------------------------------------------------------------------
/img/000082.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000082.jpg
--------------------------------------------------------------------------------
/img/000085.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000085.jpg
--------------------------------------------------------------------------------
/img/000092.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000092.jpg
--------------------------------------------------------------------------------
/img/000098.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000098.jpg
--------------------------------------------------------------------------------
/img/000100.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000100.jpg
--------------------------------------------------------------------------------
/img/000116.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000116.jpg
--------------------------------------------------------------------------------
/img/000124.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000124.jpg
--------------------------------------------------------------------------------
/img/000127.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000127.jpg
--------------------------------------------------------------------------------
/img/000128.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000128.jpg
--------------------------------------------------------------------------------
/img/000139.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000139.jpg
--------------------------------------------------------------------------------
/img/000144.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000144.jpg
--------------------------------------------------------------------------------
/img/000145.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/000145.jpg
--------------------------------------------------------------------------------
/img/auxconv.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/auxconv.jpg
--------------------------------------------------------------------------------
/img/baseball.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/baseball.gif
--------------------------------------------------------------------------------
/img/bc1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/bc1.PNG
--------------------------------------------------------------------------------
/img/bc2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/bc2.PNG
--------------------------------------------------------------------------------
/img/confloss.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/confloss.jpg
--------------------------------------------------------------------------------
/img/cs.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/cs.PNG
--------------------------------------------------------------------------------
/img/ecs1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/ecs1.PNG
--------------------------------------------------------------------------------
/img/ecs2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/ecs2.PNG
--------------------------------------------------------------------------------
/img/fcconv1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/fcconv1.jpg
--------------------------------------------------------------------------------
/img/fcconv2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/fcconv2.jpg
--------------------------------------------------------------------------------
/img/fcconv3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/fcconv3.jpg
--------------------------------------------------------------------------------
/img/fcconv4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/fcconv4.jpg
--------------------------------------------------------------------------------
/img/incomplete.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/incomplete.jpg
--------------------------------------------------------------------------------
/img/jaccard.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/jaccard.jpg
--------------------------------------------------------------------------------
/img/locloss.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/locloss.jpg
--------------------------------------------------------------------------------
/img/matching1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/matching1.PNG
--------------------------------------------------------------------------------
/img/matching2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/matching2.jpg
--------------------------------------------------------------------------------
/img/modifiedvgg.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/modifiedvgg.PNG
--------------------------------------------------------------------------------
/img/nms1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/nms1.PNG
--------------------------------------------------------------------------------
/img/nms2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/nms2.PNG
--------------------------------------------------------------------------------
/img/nms3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/nms3.jpg
--------------------------------------------------------------------------------
/img/nms4.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/nms4.PNG
--------------------------------------------------------------------------------
/img/predconv1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/predconv1.jpg
--------------------------------------------------------------------------------
/img/predconv2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/predconv2.jpg
--------------------------------------------------------------------------------
/img/predconv3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/predconv3.jpg
--------------------------------------------------------------------------------
/img/predconv4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/predconv4.jpg
--------------------------------------------------------------------------------
/img/priors1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/priors1.jpg
--------------------------------------------------------------------------------
/img/priors2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/priors2.jpg
--------------------------------------------------------------------------------
/img/reshaping1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/reshaping1.jpg
--------------------------------------------------------------------------------
/img/reshaping2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/reshaping2.jpg
--------------------------------------------------------------------------------
/img/totalloss.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/totalloss.jpg
--------------------------------------------------------------------------------
/img/vgg16.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/vgg16.PNG
--------------------------------------------------------------------------------
/img/wh1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/wh1.jpg
--------------------------------------------------------------------------------
/img/wh2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/5d5b75968c3cf1ffd2718bf828a7560b7ee8e8b5/img/wh2.jpg
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 | from torch import nn
2 | from utils import *
3 | import torch.nn.functional as F
4 | from math import sqrt
5 | from itertools import product as product
6 | import torchvision
7 |
8 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
9 |
10 |
11 | class VGGBase(nn.Module):
12 | """
13 | VGG base convolutions to produce lower-level feature maps.
14 | """
15 |
16 | def __init__(self):
17 | super(VGGBase, self).__init__()
18 |
19 | # Standard convolutional layers in VGG16
20 | self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1) # stride = 1, by default
21 | self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
22 | self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
23 |
24 | self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
25 | self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
26 | self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
27 |
28 | self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
29 | self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
30 | self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
31 | self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceiling (not floor) here for even dims
32 |
33 | self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
34 | self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
35 | self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
36 | self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)
37 |
38 | self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
39 | self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
40 | self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
41 | self.pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1) # retains size because stride is 1 (and padding)
42 |
43 | # Replacements for FC6 and FC7 in VGG16
44 | self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6) # atrous convolution
45 |
46 | self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
47 |
48 | # Load pretrained layers
49 | self.load_pretrained_layers()
50 |
51 | def forward(self, image):
52 | """
53 | Forward propagation.
54 |
55 | :param image: images, a tensor of dimensions (N, 3, 300, 300)
56 | :return: lower-level feature maps conv4_3 and conv7
57 | """
58 | out = F.relu(self.conv1_1(image)) # (N, 64, 300, 300)
59 | out = F.relu(self.conv1_2(out)) # (N, 64, 300, 300)
60 | out = self.pool1(out) # (N, 64, 150, 150)
61 |
62 | out = F.relu(self.conv2_1(out)) # (N, 128, 150, 150)
63 | out = F.relu(self.conv2_2(out)) # (N, 128, 150, 150)
64 | out = self.pool2(out) # (N, 128, 75, 75)
65 |
66 | out = F.relu(self.conv3_1(out)) # (N, 256, 75, 75)
67 | out = F.relu(self.conv3_2(out)) # (N, 256, 75, 75)
68 | out = F.relu(self.conv3_3(out)) # (N, 256, 75, 75)
69 | out = self.pool3(out) # (N, 256, 38, 38), it would have been 37 if not for ceil_mode = True
70 |
71 | out = F.relu(self.conv4_1(out)) # (N, 512, 38, 38)
72 | out = F.relu(self.conv4_2(out)) # (N, 512, 38, 38)
73 | out = F.relu(self.conv4_3(out)) # (N, 512, 38, 38)
74 | conv4_3_feats = out # (N, 512, 38, 38)
75 | out = self.pool4(out) # (N, 512, 19, 19)
76 |
77 | out = F.relu(self.conv5_1(out)) # (N, 512, 19, 19)
78 | out = F.relu(self.conv5_2(out)) # (N, 512, 19, 19)
79 | out = F.relu(self.conv5_3(out)) # (N, 512, 19, 19)
80 | out = self.pool5(out) # (N, 512, 19, 19), pool5 does not reduce dimensions
81 |
82 | out = F.relu(self.conv6(out)) # (N, 1024, 19, 19)
83 |
84 | conv7_feats = F.relu(self.conv7(out)) # (N, 1024, 19, 19)
85 |
86 | # Lower-level feature maps
87 | return conv4_3_feats, conv7_feats
88 |
89 | def load_pretrained_layers(self):
90 | """
91 | As in the paper, we use a VGG-16 pretrained on the ImageNet task as the base network.
92 | There's one available in PyTorch, see https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.vgg16
93 | We copy these parameters into our network. It's straightforward for conv1 to conv5.
94 | However, the original VGG-16 does not contain the conv6 and conv7 layers.
95 | Therefore, we convert fc6 and fc7 into convolutional layers, and subsample by decimation. See 'decimate' in utils.py.
96 | """
97 | # Current state of base
98 | state_dict = self.state_dict()
99 | param_names = list(state_dict.keys())
100 |
101 | # Pretrained VGG base
102 | pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
103 | pretrained_param_names = list(pretrained_state_dict.keys())
104 |
105 | # Transfer conv. parameters from pretrained model to current model
106 | for i, param in enumerate(param_names[:-4]): # excluding conv6 and conv7 parameters
107 | state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]
108 |
109 | # Convert fc6, fc7 to convolutional layers, and subsample (by decimation) to sizes of conv6 and conv7
110 | # fc6
111 | conv_fc6_weight = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7) # (4096, 512, 7, 7)
112 | conv_fc6_bias = pretrained_state_dict['classifier.0.bias'] # (4096)
113 | state_dict['conv6.weight'] = decimate(conv_fc6_weight, m=[4, None, 3, 3]) # (1024, 512, 3, 3)
114 | state_dict['conv6.bias'] = decimate(conv_fc6_bias, m=[4]) # (1024)
115 | # fc7
116 | conv_fc7_weight = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1) # (4096, 4096, 1, 1)
117 | conv_fc7_bias = pretrained_state_dict['classifier.3.bias'] # (4096)
118 | state_dict['conv7.weight'] = decimate(conv_fc7_weight, m=[4, 4, None, None]) # (1024, 1024, 1, 1)
119 | state_dict['conv7.bias'] = decimate(conv_fc7_bias, m=[4]) # (1024)
120 |
121 | # Note: an FC layer of size (K) operating on a flattened version (C*H*W) of a 2D image of size (C, H, W)...
122 | # ...is equivalent to a convolutional layer with kernel size (H, W), input channels C, output channels K...
123 | # ...operating on the 2D image of size (C, H, W) without padding
124 |
125 | self.load_state_dict(state_dict)
126 |
127 | print("\nLoaded base model.\n")
128 |
129 |
130 | class AuxiliaryConvolutions(nn.Module):
131 | """
132 | Additional convolutions to produce higher-level feature maps.
133 | """
134 |
135 | def __init__(self):
136 | super(AuxiliaryConvolutions, self).__init__()
137 |
138 | # Auxiliary/additional convolutions on top of the VGG base
139 | self.conv8_1 = nn.Conv2d(1024, 256, kernel_size=1, padding=0) # stride = 1, by default
140 | self.conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1) # dim. reduction because stride > 1
141 |
142 | self.conv9_1 = nn.Conv2d(512, 128, kernel_size=1, padding=0)
143 | self.conv9_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1) # dim. reduction because stride > 1
144 |
145 | self.conv10_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
146 | self.conv10_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0) # dim. reduction because padding = 0
147 |
148 | self.conv11_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
149 | self.conv11_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0) # dim. reduction because padding = 0
150 |
151 | # Initialize convolutions' parameters
152 | self.init_conv2d()
153 |
154 | def init_conv2d(self):
155 | """
156 | Initialize convolution parameters.
157 | """
158 | for c in self.children():
159 | if isinstance(c, nn.Conv2d):
160 | nn.init.xavier_uniform_(c.weight)
161 | nn.init.constant_(c.bias, 0.)
162 |
163 | def forward(self, conv7_feats):
164 | """
165 | Forward propagation.
166 |
167 | :param conv7_feats: lower-level conv7 feature map, a tensor of dimensions (N, 1024, 19, 19)
168 | :return: higher-level feature maps conv8_2, conv9_2, conv10_2, and conv11_2
169 | """
170 | out = F.relu(self.conv8_1(conv7_feats)) # (N, 256, 19, 19)
171 | out = F.relu(self.conv8_2(out)) # (N, 512, 10, 10)
172 | conv8_2_feats = out # (N, 512, 10, 10)
173 |
174 | out = F.relu(self.conv9_1(out)) # (N, 128, 10, 10)
175 | out = F.relu(self.conv9_2(out)) # (N, 256, 5, 5)
176 | conv9_2_feats = out # (N, 256, 5, 5)
177 |
178 | out = F.relu(self.conv10_1(out)) # (N, 128, 5, 5)
179 | out = F.relu(self.conv10_2(out)) # (N, 256, 3, 3)
180 | conv10_2_feats = out # (N, 256, 3, 3)
181 |
182 | out = F.relu(self.conv11_1(out)) # (N, 128, 3, 3)
183 | conv11_2_feats = F.relu(self.conv11_2(out)) # (N, 256, 1, 1)
184 |
185 | # Higher-level feature maps
186 | return conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats
187 |
188 |
189 | class PredictionConvolutions(nn.Module):
190 | """
191 | Convolutions to predict class scores and bounding boxes using lower and higher-level feature maps.
192 |
193 | The bounding boxes (locations) are predicted as encoded offsets w.r.t each of the 8732 prior (default) boxes.
194 | See 'cxcy_to_gcxgcy' in utils.py for the encoding definition.
195 |
196 | The class scores represent the scores of each object class in each of the 8732 bounding boxes located.
197 | A high score for 'background' = no object.
198 | """
199 |
200 | def __init__(self, n_classes):
201 | """
202 | :param n_classes: number of different types of objects
203 | """
204 | super(PredictionConvolutions, self).__init__()
205 |
206 | self.n_classes = n_classes
207 |
208 | # Number of prior-boxes we are considering per position in each feature map
209 | n_boxes = {'conv4_3': 4,
210 | 'conv7': 6,
211 | 'conv8_2': 6,
212 | 'conv9_2': 6,
213 | 'conv10_2': 4,
214 | 'conv11_2': 4}
215 | # 4 prior-boxes implies we use 4 different aspect ratios, etc.
216 |
217 | # Localization prediction convolutions (predict offsets w.r.t prior-boxes)
218 | self.loc_conv4_3 = nn.Conv2d(512, n_boxes['conv4_3'] * 4, kernel_size=3, padding=1)
219 | self.loc_conv7 = nn.Conv2d(1024, n_boxes['conv7'] * 4, kernel_size=3, padding=1)
220 | self.loc_conv8_2 = nn.Conv2d(512, n_boxes['conv8_2'] * 4, kernel_size=3, padding=1)
221 | self.loc_conv9_2 = nn.Conv2d(256, n_boxes['conv9_2'] * 4, kernel_size=3, padding=1)
222 | self.loc_conv10_2 = nn.Conv2d(256, n_boxes['conv10_2'] * 4, kernel_size=3, padding=1)
223 | self.loc_conv11_2 = nn.Conv2d(256, n_boxes['conv11_2'] * 4, kernel_size=3, padding=1)
224 |
225 | # Class prediction convolutions (predict classes in localization boxes)
226 | self.cl_conv4_3 = nn.Conv2d(512, n_boxes['conv4_3'] * n_classes, kernel_size=3, padding=1)
227 | self.cl_conv7 = nn.Conv2d(1024, n_boxes['conv7'] * n_classes, kernel_size=3, padding=1)
228 | self.cl_conv8_2 = nn.Conv2d(512, n_boxes['conv8_2'] * n_classes, kernel_size=3, padding=1)
229 | self.cl_conv9_2 = nn.Conv2d(256, n_boxes['conv9_2'] * n_classes, kernel_size=3, padding=1)
230 | self.cl_conv10_2 = nn.Conv2d(256, n_boxes['conv10_2'] * n_classes, kernel_size=3, padding=1)
231 | self.cl_conv11_2 = nn.Conv2d(256, n_boxes['conv11_2'] * n_classes, kernel_size=3, padding=1)
232 |
233 | # Initialize convolutions' parameters
234 | self.init_conv2d()
235 |
236 | def init_conv2d(self):
237 | """
238 | Initialize convolution parameters.
239 | """
240 | for c in self.children():
241 | if isinstance(c, nn.Conv2d):
242 | nn.init.xavier_uniform_(c.weight)
243 | nn.init.constant_(c.bias, 0.)
244 |
245 | def forward(self, conv4_3_feats, conv7_feats, conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats):
246 | """
247 | Forward propagation.
248 |
249 | :param conv4_3_feats: conv4_3 feature map, a tensor of dimensions (N, 512, 38, 38)
250 | :param conv7_feats: conv7 feature map, a tensor of dimensions (N, 1024, 19, 19)
251 | :param conv8_2_feats: conv8_2 feature map, a tensor of dimensions (N, 512, 10, 10)
252 | :param conv9_2_feats: conv9_2 feature map, a tensor of dimensions (N, 256, 5, 5)
253 | :param conv10_2_feats: conv10_2 feature map, a tensor of dimensions (N, 256, 3, 3)
254 | :param conv11_2_feats: conv11_2 feature map, a tensor of dimensions (N, 256, 1, 1)
255 | :return: 8732 locations and class scores (i.e. w.r.t each prior box) for each image
256 | """
257 | batch_size = conv4_3_feats.size(0)
258 |
259 | # Predict localization boxes' bounds (as offsets w.r.t prior-boxes)
260 | l_conv4_3 = self.loc_conv4_3(conv4_3_feats) # (N, 16, 38, 38)
261 | l_conv4_3 = l_conv4_3.permute(0, 2, 3,
262 | 1).contiguous() # (N, 38, 38, 16), to match prior-box order (after .view())
263 | # (.contiguous() ensures it is stored in a contiguous chunk of memory, needed for .view() below)
264 | l_conv4_3 = l_conv4_3.view(batch_size, -1, 4) # (N, 5776, 4), there are a total 5776 boxes on this feature map
265 |
266 | l_conv7 = self.loc_conv7(conv7_feats) # (N, 24, 19, 19)
267 | l_conv7 = l_conv7.permute(0, 2, 3, 1).contiguous() # (N, 19, 19, 24)
268 | l_conv7 = l_conv7.view(batch_size, -1, 4) # (N, 2166, 4), there are a total of 2166 boxes on this feature map
269 |
270 | l_conv8_2 = self.loc_conv8_2(conv8_2_feats) # (N, 24, 10, 10)
271 | l_conv8_2 = l_conv8_2.permute(0, 2, 3, 1).contiguous() # (N, 10, 10, 24)
272 | l_conv8_2 = l_conv8_2.view(batch_size, -1, 4) # (N, 600, 4)
273 |
274 | l_conv9_2 = self.loc_conv9_2(conv9_2_feats) # (N, 24, 5, 5)
275 | l_conv9_2 = l_conv9_2.permute(0, 2, 3, 1).contiguous() # (N, 5, 5, 24)
276 | l_conv9_2 = l_conv9_2.view(batch_size, -1, 4) # (N, 150, 4)
277 |
278 | l_conv10_2 = self.loc_conv10_2(conv10_2_feats) # (N, 16, 3, 3)
279 | l_conv10_2 = l_conv10_2.permute(0, 2, 3, 1).contiguous() # (N, 3, 3, 16)
280 | l_conv10_2 = l_conv10_2.view(batch_size, -1, 4) # (N, 36, 4)
281 |
282 | l_conv11_2 = self.loc_conv11_2(conv11_2_feats) # (N, 16, 1, 1)
283 | l_conv11_2 = l_conv11_2.permute(0, 2, 3, 1).contiguous() # (N, 1, 1, 16)
284 | l_conv11_2 = l_conv11_2.view(batch_size, -1, 4) # (N, 4, 4)
285 |
286 | # Predict classes in localization boxes
287 | c_conv4_3 = self.cl_conv4_3(conv4_3_feats) # (N, 4 * n_classes, 38, 38)
288 | c_conv4_3 = c_conv4_3.permute(0, 2, 3,
289 | 1).contiguous() # (N, 38, 38, 4 * n_classes), to match prior-box order (after .view())
290 | c_conv4_3 = c_conv4_3.view(batch_size, -1,
291 | self.n_classes) # (N, 5776, n_classes), there are a total 5776 boxes on this feature map
292 |
293 | c_conv7 = self.cl_conv7(conv7_feats) # (N, 6 * n_classes, 19, 19)
294 | c_conv7 = c_conv7.permute(0, 2, 3, 1).contiguous() # (N, 19, 19, 6 * n_classes)
295 | c_conv7 = c_conv7.view(batch_size, -1,
296 | self.n_classes) # (N, 2166, n_classes), there are a total of 2166 boxes on this feature map
297 |
298 | c_conv8_2 = self.cl_conv8_2(conv8_2_feats) # (N, 6 * n_classes, 10, 10)
299 | c_conv8_2 = c_conv8_2.permute(0, 2, 3, 1).contiguous() # (N, 10, 10, 6 * n_classes)
300 | c_conv8_2 = c_conv8_2.view(batch_size, -1, self.n_classes) # (N, 600, n_classes)
301 |
302 | c_conv9_2 = self.cl_conv9_2(conv9_2_feats) # (N, 6 * n_classes, 5, 5)
303 | c_conv9_2 = c_conv9_2.permute(0, 2, 3, 1).contiguous() # (N, 5, 5, 6 * n_classes)
304 | c_conv9_2 = c_conv9_2.view(batch_size, -1, self.n_classes) # (N, 150, n_classes)
305 |
306 | c_conv10_2 = self.cl_conv10_2(conv10_2_feats) # (N, 4 * n_classes, 3, 3)
307 | c_conv10_2 = c_conv10_2.permute(0, 2, 3, 1).contiguous() # (N, 3, 3, 4 * n_classes)
308 | c_conv10_2 = c_conv10_2.view(batch_size, -1, self.n_classes) # (N, 36, n_classes)
309 |
310 | c_conv11_2 = self.cl_conv11_2(conv11_2_feats) # (N, 4 * n_classes, 1, 1)
311 | c_conv11_2 = c_conv11_2.permute(0, 2, 3, 1).contiguous() # (N, 1, 1, 4 * n_classes)
312 | c_conv11_2 = c_conv11_2.view(batch_size, -1, self.n_classes) # (N, 4, n_classes)
313 |
314 | # A total of 8732 boxes
315 | # Concatenate in this specific order (i.e. must match the order of the prior-boxes)
316 | locs = torch.cat([l_conv4_3, l_conv7, l_conv8_2, l_conv9_2, l_conv10_2, l_conv11_2], dim=1) # (N, 8732, 4)
317 | classes_scores = torch.cat([c_conv4_3, c_conv7, c_conv8_2, c_conv9_2, c_conv10_2, c_conv11_2],
318 | dim=1) # (N, 8732, n_classes)
319 |
320 | return locs, classes_scores
321 |
322 |
323 | class SSD300(nn.Module):
324 | """
325 | The SSD300 network - encapsulates the base VGG network, auxiliary, and prediction convolutions.
326 | """
327 |
328 | def __init__(self, n_classes):
329 | super(SSD300, self).__init__()
330 |
331 | self.n_classes = n_classes
332 |
333 | self.base = VGGBase()
334 | self.aux_convs = AuxiliaryConvolutions()
335 | self.pred_convs = PredictionConvolutions(n_classes)
336 |
337 | # Since lower level features (conv4_3_feats) have considerably larger scales, we take the L2 norm and rescale
338 | # Rescale factor is initially set at 20, but is learned for each channel during back-prop
339 | self.rescale_factors = nn.Parameter(torch.FloatTensor(1, 512, 1, 1)) # there are 512 channels in conv4_3_feats
340 | nn.init.constant_(self.rescale_factors, 20)
341 |
342 | # Prior boxes
343 | self.priors_cxcy = self.create_prior_boxes()
344 |
345 | def forward(self, image):
346 | """
347 | Forward propagation.
348 |
349 | :param image: images, a tensor of dimensions (N, 3, 300, 300)
350 | :return: 8732 locations and class scores (i.e. w.r.t each prior box) for each image
351 | """
352 | # Run VGG base network convolutions (lower level feature map generators)
353 | conv4_3_feats, conv7_feats = self.base(image) # (N, 512, 38, 38), (N, 1024, 19, 19)
354 |
355 | # Rescale conv4_3 after L2 norm
356 | norm = conv4_3_feats.pow(2).sum(dim=1, keepdim=True).sqrt() # (N, 1, 38, 38)
357 | conv4_3_feats = conv4_3_feats / norm # (N, 512, 38, 38)
358 | conv4_3_feats = conv4_3_feats * self.rescale_factors # (N, 512, 38, 38)
359 | # (PyTorch autobroadcasts singleton dimensions during arithmetic)
360 |
361 | # Run auxiliary convolutions (higher level feature map generators)
362 | conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats = \
363 | self.aux_convs(conv7_feats) # (N, 512, 10, 10), (N, 256, 5, 5), (N, 256, 3, 3), (N, 256, 1, 1)
364 |
365 | # Run prediction convolutions (predict offsets w.r.t prior-boxes and classes in each resulting localization box)
366 | locs, classes_scores = self.pred_convs(conv4_3_feats, conv7_feats, conv8_2_feats, conv9_2_feats, conv10_2_feats,
367 | conv11_2_feats) # (N, 8732, 4), (N, 8732, n_classes)
368 |
369 | return locs, classes_scores
370 |
371 | def create_prior_boxes(self):
372 | """
373 | Create the 8732 prior (default) boxes for the SSD300, as defined in the paper.
374 |
375 | :return: prior boxes in center-size coordinates, a tensor of dimensions (8732, 4)
376 | """
377 | fmap_dims = {'conv4_3': 38,
378 | 'conv7': 19,
379 | 'conv8_2': 10,
380 | 'conv9_2': 5,
381 | 'conv10_2': 3,
382 | 'conv11_2': 1}
383 |
384 | obj_scales = {'conv4_3': 0.1,
385 | 'conv7': 0.2,
386 | 'conv8_2': 0.375,
387 | 'conv9_2': 0.55,
388 | 'conv10_2': 0.725,
389 | 'conv11_2': 0.9}
390 |
391 | aspect_ratios = {'conv4_3': [1., 2., 0.5],
392 | 'conv7': [1., 2., 3., 0.5, .333],
393 | 'conv8_2': [1., 2., 3., 0.5, .333],
394 | 'conv9_2': [1., 2., 3., 0.5, .333],
395 | 'conv10_2': [1., 2., 0.5],
396 | 'conv11_2': [1., 2., 0.5]}
397 |
398 | fmaps = list(fmap_dims.keys())
399 |
400 | prior_boxes = []
401 |
402 | for k, fmap in enumerate(fmaps):
403 | for i in range(fmap_dims[fmap]):
404 | for j in range(fmap_dims[fmap]):
405 | cx = (j + 0.5) / fmap_dims[fmap]
406 | cy = (i + 0.5) / fmap_dims[fmap]
407 |
408 | for ratio in aspect_ratios[fmap]:
409 | prior_boxes.append([cx, cy, obj_scales[fmap] * sqrt(ratio), obj_scales[fmap] / sqrt(ratio)])
410 |
411 | # For an aspect ratio of 1, use an additional prior whose scale is the geometric mean of the
412 | # scale of the current feature map and the scale of the next feature map
413 | if ratio == 1.:
414 | try:
415 | additional_scale = sqrt(obj_scales[fmap] * obj_scales[fmaps[k + 1]])
416 | # For the last feature map, there is no "next" feature map
417 | except IndexError:
418 | additional_scale = 1.
419 | prior_boxes.append([cx, cy, additional_scale, additional_scale])
420 |
421 | prior_boxes = torch.FloatTensor(prior_boxes).to(device) # (8732, 4)
422 | prior_boxes.clamp_(0, 1) # (8732, 4); this line has no effect; see Remarks section in tutorial
423 |
424 | return prior_boxes
425 |
426 | def detect_objects(self, predicted_locs, predicted_scores, min_score, max_overlap, top_k):
427 | """
428 | Decipher the 8732 locations and class scores (output of the SSD300) to detect objects.
429 |
430 | For each class, perform Non-Maximum Suppression (NMS) on boxes that are above a minimum threshold.
431 |
432 | :param predicted_locs: predicted locations/boxes w.r.t the 8732 prior boxes, a tensor of dimensions (N, 8732, 4)
433 | :param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 8732, n_classes)
434 | :param min_score: minimum threshold for a box to be considered a match for a certain class
435 | :param max_overlap: maximum overlap two boxes can have so that the one with the lower score is not suppressed via NMS
436 | :param top_k: if there are a lot of resulting detections across all classes, keep only the top 'k'
437 | :return: detections (boxes, labels, and scores), lists of length batch_size
438 | """
439 | batch_size = predicted_locs.size(0)
440 | n_priors = self.priors_cxcy.size(0)
441 | predicted_scores = F.softmax(predicted_scores, dim=2) # (N, 8732, n_classes)
442 |
443 | # Lists to store final predicted boxes, labels, and scores for all images
444 | all_images_boxes = list()
445 | all_images_labels = list()
446 | all_images_scores = list()
447 |
448 | assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)
449 |
450 | for i in range(batch_size):
451 | # Decode object coordinates from the form we regressed predicted boxes to
452 | decoded_locs = cxcy_to_xy(
453 | gcxgcy_to_cxcy(predicted_locs[i], self.priors_cxcy)) # (8732, 4), these are fractional pt. coordinates
454 |
455 | # Lists to store boxes and scores for this image
456 | image_boxes = list()
457 | image_labels = list()
458 | image_scores = list()
459 |
460 | max_scores, best_label = predicted_scores[i].max(dim=1) # (8732)
461 |
462 | # Check for each class
463 | for c in range(1, self.n_classes):
464 | # Keep only predicted boxes and scores where scores for this class are above the minimum score
465 | class_scores = predicted_scores[i][:, c] # (8732)
466 | score_above_min_score = class_scores > min_score # torch.uint8 (byte) tensor, for indexing
467 | n_above_min_score = score_above_min_score.sum().item()
468 | if n_above_min_score == 0:
469 | continue
470 | class_scores = class_scores[score_above_min_score] # (n_qualified), n_qualified <= 8732
471 | class_decoded_locs = decoded_locs[score_above_min_score] # (n_qualified, 4)
472 |
473 | # Sort predicted boxes and scores by scores
474 | class_scores, sort_ind = class_scores.sort(dim=0, descending=True) # (n_qualified), (n_qualified)
475 | class_decoded_locs = class_decoded_locs[sort_ind] # (n_qualified, 4)
476 |
477 | # Find the overlap between predicted boxes
478 | overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs) # (n_qualified, n_qualified)
479 |
480 | # Non-Maximum Suppression (NMS)
481 |
482 | # A torch.uint8 (byte) tensor to keep track of which predicted boxes to suppress
483 | # 1 implies suppress, 0 implies don't suppress
484 | suppress = torch.zeros((n_above_min_score), dtype=torch.uint8).to(device) # (n_qualified)
485 |
486 | # Consider each box in order of decreasing scores
487 | for box in range(class_decoded_locs.size(0)):
488 | # If this box is already marked for suppression
489 | if suppress[box] == 1:
490 | continue
491 |
492 | # Suppress boxes whose overlaps (with this box) are greater than maximum overlap
493 | # Find such boxes and update suppress indices
494 | suppress = torch.max(suppress, overlap[box] > max_overlap)
495 | # The max operation retains previously suppressed boxes, like an 'OR' operation
496 |
497 | # Don't suppress this box, even though it has an overlap of 1 with itself
498 | suppress[box] = 0
499 |
500 | # Store only unsuppressed boxes for this class
501 | image_boxes.append(class_decoded_locs[1 - suppress])
502 | image_labels.append(torch.LongTensor((1 - suppress).sum().item() * [c]).to(device))
503 | image_scores.append(class_scores[1 - suppress])
504 |
505 | # If no object in any class is found, store a placeholder for 'background'
506 | if len(image_boxes) == 0:
507 | image_boxes.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
508 | image_labels.append(torch.LongTensor([0]).to(device))
509 | image_scores.append(torch.FloatTensor([0.]).to(device))
510 |
511 | # Concatenate into single tensors
512 | image_boxes = torch.cat(image_boxes, dim=0) # (n_objects, 4)
513 | image_labels = torch.cat(image_labels, dim=0) # (n_objects)
514 | image_scores = torch.cat(image_scores, dim=0) # (n_objects)
515 | n_objects = image_scores.size(0)
516 |
517 | # Keep only the top k objects
518 | if n_objects > top_k:
519 | image_scores, sort_ind = image_scores.sort(dim=0, descending=True)
520 | image_scores = image_scores[:top_k] # (top_k)
521 | image_boxes = image_boxes[sort_ind][:top_k] # (top_k, 4)
522 | image_labels = image_labels[sort_ind][:top_k] # (top_k)
523 |
524 | # Append to lists that store predicted boxes and scores for all images
525 | all_images_boxes.append(image_boxes)
526 | all_images_labels.append(image_labels)
527 | all_images_scores.append(image_scores)
528 |
529 | return all_images_boxes, all_images_labels, all_images_scores # lists of length batch_size
530 |
531 |
532 | class MultiBoxLoss(nn.Module):
533 | """
534 | The MultiBox loss, a loss function for object detection.
535 |
536 | This is a combination of:
537 | (1) a localization loss for the predicted locations of the boxes, and
538 | (2) a confidence loss for the predicted class scores.
539 | """
540 |
541 | def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.):
542 | super(MultiBoxLoss, self).__init__()
543 | self.priors_cxcy = priors_cxcy
544 | self.priors_xy = cxcy_to_xy(priors_cxcy)
545 | self.threshold = threshold
546 | self.neg_pos_ratio = neg_pos_ratio
547 | self.alpha = alpha
548 |
549 | self.smooth_l1 = nn.L1Loss() # *smooth* L1 loss in the paper; see Remarks section in the tutorial
550 | self.cross_entropy = nn.CrossEntropyLoss(reduce=False)
551 |
552 | def forward(self, predicted_locs, predicted_scores, boxes, labels):
553 | """
554 | Forward propagation.
555 |
556 | :param predicted_locs: predicted locations/boxes w.r.t the 8732 prior boxes, a tensor of dimensions (N, 8732, 4)
557 | :param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 8732, n_classes)
558 | :param boxes: true object bounding boxes in boundary coordinates, a list of N tensors
559 | :param labels: true object labels, a list of N tensors
560 | :return: multibox loss, a scalar
561 | """
562 | batch_size = predicted_locs.size(0)
563 | n_priors = self.priors_cxcy.size(0)
564 | n_classes = predicted_scores.size(2)
565 |
566 | assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)
567 |
568 | true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device) # (N, 8732, 4)
569 | true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device) # (N, 8732)
570 |
571 | # For each image
572 | for i in range(batch_size):
573 | n_objects = boxes[i].size(0)
574 |
575 | overlap = find_jaccard_overlap(boxes[i],
576 | self.priors_xy) # (n_objects, 8732)
577 |
578 | # For each prior, find the object that has the maximum overlap
579 | overlap_for_each_prior, object_for_each_prior = overlap.max(dim=0) # (8732)
580 |
581 | # We don't want a situation where an object is not represented in our positive (non-background) priors -
582 | # 1. An object might not be the best object for all priors, and is therefore not in object_for_each_prior.
583 | # 2. All priors with the object may be assigned as background based on the threshold (0.5).
584 |
585 | # To remedy this -
586 | # First, find the prior that has the maximum overlap for each object.
587 | _, prior_for_each_object = overlap.max(dim=1) # (N_o)
588 |
589 | # Then, assign each object to the corresponding maximum-overlap-prior. (This fixes 1.)
590 | object_for_each_prior[prior_for_each_object] = torch.LongTensor(range(n_objects)).to(device)
591 |
592 | # To ensure these priors qualify, artificially give them an overlap of greater than 0.5. (This fixes 2.)
593 | overlap_for_each_prior[prior_for_each_object] = 1.
594 |
595 | # Labels for each prior
596 | label_for_each_prior = labels[i][object_for_each_prior] # (8732)
597 | # Set priors whose overlaps with objects are less than the threshold to be background (no object)
598 | label_for_each_prior[overlap_for_each_prior < self.threshold] = 0 # (8732)
599 |
600 | # Store
601 | true_classes[i] = label_for_each_prior
602 |
603 | # Encode center-size object coordinates into the form we regressed predicted boxes to
604 | true_locs[i] = cxcy_to_gcxgcy(xy_to_cxcy(boxes[i][object_for_each_prior]), self.priors_cxcy) # (8732, 4)
605 |
606 | # Identify priors that are positive (object/non-background)
607 | positive_priors = true_classes != 0 # (N, 8732)
608 |
609 | # LOCALIZATION LOSS
610 |
611 | # Localization loss is computed only over positive (non-background) priors
612 | loc_loss = self.smooth_l1(predicted_locs[positive_priors], true_locs[positive_priors]) # (), scalar
613 |
614 | # Note: indexing with a torch.uint8 (byte) tensor flattens the tensor when indexing is across multiple dimensions (N & 8732)
615 | # So, if predicted_locs has the shape (N, 8732, 4), predicted_locs[positive_priors] will have (total positives, 4)
616 |
617 | # CONFIDENCE LOSS
618 |
619 | # Confidence loss is computed over positive priors and the most difficult (hardest) negative priors in each image
620 | # That is, FOR EACH IMAGE,
621 | # we will take the hardest (neg_pos_ratio * n_positives) negative priors, i.e where there is maximum loss
622 | # This is called Hard Negative Mining - it concentrates on hardest negatives in each image, and also minimizes pos/neg imbalance
623 |
624 | # Number of positive and hard-negative priors per image
625 | n_positives = positive_priors.sum(dim=1) # (N)
626 | n_hard_negatives = self.neg_pos_ratio * n_positives # (N)
627 |
628 | # First, find the loss for all priors
629 | conf_loss_all = self.cross_entropy(predicted_scores.view(-1, n_classes), true_classes.view(-1)) # (N * 8732)
630 | conf_loss_all = conf_loss_all.view(batch_size, n_priors) # (N, 8732)
631 |
632 | # We already know which priors are positive
633 | conf_loss_pos = conf_loss_all[positive_priors] # (sum(n_positives))
634 |
635 | # Next, find which priors are hard-negative
636 | # To do this, sort ONLY negative priors in each image in order of decreasing loss and take top n_hard_negatives
637 | conf_loss_neg = conf_loss_all.clone() # (N, 8732)
638 | conf_loss_neg[positive_priors] = 0. # (N, 8732), positive priors are ignored (never in top n_hard_negatives)
639 | conf_loss_neg, _ = conf_loss_neg.sort(dim=1, descending=True) # (N, 8732), sorted by decreasing hardness
640 | hardness_ranks = torch.LongTensor(range(n_priors)).unsqueeze(0).expand_as(conf_loss_neg).to(device) # (N, 8732)
641 | hard_negatives = hardness_ranks < n_hard_negatives.unsqueeze(1) # (N, 8732)
642 | conf_loss_hard_neg = conf_loss_neg[hard_negatives] # (sum(n_hard_negatives))
643 |
644 | # As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
645 | conf_loss = (conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.sum().float() # (), scalar
646 |
647 | # TOTAL LOSS
648 |
649 | return conf_loss + self.alpha * loc_loss
650 |
--------------------------------------------------------------------------------
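A quick way to sanity-check the modules defined in model.py above is to push a dummy batch through SSD300 and MultiBoxLoss and confirm the (N, 8732, 4) and (N, 8732, n_classes) shapes noted in the docstrings. The sketch below is not part of the repository; it assumes model.py and utils.py are importable from the working directory, and note that constructing SSD300 downloads the pretrained VGG-16 weights via torchvision on first run.

import torch
from model import SSD300, MultiBoxLoss   # the classes defined above

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_classes = 21                                          # 20 VOC classes + 'background', per utils.py
model = SSD300(n_classes=n_classes).to(device)          # downloads pretrained VGG-16 on first run
criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy).to(device)

images = torch.randn(2, 3, 300, 300).to(device)         # a dummy batch of N = 2 images
locs, scores = model(images)
print(locs.size(), scores.size())                       # (2, 8732, 4) and (2, 8732, 21)

# One made-up ground-truth box per image, in fractional boundary (x_min, y_min, x_max, y_max) coordinates
boxes = [torch.FloatTensor([[0.1, 0.2, 0.6, 0.8]]).to(device) for _ in range(2)]
labels = [torch.LongTensor([1]).to(device) for _ in range(2)]   # class 1 = 'aeroplane' in the utils.py label map
loss = criterion(locs, scores, boxes, labels)           # scalar: confidence loss + alpha * localization loss
print(loss.item())
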
/train.py:
--------------------------------------------------------------------------------
1 | import time
2 | import torch.backends.cudnn as cudnn
3 | import torch.optim
4 | import torch.utils.data
5 | from model import SSD300, MultiBoxLoss
6 | from datasets import PascalVOCDataset
7 | from utils import *
8 |
9 | # Data parameters
10 | data_folder = './' # folder with data files
11 | keep_difficult = True # use objects considered difficult to detect?
12 |
13 | # Model parameters
14 | # Not too many here since the SSD300 has a very specific structure
15 | n_classes = len(label_map) # number of different types of objects
16 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
17 |
18 | # Learning parameters
19 | checkpoint = None # path to model checkpoint, None if none
20 | batch_size = 8 # batch size
21 | iterations = 120000 # number of iterations to train
22 | workers = 4 # number of workers for loading data in the DataLoader
23 | print_freq = 200 # print training status every __ batches
24 | lr = 1e-3 # learning rate
25 | decay_lr_at = [80000, 100000] # decay learning rate after these many iterations
26 | decay_lr_to = 0.1 # decay learning rate to this fraction of the existing learning rate
27 | momentum = 0.9 # momentum
28 | weight_decay = 5e-4 # weight decay
29 | grad_clip = None # clip if gradients are exploding, which may happen at larger batch sizes (sometimes at 32) - you will recognize it by a sorting error in the MultiBox loss calculation
30 |
31 | cudnn.benchmark = True
32 |
33 |
34 | def main():
35 | """
36 | Training.
37 | """
38 | global start_epoch, label_map, epoch, checkpoint, decay_lr_at
39 |
40 | # Initialize model or load checkpoint
41 | if checkpoint is None:
42 | start_epoch = 0
43 | model = SSD300(n_classes=n_classes)
44 | # Initialize the optimizer, with twice the default learning rate for biases, as in the original Caffe repo
45 | biases = list()
46 | not_biases = list()
47 | for param_name, param in model.named_parameters():
48 | if param.requires_grad:
49 | if param_name.endswith('.bias'):
50 | biases.append(param)
51 | else:
52 | not_biases.append(param)
53 | optimizer = torch.optim.SGD(params=[{'params': biases, 'lr': 2 * lr}, {'params': not_biases}],
54 | lr=lr, momentum=momentum, weight_decay=weight_decay)
55 |
56 | else:
57 | checkpoint = torch.load(checkpoint)
58 | start_epoch = checkpoint['epoch'] + 1
59 | print('\nLoaded checkpoint from epoch %d.\n' % start_epoch)
60 | model = checkpoint['model']
61 | optimizer = checkpoint['optimizer']
62 |
63 | # Move to default device
64 | model = model.to(device)
65 | criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy).to(device)
66 |
67 | # Custom dataloaders
68 | train_dataset = PascalVOCDataset(data_folder,
69 | split='train',
70 | keep_difficult=keep_difficult)
71 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
72 | collate_fn=train_dataset.collate_fn, num_workers=workers,
73 | pin_memory=True) # note that we're passing the collate function here
74 |
75 | # Calculate total number of epochs to train and the epochs to decay learning rate at (i.e. convert iterations to epochs)
76 | # To convert iterations to epochs, divide iterations by the number of iterations per epoch
77 | # The paper trains for 120,000 iterations with a batch size of 32, decays after 80,000 and 100,000 iterations
78 | epochs = iterations // (len(train_dataset) // 32)
79 | decay_lr_at = [it // (len(train_dataset) // 32) for it in decay_lr_at]
80 |
81 | # Epochs
82 | for epoch in range(start_epoch, epochs):
83 |
84 | # Decay learning rate at particular epochs
85 | if epoch in decay_lr_at:
86 | adjust_learning_rate(optimizer, decay_lr_to)
87 |
88 | # One epoch's training
89 | train(train_loader=train_loader,
90 | model=model,
91 | criterion=criterion,
92 | optimizer=optimizer,
93 | epoch=epoch)
94 |
95 | # Save checkpoint
96 | save_checkpoint(epoch, model, optimizer)
97 |
98 |
99 | def train(train_loader, model, criterion, optimizer, epoch):
100 | """
101 | One epoch's training.
102 |
103 | :param train_loader: DataLoader for training data
104 | :param model: model
105 | :param criterion: MultiBox loss
106 | :param optimizer: optimizer
107 | :param epoch: epoch number
108 | """
109 | model.train() # training mode enables dropout
110 |
111 | batch_time = AverageMeter() # forward prop. + back prop. time
112 | data_time = AverageMeter() # data loading time
113 | losses = AverageMeter() # loss
114 |
115 | start = time.time()
116 |
117 | # Batches
118 | for i, (images, boxes, labels, _) in enumerate(train_loader):
119 | data_time.update(time.time() - start)
120 |
121 | # Move to default device
122 | images = images.to(device) # (batch_size (N), 3, 300, 300)
123 | boxes = [b.to(device) for b in boxes]
124 | labels = [l.to(device) for l in labels]
125 |
126 | # Forward prop.
127 | predicted_locs, predicted_scores = model(images) # (N, 8732, 4), (N, 8732, n_classes)
128 |
129 | # Loss
130 | loss = criterion(predicted_locs, predicted_scores, boxes, labels) # scalar
131 |
132 | # Backward prop.
133 | optimizer.zero_grad()
134 | loss.backward()
135 |
136 | # Clip gradients, if necessary
137 | if grad_clip is not None:
138 | clip_gradient(optimizer, grad_clip)
139 |
140 | # Update model
141 | optimizer.step()
142 |
143 | losses.update(loss.item(), images.size(0))
144 | batch_time.update(time.time() - start)
145 |
146 | start = time.time()
147 |
148 | # Print status
149 | if i % print_freq == 0:
150 | print('Epoch: [{0}][{1}/{2}]\t'
151 | 'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
152 | 'Data Time {data_time.val:.3f} ({data_time.avg:.3f})\t'
153 | 'Loss {loss.val:.4f} ({loss.avg:.4f})\t'.format(epoch, i, len(train_loader),
154 | batch_time=batch_time,
155 | data_time=data_time, loss=losses))
156 | del predicted_locs, predicted_scores, images, boxes, labels # free some memory since their histories may be stored
157 |
158 |
159 | if __name__ == '__main__':
160 | main()
161 |
--------------------------------------------------------------------------------
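Since train.py above specifies the schedule in iterations but loops over epochs, it helps to see the conversion in main() with concrete numbers. The figures below are a sketch: 16,551 is the usual size of the combined VOC 2007 + 2012 trainval split this tutorial trains on (5,011 + 11,540 images); if your split differs, the epoch counts will too.

iterations = 120000                          # as in the paper and in train.py
decay_lr_at = [80000, 100000]

n_train_images = 16551                       # assumed VOC07+12 trainval size: 5,011 + 11,540 images
iters_per_epoch = n_train_images // 32       # 517 iterations per epoch at the paper's batch size of 32
epochs = iterations // iters_per_epoch       # 232 epochs
decay_lr_at_epochs = [it // iters_per_epoch for it in decay_lr_at]   # decay at epochs [154, 193]

print(epochs, decay_lr_at_epochs)            # 232 [154, 193]
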
/utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import torch
4 | import random
5 | import xml.etree.ElementTree as ET
6 | import torchvision.transforms.functional as FT
7 |
8 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
9 |
10 | # Label map
11 | voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable',
12 | 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
13 | label_map = {k: v + 1 for v, k in enumerate(voc_labels)}
14 | label_map['background'] = 0
15 | rev_label_map = {v: k for k, v in label_map.items()} # Inverse mapping
16 |
17 | # Color map for bounding boxes of detected objects from https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
18 | distinct_colors = ['#e6194b', '#3cb44b', '#ffe119', '#0082c8', '#f58231', '#911eb4', '#46f0f0', '#f032e6',
19 | '#d2f53c', '#fabebe', '#008080', '#000080', '#aa6e28', '#fffac8', '#800000', '#aaffc3', '#808000',
20 | '#ffd8b1', '#e6beff', '#808080', '#FFFFFF']
21 | label_color_map = {k: distinct_colors[i] for i, k in enumerate(label_map.keys())}
22 |
23 |
24 | def parse_annotation(annotation_path):
25 | tree = ET.parse(annotation_path)
26 | root = tree.getroot()
27 |
28 | boxes = list()
29 | labels = list()
30 | difficulties = list()
31 | for object in root.iter('object'):
32 |
33 | difficult = int(object.find('difficult').text == '1')
34 |
35 | label = object.find('name').text.lower().strip()
36 | if label not in label_map:
37 | continue
38 |
39 | bbox = object.find('bndbox')
40 | xmin = int(bbox.find('xmin').text) - 1
41 | ymin = int(bbox.find('ymin').text) - 1
42 | xmax = int(bbox.find('xmax').text) - 1
43 | ymax = int(bbox.find('ymax').text) - 1
44 |
45 | boxes.append([xmin, ymin, xmax, ymax])
46 | labels.append(label_map[label])
47 | difficulties.append(difficult)
48 |
49 | return {'boxes': boxes, 'labels': labels, 'difficulties': difficulties}
50 |
51 |
52 | def create_data_lists(voc07_path, voc12_path, output_folder):
53 | """
54 | Create lists of images, the bounding boxes and labels of the objects in these images, and save these to file.
55 |
56 | :param voc07_path: path to the 'VOC2007' folder
57 | :param voc12_path: path to the 'VOC2012' folder
58 | :param output_folder: folder where the JSONs must be saved
59 | """
60 | voc07_path = os.path.abspath(voc07_path)
61 | voc12_path = os.path.abspath(voc12_path)
62 |
63 | train_images = list()
64 | train_objects = list()
65 | n_objects = 0
66 |
67 | # Training data
68 | for path in [voc07_path, voc12_path]:
69 |
70 | # Find IDs of images in training data
71 | with open(os.path.join(path, 'ImageSets/Main/trainval.txt')) as f:
72 | ids = f.read().splitlines()
73 |
74 | for id in ids:
75 | # Parse annotation's XML file
76 | objects = parse_annotation(os.path.join(path, 'Annotations', id + '.xml'))
77 | if len(objects['boxes']) == 0:
78 | continue
79 | n_objects += len(objects['boxes'])
80 | train_objects.append(objects)
81 | train_images.append(os.path.join(path, 'JPEGImages', id + '.jpg'))
82 |
83 | assert len(train_objects) == len(train_images)
84 |
85 | # Save to file
86 | with open(os.path.join(output_folder, 'TRAIN_images.json'), 'w') as j:
87 | json.dump(train_images, j)
88 | with open(os.path.join(output_folder, 'TRAIN_objects.json'), 'w') as j:
89 | json.dump(train_objects, j)
90 | with open(os.path.join(output_folder, 'label_map.json'), 'w') as j:
91 | json.dump(label_map, j) # save label map too
92 |
93 | print('\nThere are %d training images containing a total of %d objects. Files have been saved to %s.' % (
94 | len(train_images), n_objects, os.path.abspath(output_folder)))
95 |
96 | # Test data
97 | test_images = list()
98 | test_objects = list()
99 | n_objects = 0
100 |
101 | # Find IDs of images in the test data
102 | with open(os.path.join(voc07_path, 'ImageSets/Main/test.txt')) as f:
103 | ids = f.read().splitlines()
104 |
105 | for id in ids:
106 | # Parse annotation's XML file
107 | objects = parse_annotation(os.path.join(voc07_path, 'Annotations', id + '.xml'))
108 | if len(objects['boxes']) == 0:
109 | continue
110 | test_objects.append(objects)
111 | n_objects += len(objects['boxes'])
112 | test_images.append(os.path.join(voc07_path, 'JPEGImages', id + '.jpg'))
113 |
114 | assert len(test_objects) == len(test_images)
115 |
116 | # Save to file
117 | with open(os.path.join(output_folder, 'TEST_images.json'), 'w') as j:
118 | json.dump(test_images, j)
119 | with open(os.path.join(output_folder, 'TEST_objects.json'), 'w') as j:
120 | json.dump(test_objects, j)
121 |
122 | print('\nThere are %d test images containing a total of %d objects. Files have been saved to %s.' % (
123 | len(test_images), n_objects, os.path.abspath(output_folder)))
124 |
125 |
126 | def decimate(tensor, m):
127 | """
128 | Decimate a tensor by a factor 'm', i.e. downsample by keeping every 'm'th value.
129 |
130 | This is used when we convert FC layers to equivalent Convolutional layers, BUT of a smaller size.
131 |
132 | :param tensor: tensor to be decimated
133 | :param m: list of decimation factors for each dimension of the tensor; None if not to be decimated along a dimension
134 | :return: decimated tensor
135 | """
136 | assert tensor.dim() == len(m)
137 | for d in range(tensor.dim()):
138 | if m[d] is not None:
139 | tensor = tensor.index_select(dim=d,
140 | index=torch.arange(start=0, end=tensor.size(d), step=m[d]).long())
141 |
142 | return tensor
143 |
144 |
145 | def calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties):
146 | """
147 | Calculate the Mean Average Precision (mAP) of detected objects.
148 |
149 | See https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173 for an explanation
150 |
151 | :param det_boxes: list of tensors, one tensor for each image containing detected objects' bounding boxes
152 | :param det_labels: list of tensors, one tensor for each image containing detected objects' labels
153 | :param det_scores: list of tensors, one tensor for each image containing detected objects' labels' scores
154 | :param true_boxes: list of tensors, one tensor for each image containing actual objects' bounding boxes
155 | :param true_labels: list of tensors, one tensor for each image containing actual objects' labels
156 | :param true_difficulties: list of tensors, one tensor for each image containing actual objects' difficulty (0 or 1)
157 | :return: list of average precisions for all classes, mean average precision (mAP)
158 | """
159 | assert len(det_boxes) == len(det_labels) == len(det_scores) == len(true_boxes) == len(
160 | true_labels) == len(
161 | true_difficulties) # these are all lists of tensors of the same length, i.e. number of images
162 | n_classes = len(label_map)
163 |
164 | # Store all (true) objects in a single continuous tensor while keeping track of the image it is from
165 | true_images = list()
166 | for i in range(len(true_labels)):
167 | true_images.extend([i] * true_labels[i].size(0))
168 | true_images = torch.LongTensor(true_images).to(
169 | device) # (n_objects), n_objects is the total no. of objects across all images
170 | true_boxes = torch.cat(true_boxes, dim=0) # (n_objects, 4)
171 | true_labels = torch.cat(true_labels, dim=0) # (n_objects)
172 | true_difficulties = torch.cat(true_difficulties, dim=0) # (n_objects)
173 |
174 | assert true_images.size(0) == true_boxes.size(0) == true_labels.size(0)
175 |
176 | # Store all detections in a single continuous tensor while keeping track of the image it is from
177 | det_images = list()
178 | for i in range(len(det_labels)):
179 | det_images.extend([i] * det_labels[i].size(0))
180 | det_images = torch.LongTensor(det_images).to(device) # (n_detections)
181 | det_boxes = torch.cat(det_boxes, dim=0) # (n_detections, 4)
182 | det_labels = torch.cat(det_labels, dim=0) # (n_detections)
183 | det_scores = torch.cat(det_scores, dim=0) # (n_detections)
184 |
185 | assert det_images.size(0) == det_boxes.size(0) == det_labels.size(0) == det_scores.size(0)
186 |
187 | # Calculate APs for each class (except background)
188 | average_precisions = torch.zeros((n_classes - 1), dtype=torch.float) # (n_classes - 1)
189 | for c in range(1, n_classes):
190 | # Extract only objects with this class
191 | true_class_images = true_images[true_labels == c] # (n_class_objects)
192 | true_class_boxes = true_boxes[true_labels == c] # (n_class_objects, 4)
193 | true_class_difficulties = true_difficulties[true_labels == c] # (n_class_objects)
194 | n_easy_class_objects = (1 - true_class_difficulties).sum().item() # ignore difficult objects
195 |
196 | # Keep track of which true objects with this class have already been 'detected'
197 | # So far, none
198 | true_class_boxes_detected = torch.zeros((true_class_difficulties.size(0)), dtype=torch.uint8).to(
199 | device) # (n_class_objects)
200 |
201 | # Extract only detections with this class
202 | det_class_images = det_images[det_labels == c] # (n_class_detections)
203 | det_class_boxes = det_boxes[det_labels == c] # (n_class_detections, 4)
204 | det_class_scores = det_scores[det_labels == c] # (n_class_detections)
205 | n_class_detections = det_class_boxes.size(0)
206 | if n_class_detections == 0:
207 | continue
208 |
209 | # Sort detections in decreasing order of confidence/scores
210 | det_class_scores, sort_ind = torch.sort(det_class_scores, dim=0, descending=True) # (n_class_detections)
211 | det_class_images = det_class_images[sort_ind] # (n_class_detections)
212 | det_class_boxes = det_class_boxes[sort_ind] # (n_class_detections, 4)
213 |
214 | # In the order of decreasing scores, check if true or false positive
215 | true_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
216 | false_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
217 | for d in range(n_class_detections):
218 | this_detection_box = det_class_boxes[d].unsqueeze(0) # (1, 4)
219 | this_image = det_class_images[d] # (), scalar
220 |
221 | # Find objects in the same image with this class, their difficulties, and whether they have been detected before
222 | object_boxes = true_class_boxes[true_class_images == this_image] # (n_class_objects_in_img)
223 | object_difficulties = true_class_difficulties[true_class_images == this_image] # (n_class_objects_in_img)
224 | # If no such object in this image, then the detection is a false positive
225 | if object_boxes.size(0) == 0:
226 | false_positives[d] = 1
227 | continue
228 |
229 | # Find maximum overlap of this detection with objects in this image of this class
230 | overlaps = find_jaccard_overlap(this_detection_box, object_boxes) # (1, n_class_objects_in_img)
231 | max_overlap, ind = torch.max(overlaps.squeeze(0), dim=0) # (), () - scalars
232 |
233 | # 'ind' is the index of the object in these image-level tensors 'object_boxes', 'object_difficulties'
234 | # In the original class-level tensors 'true_class_boxes', etc., 'ind' corresponds to object with index...
235 | original_ind = torch.LongTensor(range(true_class_boxes.size(0)))[true_class_images == this_image][ind]
236 | # We need 'original_ind' to update 'true_class_boxes_detected'
237 |
238 | # If the maximum overlap is greater than the threshold of 0.5, it's a match
239 | if max_overlap.item() > 0.5:
240 | # If the object it matched with is 'difficult', ignore it
241 | if object_difficulties[ind] == 0:
242 |                     # If this object has not already been detected, it's a true positive
243 | if true_class_boxes_detected[original_ind] == 0:
244 | true_positives[d] = 1
245 | true_class_boxes_detected[original_ind] = 1 # this object has now been detected/accounted for
246 | # Otherwise, it's a false positive (since this object is already accounted for)
247 | else:
248 | false_positives[d] = 1
249 | # Otherwise, the detection occurs in a different location than the actual object, and is a false positive
250 | else:
251 | false_positives[d] = 1
252 |
253 | # Compute cumulative precision and recall at each detection in the order of decreasing scores
254 | cumul_true_positives = torch.cumsum(true_positives, dim=0) # (n_class_detections)
255 | cumul_false_positives = torch.cumsum(false_positives, dim=0) # (n_class_detections)
256 | cumul_precision = cumul_true_positives / (
257 | cumul_true_positives + cumul_false_positives + 1e-10) # (n_class_detections)
258 | cumul_recall = cumul_true_positives / n_easy_class_objects # (n_class_detections)
259 |
260 | # Find the mean of the maximum of the precisions corresponding to recalls above the threshold 't'
261 | recall_thresholds = torch.arange(start=0, end=1.1, step=.1).tolist() # (11)
262 | precisions = torch.zeros((len(recall_thresholds)), dtype=torch.float).to(device) # (11)
263 | for i, t in enumerate(recall_thresholds):
264 | recalls_above_t = cumul_recall >= t
265 | if recalls_above_t.any():
266 | precisions[i] = cumul_precision[recalls_above_t].max()
267 | else:
268 | precisions[i] = 0.
269 | average_precisions[c - 1] = precisions.mean() # c is in [1, n_classes - 1]
270 |
271 | # Calculate Mean Average Precision (mAP)
272 | mean_average_precision = average_precisions.mean().item()
273 |
274 | # Keep class-wise average precisions in a dictionary
275 | average_precisions = {rev_label_map[c + 1]: v for c, v in enumerate(average_precisions.tolist())}
276 |
277 | return average_precisions, mean_average_precision
278 |
279 |
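# Rough call pattern for calculate_mAP(), mirroring how eval.py uses it: each list has one entry per
# test image, boxes are in fractional boundary coordinates, and the variable names below are
# placeholders rather than anything defined in this file.
#
# >>> # det_boxes[i]: (n_det_i, 4); det_labels[i], det_scores[i]: (n_det_i)
# >>> # true_boxes[i]: (n_obj_i, 4); true_labels[i], true_difficulties[i]: (n_obj_i)
# >>> APs, mAP = calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties)
# >>> APs    # dict mapping class name -> AP; mAP is the mean of these per-class values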
280 | def xy_to_cxcy(xy):
281 | """
282 | Convert bounding boxes from boundary coordinates (x_min, y_min, x_max, y_max) to center-size coordinates (c_x, c_y, w, h).
283 |
284 | :param xy: bounding boxes in boundary coordinates, a tensor of size (n_boxes, 4)
285 | :return: bounding boxes in center-size coordinates, a tensor of size (n_boxes, 4)
286 | """
287 | return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2, # c_x, c_y
288 | xy[:, 2:] - xy[:, :2]], 1) # w, h
289 |
290 |
291 | def cxcy_to_xy(cxcy):
292 | """
293 | Convert bounding boxes from center-size coordinates (c_x, c_y, w, h) to boundary coordinates (x_min, y_min, x_max, y_max).
294 |
295 | :param cxcy: bounding boxes in center-size coordinates, a tensor of size (n_boxes, 4)
296 | :return: bounding boxes in boundary coordinates, a tensor of size (n_boxes, 4)
297 | """
298 | return torch.cat([cxcy[:, :2] - (cxcy[:, 2:] / 2), # x_min, y_min
299 | cxcy[:, :2] + (cxcy[:, 2:] / 2)], 1) # x_max, y_max
300 |
301 |
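# Quick sanity check of the two conversions above, with made-up coordinates: cxcy_to_xy() inverts
# xy_to_cxcy(), up to floating-point rounding.
#
# >>> xy = torch.FloatTensor([[0.2, 0.2, 0.6, 0.8]])
# >>> xy_to_cxcy(xy)        # center (0.4, 0.5), width 0.4, height 0.6
# tensor([[0.4000, 0.5000, 0.4000, 0.6000]])
# >>> cxcy_to_xy(xy_to_cxcy(xy))   # recovers the original boundary coordinates
# tensor([[0.2000, 0.2000, 0.6000, 0.8000]])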
302 | def cxcy_to_gcxgcy(cxcy, priors_cxcy):
303 | """
304 | Encode bounding boxes (that are in center-size form) w.r.t. the corresponding prior boxes (that are in center-size form).
305 |
306 | For the center coordinates, find the offset with respect to the prior box, and scale by the size of the prior box.
307 | For the size coordinates, scale by the size of the prior box, and convert to the log-space.
308 |
309 | In the model, we are predicting bounding box coordinates in this encoded form.
310 |
311 | :param cxcy: bounding boxes in center-size coordinates, a tensor of size (n_priors, 4)
312 | :param priors_cxcy: prior boxes with respect to which the encoding must be performed, a tensor of size (n_priors, 4)
313 | :return: encoded bounding boxes, a tensor of size (n_priors, 4)
314 | """
315 |
316 | # The 10 and 5 below are referred to as 'variances' in the original Caffe repo, completely empirical
317 | # They are for some sort of numerical conditioning, for 'scaling the localization gradient'
318 | # See https://github.com/weiliu89/caffe/issues/155
319 | return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:] / 10), # g_c_x, g_c_y
320 | torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1) # g_w, g_h
321 |
322 |
323 | def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
324 | """
325 | Decode bounding box coordinates predicted by the model, since they are encoded in the form mentioned above.
326 |
327 | They are decoded into center-size coordinates.
328 |
329 | This is the inverse of the function above.
330 |
331 | :param gcxgcy: encoded bounding boxes, i.e. output of the model, a tensor of size (n_priors, 4)
332 | :param priors_cxcy: prior boxes with respect to which the encoding is defined, a tensor of size (n_priors, 4)
333 | :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
334 | """
335 |
336 | return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] / 10 + priors_cxcy[:, :2], # c_x, c_y
337 | torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], 1) # w, h
338 |
339 |
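# Rough round-trip check for the encode/decode pair above, with one made-up prior and box
# (both in center-size form, as the functions expect):
#
# >>> prior = torch.FloatTensor([[0.5, 0.5, 0.2, 0.2]])
# >>> box = torch.FloatTensor([[0.55, 0.50, 0.40, 0.20]])
# >>> g = cxcy_to_gcxgcy(box, prior)   # offsets ~ [2.5, 0.0], log-scales ~ [3.47, 0.0]
# >>> gcxgcy_to_cxcy(g, prior)         # ~ the original box, [[0.55, 0.50, 0.40, 0.20]]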
340 | def find_intersection(set_1, set_2):
341 | """
342 | Find the intersection of every box combination between two sets of boxes that are in boundary coordinates.
343 |
344 | :param set_1: set 1, a tensor of dimensions (n1, 4)
345 | :param set_2: set 2, a tensor of dimensions (n2, 4)
346 | :return: intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
347 | """
348 |
349 | # PyTorch auto-broadcasts singleton dimensions
350 | lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0)) # (n1, n2, 2)
351 | upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0)) # (n1, n2, 2)
352 | intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0) # (n1, n2, 2)
353 | return intersection_dims[:, :, 0] * intersection_dims[:, :, 1] # (n1, n2)
354 |
355 |
356 | def find_jaccard_overlap(set_1, set_2):
357 | """
358 | Find the Jaccard Overlap (IoU) of every box combination between two sets of boxes that are in boundary coordinates.
359 |
360 | :param set_1: set 1, a tensor of dimensions (n1, 4)
361 | :param set_2: set 2, a tensor of dimensions (n2, 4)
362 | :return: Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
363 | """
364 |
365 | # Find intersections
366 | intersection = find_intersection(set_1, set_2) # (n1, n2)
367 |
368 | # Find areas of each box in both sets
369 | areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1]) # (n1)
370 | areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1]) # (n2)
371 |
372 | # Find the union
373 | # PyTorch auto-broadcasts singleton dimensions
374 | union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection # (n1, n2)
375 |
376 | return intersection / union # (n1, n2)
377 |
378 |
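# Small worked example for find_jaccard_overlap(), using two made-up boxes in boundary coordinates:
# each box below has area 4 and they intersect in a 1x1 square, so IoU = 1 / (4 + 4 - 1) ~ 0.1429.
#
# >>> a = torch.FloatTensor([[0., 0., 2., 2.]])
# >>> b = torch.FloatTensor([[1., 1., 3., 3.]])
# >>> find_jaccard_overlap(a, b)
# tensor([[0.1429]])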
379 | # Some augmentation functions below have been adapted from
380 | # https://github.com/amdegroot/ssd.pytorch/blob/master/utils/augmentations.py
381 |
382 | def expand(image, boxes, filler):
383 | """
384 | Perform a zooming out operation by placing the image in a larger canvas of filler material.
385 |
386 | Helps to learn to detect smaller objects.
387 |
388 | :param image: image, a tensor of dimensions (3, original_h, original_w)
389 | :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
390 |     :param filler: RGB values of the filler material, a list like [R, G, B]
391 | :return: expanded image, updated bounding box coordinates
392 | """
393 | # Calculate dimensions of proposed expanded (zoomed-out) image
394 | original_h = image.size(1)
395 | original_w = image.size(2)
396 | max_scale = 4
397 | scale = random.uniform(1, max_scale)
398 | new_h = int(scale * original_h)
399 | new_w = int(scale * original_w)
400 |
401 | # Create such an image with the filler
402 | filler = torch.FloatTensor(filler) # (3)
403 | new_image = torch.ones((3, new_h, new_w), dtype=torch.float) * filler.unsqueeze(1).unsqueeze(1) # (3, new_h, new_w)
404 | # Note - do not use expand() like new_image = filler.unsqueeze(1).unsqueeze(1).expand(3, new_h, new_w)
405 | # because all expanded values will share the same memory, so changing one pixel will change all
406 |
407 | # Place the original image at random coordinates in this new image (origin at top-left of image)
408 | left = random.randint(0, new_w - original_w)
409 | right = left + original_w
410 | top = random.randint(0, new_h - original_h)
411 | bottom = top + original_h
412 | new_image[:, top:bottom, left:right] = image
413 |
414 | # Adjust bounding boxes' coordinates accordingly
415 | new_boxes = boxes + torch.FloatTensor([left, top, left, top]).unsqueeze(
416 | 0) # (n_objects, 4), n_objects is the no. of objects in this image
417 |
418 | return new_image, new_boxes
419 |
420 |
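# Typical call for expand(), roughly as transform() uses it further below: 'image' is a (3, H, W)
# float tensor (e.g. from FT.to_tensor), 'boxes' are absolute pixel boundary coordinates, and the
# filler is the ImageNet channel mean. 'image' and 'boxes' here are placeholders.
#
# >>> new_image, new_boxes = expand(image, boxes, filler=[0.485, 0.456, 0.406])
# >>> new_image.size(1) >= image.size(1)   # the canvas is 1x to 4x the original size
# True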
421 | def random_crop(image, boxes, labels, difficulties):
422 | """
423 | Performs a random crop in the manner stated in the paper. Helps to learn to detect larger and partial objects.
424 |
425 | Note that some objects may be cut out entirely.
426 |
427 | Adapted from https://github.com/amdegroot/ssd.pytorch/blob/master/utils/augmentations.py
428 |
429 | :param image: image, a tensor of dimensions (3, original_h, original_w)
430 | :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
431 | :param labels: labels of objects, a tensor of dimensions (n_objects)
432 | :param difficulties: difficulties of detection of these objects, a tensor of dimensions (n_objects)
433 | :return: cropped image, updated bounding box coordinates, updated labels, updated difficulties
434 | """
435 | original_h = image.size(1)
436 | original_w = image.size(2)
437 | # Keep choosing a minimum overlap until a successful crop is made
438 | while True:
439 | # Randomly draw the value for minimum overlap
440 | min_overlap = random.choice([0., .1, .3, .5, .7, .9, None]) # 'None' refers to no cropping
441 |
442 | # If not cropping
443 | if min_overlap is None:
444 | return image, boxes, labels, difficulties
445 |
446 | # Try up to 50 times for this choice of minimum overlap
447 | # This isn't mentioned in the paper, of course, but 50 is chosen in paper authors' original Caffe repo
448 | max_trials = 50
449 | for _ in range(max_trials):
450 | # Crop dimensions must be in [0.3, 1] of original dimensions
451 | # Note - it's [0.1, 1] in the paper, but actually [0.3, 1] in the authors' repo
452 | min_scale = 0.3
453 | scale_h = random.uniform(min_scale, 1)
454 | scale_w = random.uniform(min_scale, 1)
455 | new_h = int(scale_h * original_h)
456 | new_w = int(scale_w * original_w)
457 |
458 | # Aspect ratio has to be in [0.5, 2]
459 | aspect_ratio = new_h / new_w
460 | if not 0.5 < aspect_ratio < 2:
461 | continue
462 |
463 | # Crop coordinates (origin at top-left of image)
464 | left = random.randint(0, original_w - new_w)
465 | right = left + new_w
466 | top = random.randint(0, original_h - new_h)
467 | bottom = top + new_h
468 | crop = torch.FloatTensor([left, top, right, bottom]) # (4)
469 |
470 | # Calculate Jaccard overlap between the crop and the bounding boxes
471 | overlap = find_jaccard_overlap(crop.unsqueeze(0),
472 | boxes) # (1, n_objects), n_objects is the no. of objects in this image
473 | overlap = overlap.squeeze(0) # (n_objects)
474 |
475 | # If not a single bounding box has a Jaccard overlap of greater than the minimum, try again
476 | if overlap.max().item() < min_overlap:
477 | continue
478 |
479 | # Crop image
480 | new_image = image[:, top:bottom, left:right] # (3, new_h, new_w)
481 |
482 | # Find centers of original bounding boxes
483 | bb_centers = (boxes[:, :2] + boxes[:, 2:]) / 2. # (n_objects, 2)
484 |
485 | # Find bounding boxes whose centers are in the crop
486 | centers_in_crop = (bb_centers[:, 0] > left) * (bb_centers[:, 0] < right) * (bb_centers[:, 1] > top) * (
487 |                     bb_centers[:, 1] < bottom)  # (n_objects), a torch.uint8 (Byte) tensor, can be used as a boolean index
488 |
489 | # If not a single bounding box has its center in the crop, try again
490 | if not centers_in_crop.any():
491 | continue
492 |
493 | # Discard bounding boxes that don't meet this criterion
494 | new_boxes = boxes[centers_in_crop, :]
495 | new_labels = labels[centers_in_crop]
496 | new_difficulties = difficulties[centers_in_crop]
497 |
498 | # Calculate bounding boxes' new coordinates in the crop
499 | new_boxes[:, :2] = torch.max(new_boxes[:, :2], crop[:2]) # crop[:2] is [left, top]
500 | new_boxes[:, :2] -= crop[:2]
501 | new_boxes[:, 2:] = torch.min(new_boxes[:, 2:], crop[2:]) # crop[2:] is [right, bottom]
502 | new_boxes[:, 2:] -= crop[:2]
503 |
504 | return new_image, new_boxes, new_labels, new_difficulties
505 |
506 |
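# Typical call for random_crop(), again with placeholder inputs: a (3, H, W) image tensor,
# absolute-coordinate boxes, and per-object labels/difficulties. Objects whose centers fall
# outside the sampled crop are discarded, so the returned tensors may have fewer rows.
#
# >>> new_image, new_boxes, new_labels, new_difficulties = random_crop(image, boxes, labels, difficulties)
# >>> new_boxes.size(0) <= boxes.size(0)
# True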
507 | def flip(image, boxes):
508 | """
509 | Flip image horizontally.
510 |
511 | :param image: image, a PIL Image
512 | :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
513 | :return: flipped image, updated bounding box coordinates
514 | """
515 | # Flip image
516 | new_image = FT.hflip(image)
517 |
518 | # Flip boxes
519 |     new_boxes = boxes.clone()  # clone so that the caller's 'boxes' tensor is not modified in place
520 | new_boxes[:, 0] = image.width - boxes[:, 0] - 1
521 | new_boxes[:, 2] = image.width - boxes[:, 2] - 1
522 | new_boxes = new_boxes[:, [2, 1, 0, 3]]
523 |
524 | return new_image, new_boxes
525 |
526 |
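# Worked example for flip(): for a 300-pixel-wide image, a box (20, 50, 120, 200) becomes
# (179, 50, 279, 200) after mirroring, since x' = width - x - 1 and x_min/x_max swap roles.
# 'image' is a placeholder PIL Image of width 300.
#
# >>> new_image, new_boxes = flip(image, torch.FloatTensor([[20., 50., 120., 200.]]))
# >>> new_boxes
# tensor([[179.,  50., 279., 200.]])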
527 | def resize(image, boxes, dims=(300, 300), return_percent_coords=True):
528 | """
529 | Resize image. For the SSD300, resize to (300, 300).
530 |
531 | Since percent/fractional coordinates are calculated for the bounding boxes (w.r.t image dimensions) in this process,
532 | you may choose to retain them.
533 |
534 | :param image: image, a PIL Image
535 | :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
536 |     :return: resized image, updated bounding box coordinates (fractional coordinates if return_percent_coords is True, else absolute coordinates in the resized image)
537 | """
538 | # Resize image
539 | new_image = FT.resize(image, dims)
540 |
541 | # Resize bounding boxes
542 | old_dims = torch.FloatTensor([image.width, image.height, image.width, image.height]).unsqueeze(0)
543 | new_boxes = boxes / old_dims # percent coordinates
544 |
545 | if not return_percent_coords:
546 | new_dims = torch.FloatTensor([dims[1], dims[0], dims[1], dims[0]]).unsqueeze(0)
547 | new_boxes = new_boxes * new_dims
548 |
549 | return new_image, new_boxes
550 |
551 |
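# Worked example for resize(): a 600x400 (width x height) image with box (60, 40, 300, 200) yields
# the fractional box (0.1, 0.1, 0.5, 0.5) after dividing by the old dimensions. 'image' is a
# placeholder PIL Image of size 600x400.
#
# >>> new_image, new_boxes = resize(image, torch.FloatTensor([[60., 40., 300., 200.]]), dims=(300, 300))
# >>> new_boxes   # fractional coordinates, independent of the target dims
# tensor([[0.1000, 0.1000, 0.5000, 0.5000]])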
552 | def photometric_distort(image):
553 | """
554 | Distort brightness, contrast, saturation, and hue, each with a 50% chance, in random order.
555 |
556 | :param image: image, a PIL Image
557 | :return: distorted image
558 | """
559 | new_image = image
560 |
561 | distortions = [FT.adjust_brightness,
562 | FT.adjust_contrast,
563 | FT.adjust_saturation,
564 | FT.adjust_hue]
565 |
566 | random.shuffle(distortions)
567 |
568 | for d in distortions:
569 | if random.random() < 0.5:
570 |             if d.__name__ == 'adjust_hue':
571 | # Caffe repo uses a 'hue_delta' of 18 - we divide by 255 because PyTorch needs a normalized value
572 | adjust_factor = random.uniform(-18 / 255., 18 / 255.)
573 | else:
574 | # Caffe repo uses 'lower' and 'upper' values of 0.5 and 1.5 for brightness, contrast, and saturation
575 | adjust_factor = random.uniform(0.5, 1.5)
576 |
577 | # Apply this distortion
578 | new_image = d(new_image, adjust_factor)
579 |
580 | return new_image
581 |
582 |
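# photometric_distort() simply takes and returns a PIL Image; each of the four adjustments above is
# applied (or skipped) independently with probability 0.5, in a random order.
#
# >>> distorted = photometric_distort(pil_image)   # 'pil_image' is a placeholder PIL Image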
583 | def transform(image, boxes, labels, difficulties, split):
584 | """
585 | Apply the transformations above.
586 |
587 | :param image: image, a PIL Image
588 | :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
589 | :param labels: labels of objects, a tensor of dimensions (n_objects)
590 | :param difficulties: difficulties of detection of these objects, a tensor of dimensions (n_objects)
591 | :param split: one of 'TRAIN' or 'TEST', since different sets of transformations are applied
592 | :return: transformed image, transformed bounding box coordinates, transformed labels, transformed difficulties
593 | """
594 | assert split in {'TRAIN', 'TEST'}
595 |
596 | # Mean and standard deviation of ImageNet data that our base VGG from torchvision was trained on
597 | # see: https://pytorch.org/docs/stable/torchvision/models.html
598 | mean = [0.485, 0.456, 0.406]
599 | std = [0.229, 0.224, 0.225]
600 |
601 | new_image = image
602 | new_boxes = boxes
603 | new_labels = labels
604 | new_difficulties = difficulties
605 | # Skip the following operations for evaluation/testing
606 | if split == 'TRAIN':
607 | # A series of photometric distortions in random order, each with 50% chance of occurrence, as in Caffe repo
608 | new_image = photometric_distort(new_image)
609 |
610 | # Convert PIL image to Torch tensor
611 | new_image = FT.to_tensor(new_image)
612 |
613 | # Expand image (zoom out) with a 50% chance - helpful for training detection of small objects
614 | # Fill surrounding space with the mean of ImageNet data that our base VGG was trained on
615 | if random.random() < 0.5:
616 | new_image, new_boxes = expand(new_image, boxes, filler=mean)
617 |
618 | # Randomly crop image (zoom in)
619 | new_image, new_boxes, new_labels, new_difficulties = random_crop(new_image, new_boxes, new_labels,
620 | new_difficulties)
621 |
622 | # Convert Torch tensor to PIL image
623 | new_image = FT.to_pil_image(new_image)
624 |
625 | # Flip image with a 50% chance
626 | if random.random() < 0.5:
627 | new_image, new_boxes = flip(new_image, new_boxes)
628 |
629 | # Resize image to (300, 300) - this also converts absolute boundary coordinates to their fractional form
630 | new_image, new_boxes = resize(new_image, new_boxes, dims=(300, 300))
631 |
632 | # Convert PIL image to Torch tensor
633 | new_image = FT.to_tensor(new_image)
634 |
635 | # Normalize by mean and standard deviation of ImageNet data that our base VGG was trained on
636 | new_image = FT.normalize(new_image, mean=mean, std=std)
637 |
638 | return new_image, new_boxes, new_labels, new_difficulties
639 |
640 |
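# End-to-end call for transform(), roughly as the dataset class uses it: 'image' is a PIL Image and
# 'boxes' are absolute boundary coordinates read from the JSON files. After the call, the image is a
# normalized (3, 300, 300) tensor and the boxes are in fractional coordinates.
#
# >>> img_t, boxes_t, labels_t, diffs_t = transform(image, boxes, labels, difficulties, split='TRAIN')
# >>> img_t.shape
# torch.Size([3, 300, 300])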
641 | def adjust_learning_rate(optimizer, scale):
642 | """
643 | Scale learning rate by a specified factor.
644 |
645 | :param optimizer: optimizer whose learning rate must be shrunk.
646 | :param scale: factor to multiply learning rate with.
647 | """
648 | for param_group in optimizer.param_groups:
649 | param_group['lr'] = param_group['lr'] * scale
650 | print("DECAYING learning rate.\n The new LR is %f\n" % (optimizer.param_groups[1]['lr'],))
651 |
652 |
653 | def accuracy(scores, targets, k):
654 | """
655 | Computes top-k accuracy, from predicted and true labels.
656 |
657 | :param scores: scores from the model
658 | :param targets: true labels
659 | :param k: k in top-k accuracy
660 | :return: top-k accuracy
661 | """
662 | batch_size = targets.size(0)
663 | _, ind = scores.topk(k, 1, True, True)
664 | correct = ind.eq(targets.view(-1, 1).expand_as(ind))
665 | correct_total = correct.view(-1).float().sum() # 0D tensor
666 | return correct_total.item() * (100.0 / batch_size)
667 |
668 |
669 | def save_checkpoint(epoch, model, optimizer):
670 | """
671 | Save model checkpoint.
672 |
673 | :param epoch: epoch number
674 | :param model: model
675 | :param optimizer: optimizer
676 | """
677 | state = {'epoch': epoch,
678 | 'model': model,
679 | 'optimizer': optimizer}
680 | filename = 'checkpoint_ssd300.pth.tar'
681 | torch.save(state, filename)
682 |
683 |
684 | class AverageMeter(object):
685 | """
686 | Keeps track of most recent, average, sum, and count of a metric.
687 | """
688 |
689 | def __init__(self):
690 | self.reset()
691 |
692 | def reset(self):
693 | self.val = 0
694 | self.avg = 0
695 | self.sum = 0
696 | self.count = 0
697 |
698 | def update(self, val, n=1):
699 | self.val = val
700 | self.sum += val * n
701 | self.count += n
702 | self.avg = self.sum / self.count
703 |
704 |
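# Usage pattern for AverageMeter, e.g. for tracking the loss across batches during training:
# 'val' holds the latest value and 'avg' the running (count-weighted) mean.
#
# >>> losses = AverageMeter()
# >>> losses.update(1.0, n=8)   # a batch of 8 samples with mean loss 1.0
# >>> losses.update(0.5, n=8)
# >>> (losses.val, losses.avg)
# (0.5, 0.75)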
705 | def clip_gradient(optimizer, grad_clip):
706 | """
707 | Clips gradients computed during backpropagation to avoid explosion of gradients.
708 |
709 | :param optimizer: optimizer with the gradients to be clipped
710 | :param grad_clip: clip value
711 | """
712 | for group in optimizer.param_groups:
713 | for param in group['params']:
714 | if param.grad is not None:
715 | param.grad.data.clamp_(-grad_clip, grad_clip)
716 |
--------------------------------------------------------------------------------