# Mastering Object Detection (SSD Edition)

![](https://upload-images.jianshu.io/upload_images/13575947-08e4cd04dd185415.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

After working through [Mastering Object Detection (SSD Edition) (Part 1)](https://www.jianshu.com/p/8d894605bb06), you should already understand the basic principles and limitations of object detection. This article walks through how to implement an SSD object detection model. Fair warning: the material is fairly brain-bending, and it assumes you have mastered [Part 1](https://www.jianshu.com/p/8d894605bb06), though I will do my best to explain everything in plain language. Github: [https://github.com/alexshuang/pascal-voc-pytorch](https://github.com/alexshuang/pascal-voc-pytorch).

## SSD / [Paper](https://arxiv.org/abs/1512.02325) / [Notebook](https://github.com/alexshuang/pascal-voc-pytorch/blob/master/pascal_voc2012_ssd.ipynb)

SSD's full name, Single Shot MultiBox Detector, reveals the essence of the algorithm. "Single Shot" means detection happens in a single forward pass of the network. Each "Box" in "MultiBox" is like the viewfinder we use when taking a photo: it attends only to what is inside the frame and ignores everything outside it. Create "Multi" boxes, run single-object detection inside each box, and aggregate the per-box results: that is multi-object detection. In other words, SSD slices the image into N patches, runs independent single-object detection on each patch, and finally merges the per-patch results.

![Figure 1: SSD arch](https://upload-images.jianshu.io/upload_images/13575947-47dbeb80969ad6ac.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The way SSD slices the image is convolution, or rather, the receptive field. As the architecture diagram shows, SSD's top layers (the extra feature layers) are convolutional. Suppose one of them outputs a tensor of shape [64, 25, 4, 4]: in that 4x4 feature map, every grid cell represents a region of the original image. Put differently, if you tile the whole image with a 4x4 grid, each grid cell of the feature map corresponds to one tile.

![Figure 2: 4x4 grid cells](https://upload-images.jianshu.io/upload_images/13575947-4eebfe8d0f62cc91.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The grid in Figure 2 is the MultiBox, and each box's single-object detection result is stored along the channel dimension of the convolution output. For the tensor [64, 25, 4, 4]: 25 = bounding box + class probabilities (with an extra "Background" class) = 4 + 21.

As Figure 1 shows, SSD keeps reducing the number of grid cells through pooling layers (or strided convolutions), e.g. 4x4 -> 2x2 -> 1x1, and aggregates all the results, so that boxes of different sizes can anchor objects of different sizes.

## Classification

Continuing the approach from Part 1, I decompose multi-object detection into two independent operations: classification and localization (location). Unlike the single-object model, the multi-object detection model ultimately uses sigmoid() rather than softmax() to produce the class probabilities. To check the classification model's accuracy, I keep every class whose probability exceeds a threshold (0.4). As you can see, the classification model works (lowering the threshold reveals more classes).

![](https://upload-images.jianshu.io/upload_images/13575947-ec8c0b3fa708ba48.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

## Ground Truth (Location)

Ground truth refers to an image's annotations: classification labels, bounding boxes, segmentation masks, and so on.

```
i = 9
bb = y[0][i].view(-1, 4)
clas = y[1][i]
bb, clas, bb.shape, clas.shape

(tensor([[  0.,   0.,   0.,   0.],
         [  0.,   0.,   0.,   0.],
         [  0.,   0.,   0.,   0.],
         [105.,   4., 161.,  28.],
         [ 70.,   0., 149.,  66.],
         [ 50.,  24., 185., 129.],
         [ 19.,  60., 223., 222.]], device='cuda:0'),
 tensor([ 0,  0,  0,  4, 14, 14, 14], device='cuda:0'),
 torch.Size([7, 4]),
 torch.Size([7]))
```

Since the number of ground-truth objects differs from sample to sample, PyTorch pads the y matrix with zeros to keep the mini-batch tensor rectangular. Before using the data, you therefore have to drop the ground-truth rows whose bounding boxes are all zeros.
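Here is a minimal sketch of that de-padding step. It mirrors the `bb_gt[:, 2] > 0` check used later in `_ssd_loss()`; the helper name `get_y` is mine, not the notebook's:

```
def get_y(bb, clas):
    "Drop the all-zero rows that padding added to the ground truth."
    bb = bb.view(-1, 4)
    keep = bb[:, 2] > 0  # padded rows are all zeros, so their 3rd coordinate is 0
    return bb[keep], clas[keep]

bb, clas = get_y(y[0][9], y[1][9])  # for the sample above: 7 rows -> 4 real boxes
```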
```
i = 9
fig, ax = plt.subplots(figsize=(6, 4))
ax.imshow(ima[i])
draw_gt(ax, y[0][i].view(-1, 4), y[1][i], num_classes=len(labels))
ax.axis('off')
```

![](https://upload-images.jianshu.io/upload_images/13575947-6cb5940151cbb60e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

---

## SSD Network Part 1

```
def conv_layer(nin, nf, stride=2, drop=0.1):
    return nn.Sequential(
        nn.Conv2d(nin, nf, 3, stride, 1, bias=False),
        nn.ReLU(),
        nn.BatchNorm2d(nf),
        nn.Dropout(drop)
    )

class Outlayer(nn.Module):
    def __init__(self, nf, num_classes, bias):
        super().__init__()
        self.clas_conv = nn.Conv2d(nf, num_classes + 1, 3, 1, 1)
        self.bb_conv = nn.Conv2d(nf, 4, 3, 1, 1)
        self.clas_conv.bias.data.zero_().add_(bias)

    def flatten(self, x):
        bs, nf, w, h = x.size()
        x = x.permute(0, 2, 3, 1).contiguous()
        return x.view(bs, -1, nf)

    def forward(self, x):
        return [self.flatten(self.bb_conv(x)), self.flatten(self.clas_conv(x))]

class SSDHead(nn.Module):
    def __init__(self, num_classes, nf, bias, drop_i=0.25):
        super().__init__()
        self.conv1 = conv_layer(512, nf, stride=1)
        self.conv2 = conv_layer(nf, nf)
        self.drop_i = nn.Dropout(drop_i)
        self.out = Outlayer(nf, num_classes, bias=bias)

    def forward(self, x):
        x = self.drop_i(F.relu(x))
        x = self.conv1(x)
        x = self.conv2(x)
        return self.out(x)

ssd_head_f = SSDHead(num_classes, nf, bias=-3.)
```

My backbone is Resnet34, whose final output is 7x7x512. After the stride-2 conv2(), we get the 4x4 feature map shown in Figure 2. Outlayer() then produces two outputs with 4 and 21 channels respectively: the former is related to the bounding box, the latter holds the class probabilities. The clas_conv bias is initialized to -3 because otherwise the model's total loss starts out too large; training can still push the loss down, but the model never reaches the expected quality. Initializing the bias solves this.

> Why "related to the bounding box" rather than the bounding box itself?
> As [Part 1](https://www.jianshu.com/p/8d894605bb06) mentioned, image-recognition models like Resnet are not good at generating spatial data, so SSD does not output bounding boxes directly. Instead it outputs each bounding box's offset relative to a default box, a predefined bounding box like the grid cells in Figure 2.

## Default Box

The default boxes are the "MultiBox": SSD's viewfinder frames, i.e. the grid cells in Figure 2. Each consists of [center x, center y, width, height].

```
cells = 4
width = 1 / cells
cx = np.repeat(np.linspace(width / 2, 1 - (width / 2), cells), cells)
cy = np.tile(np.linspace(width / 2, 1 - (width / 2), cells), cells)
w = h = np.array([width] * cells**2)
def_box = T(np.stack([cx, cy, w, h], 1))
def_box

tensor([[0.1250, 0.1250, 0.2500, 0.2500],
        [0.1250, 0.3750, 0.2500, 0.2500],
        [0.1250, 0.6250, 0.2500, 0.2500],
        [0.1250, 0.8750, 0.2500, 0.2500],
        [0.3750, 0.1250, 0.2500, 0.2500],
        [0.3750, 0.3750, 0.2500, 0.2500],
        [0.3750, 0.6250, 0.2500, 0.2500],
        [0.3750, 0.8750, 0.2500, 0.2500],
        [0.6250, 0.1250, 0.2500, 0.2500],
        [0.6250, 0.3750, 0.2500, 0.2500],
        [0.6250, 0.6250, 0.2500, 0.2500],
        [0.6250, 0.8750, 0.2500, 0.2500],
        [0.8750, 0.1250, 0.2500, 0.2500],
        [0.8750, 0.3750, 0.2500, 0.2500],
        [0.8750, 0.6250, 0.2500, 0.2500],
        [0.8750, 0.8750, 0.2500, 0.2500]], device='cuda:0')
```
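The code in the sections below also relies on two helpers whose definitions live in the notebook: `def_box_to_bb()`, which converts center/size boxes to corner format, and `def_box_bb`, the corner-format version of `def_box`. A minimal sketch of what they plausibly look like (the exact axis conventions are the notebook's; this sketch assumes [x1, y1, x2, y2] corners):

```
def def_box_to_bb(center, wh):
    "Convert [cx, cy, w, h] boxes to corner format [x1, y1, x2, y2]."
    return torch.cat([center - wh / 2, center + wh / 2], 1)

def_box_bb = def_box_to_bb(def_box[:, :2], def_box[:, 2:])
```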
> Did you notice that many grid cells in Figure 2 are classified as background? Is the model wrong?
> Figure 2 shows the result of matching default boxes against the ground truth. Since there is no default box of a size similar to the ground truth, the best-fitting default box has to be chosen anyway, and because the two differ so much in size, the matches look wrong even though they are the best available.

## Jaccard Index

Default boxes and ground truth are matched to each other via the Jaccard index. jaccard() computes the overlap (intersection over union) between every default box and every ground truth; this IoU is the Jaccard index. Default boxes whose overlap > 0.5 are treated as matches, and the matching tells us which ground truth each default box corresponds to.

![](https://upload-images.jianshu.io/upload_images/13575947-73fcb1fd4e5659c2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
def box_size(box): return (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])

def intersection(gt, def_box):
    left_top = torch.max(gt[:, None, :2], def_box[None, :, :2])
    right_bottom = torch.min(gt[:, None, 2:], def_box[None, :, 2:])
    wh = torch.clamp(right_bottom - left_top, min=0)
    return wh[:, :, 0] * wh[:, :, 1]

def jaccard(gt, def_box):
    inter = intersection(gt, def_box)
    union = box_size(gt).unsqueeze(1) + box_size(def_box).unsqueeze(0) - inter
    return inter / union

overlap = jaccard(bb, def_box_bb * sz)
gt_best_overlap, gt_db_idx = overlap.max(1)
db_best_overlap, db_gt_idx = overlap.max(0)
db_best_overlap[gt_db_idx] = 1.1
is_obj = db_best_overlap > 0.5
pos_idxs = np.nonzero(is_obj)[:, 0]
neg_idxs = np.nonzero(1 - is_obj)[:, 0]
db_clas = T([num_classes] * len(db_best_overlap))
db_clas[pos_idxs] = clas[db_gt_idx[pos_idxs]]
db_best_overlap, db_clas
```

db_gt_idx is the id of the ground truth matched to each default box. db_best_overlap is the largest Jaccard overlap each default box achieves. Even a ground truth's best default box does not necessarily satisfy the > 0.5 requirement (see Figure 2), so we proactively raise the overlap of every ground truth's best default box to 1.1. db_clas is the class that this matching assigns to each default box.

## More Default Boxes

Remember SSD's architecture? As the extra feature layers go deeper, the feature map's grid cells grow larger, from 4x4 -> 2x2 -> 1x1, which lets them match objects of more sizes. On top of that, SSD also uses different aspect ratios to create default boxes of the same size but different shapes:

![](https://upload-images.jianshu.io/upload_images/13575947-2cac943e12e99d62.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As the figure shows, default boxes come in three broad shapes: wider than tall, taller than wide, and square. So the aspect ratios I use are [(1., 1.), (1., 0.5), (0.5, 1.)], and each shape gets the scale factors [0.7, 1., 1.3].

```
cells = np.array([4, 2, 1])
center_offsets = 1 / cells / 2
aspect_ratios = [(1., 1.), (1., .5), (.5, 1.)]
zooms = [0.7, 1., 1.3]
scales = [(o * i, o * j) for o in zooms for i, j in aspect_ratios]
k = len(scales)
k, scales

(9,
 [(0.7, 0.7),
  (0.7, 0.35),
  (0.35, 0.7),
  (1.0, 1.0),
  (1.0, 0.5),
  (0.5, 1.0),
  (1.3, 1.3),
  (1.3, 0.65),
  (0.65, 1.3)])
```

k is the number of shape variants each default box spawns from the aspect ratios and scales. If a default box is a camera, k is the number of professional lenses that come with it: different shooting scenes call for different lenses.

![Figure 3: (4x4 + 2x2 + 1x1) * k grid cells](https://upload-images.jianshu.io/upload_images/13575947-425120d2063c003c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As you can see, Figure 3 is much more precise than Figure 2, though of course it also has far more default boxes: (4x4 + 2x2 + 1x1) * k.
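The notebook assembles the full default-box set from `cells` and `scales`. Here is a hedged sketch of that construction; the real ordering has to match how `Outlayer.flatten()` lays out the network's predictions (the k shape variants of a cell kept contiguous), and the notebook may differ in details:

```
boxes = []
for c in cells:                          # 4, 2, 1
    width = 1 / c
    centers = np.linspace(width / 2, 1 - width / 2, c)
    cx, cy = np.repeat(centers, c), np.tile(centers, c)
    for i in range(c * c):               # keep each cell's k variants contiguous
        for sw, sh in scales:
            boxes.append([cx[i], cy[i], width * sw, width * sh])
def_box = T(np.array(boxes))             # (4*4 + 2*2 + 1*1) * 9 = 189 boxes
```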
## Loss Function

SSD's loss function is similar to the one from [Part 1](https://www.jianshu.com/p/8d894605bb06): compute the bounding-box loss (loc loss) and the classification loss (conf loss) separately; their sum is the final loss:

![loss function.png](https://upload-images.jianshu.io/upload_images/13575947-629062249c081da0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The loc loss is the L1 loss between the ground truth and the default boxes after correction by the bounding-box offsets (the SSD model's output). The conf loss is a binary cross entropy.

```
class BCELoss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def get_weight(self, x, t): return None

    def forward(self, x, t):
        x = x[:, :-1]
        one_hot_t = torch.eye(self.num_classes + 1)[t.data.cpu()]
        t = V(one_hot_t[:, :-1].contiguous())
        w = self.get_weight(x, t)
        return F.binary_cross_entropy_with_logits(x, t, w, size_average=False) / self.num_classes

bce_loss_f = BCELoss(num_classes)

def loc_loss(preds, targs):
    return (preds - targs).abs().mean()

def conf_loss(preds, targs):
    return bce_loss_f(preds, targs)
```

BCELoss drops the background class from the predictions because the db_clas built in _ssd_loss() contains a background class that does not belong to the dataset's label set. As for dividing conf_loss by self.num_classes: if binary cross entropy reduces the loss with sum, conf_loss comes out too large; if it reduces with mean, it comes out too small. Either way, training suffers. So, just like the bias initialization earlier, we actively rescale the loss, here by dividing by 20.
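To see why dropping the last column makes sigmoid + BCE handle background gracefully, here is a tiny worked example (made-up class ids, with num_classes = 20 so that index 20 is background):

```
t = torch.LongTensor([4, 14, 20])  # two real classes and one background box
one_hot = torch.eye(21)[t]         # [3, 21] one-hot targets
one_hot[:, :-1]                    # [3, 20]: the background row becomes all zeros,
                                   # i.e. "none of the 20 real classes" -- exactly
                                   # what the sigmoid outputs should learn to predict
```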
```
def offset_to_bb(off, db_bb):
    off = F.tanh(off)
    center = (off[:, :2] / 2) * db_bb[:, 2:] + db_bb[:, :2]
    wh = ((off[:, 2:] / 2) + 1) * db_bb[:, 2:]
    return def_box_to_bb(center, wh)

def _ssd_loss(db_offset, clas, bb_gt, clas_gt):
    bb = offset_to_bb(db_offset, def_box)
    bb_gt = bb_gt.view(-1, 4) / sz
    idxs = np.nonzero(bb_gt[:, 2] > 0)[:, 0]
    bb_gt, clas_gt = bb_gt[idxs], clas_gt[idxs]
    overlap = jaccard(bb_gt, def_box_bb)
    gt_best_overlap, gt_db_idx = overlap.max(1)
    db_best_overlap, db_gt_idx = overlap.max(0)
    db_best_overlap[gt_db_idx] = 1.1
    for i, o in enumerate(gt_db_idx): db_gt_idx[o] = i
    is_obj = db_best_overlap >= 0.5
    pos_idxs = np.nonzero(is_obj)[:, 0]
    neg_idxs = np.nonzero(1 - is_obj.data)[:, 0]
    db_clas = clas_gt[db_gt_idx]
    db_clas[neg_idxs] = len(labels)
    db_bb = bb_gt[db_gt_idx]
    return (loc_loss(bb[pos_idxs], db_bb[pos_idxs]), bce_loss_f(clas, db_clas))

def ssd_loss(preds, targs, print_loss=False):
    # alpha = 1.
    loc_loss, conf_loss = 0., 0.
    for i, (db_offset, clas, bb_gt, clas_gt) in enumerate(zip(*preds, *targs)):
        losses = _ssd_loss(db_offset, clas, bb_gt, clas_gt)
        loc_loss += losses[0]  # * alpha
        conf_loss += losses[1]
    if print_loss:
        print(f'loc loss: {loc_loss:.2f}, conf loss: {conf_loss:.2f}')
    return loc_loss + conf_loss
```

In _ssd_loss(), offset_to_bb() corrects the default boxes according to the predicted bounding-box offsets. The offset values act as scale factors on the default box: they not only shift its position but also change its width and height. Much of the code in _ssd_loss() was explained earlier; its purpose is to rebuild the ground truth on a per-default-box basis, because what we are predicting is the class inside each default box.

## Train 4x4

We have finally reached the training stage. To keep debugging simple, we first train only the 4x4-grid model, using the network defined in "SSD Network Part 1".

```
lr = 1e-2
learn.fit(lr, 1, cycle_len=8, use_clr=(20, 5))
learn.save('16')

epoch      trn_loss   val_loss
    0      33.574218  34.117771
    1      30.093091  29.408577
    2      27.206728  27.568285
    3      25.348878  26.957813
    4      23.976828  26.765239
    5      22.80882   26.695604
    6      21.532631  26.688388
    7      20.018111  26.610572
```

![](https://upload-images.jianshu.io/upload_images/13575947-f47c49ab24ef7447.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The test results show that the default boxes are no longer uniformly arranged, and their sizes now differ slightly. They look messy, but each one is just offset from its original position, as the box numbering shows. Overall, the model's predictions are more accurate than the static default boxes. Next, let's add more default boxes.

## SSD Network Part 2

```
class Outlayer(nn.Module):
    def __init__(self, nf, num_classes, bias):
        super().__init__()
        self.clas_conv = nn.Conv2d(nf, (num_classes + 1) * k, 3, 1, 1)
        self.bb_conv = nn.Conv2d(nf, 4 * k, 3, 1, 1)
        self.clas_conv.bias.data.zero_().add_(bias)

    def flatten(self, x):
        bs, nf, w, h = x.size()
        x = x.permute(0, 2, 3, 1).contiguous()
        return x.view(bs, -1, nf // k)

    def forward(self, x):
        return [self.flatten(self.bb_conv(x)), self.flatten(self.clas_conv(x))]

class SSDHead(nn.Module):
    def __init__(self, num_classes, nf, bias, drop_i=0.25, drop_h=0.1):
        super().__init__()
        self.conv1 = conv_layer(512, nf, stride=1, drop=drop_h)
        self.conv2 = conv_layer(nf, nf, drop=drop_h)  # 4x4
        self.conv3 = conv_layer(nf, nf, drop=drop_h)  # 2x2
        self.conv4 = conv_layer(nf, nf, drop=drop_h)  # 1x1
        self.drop_i = nn.Dropout(drop_i)
        self.out1 = Outlayer(nf, num_classes, bias)
        self.out2 = Outlayer(nf, num_classes, bias)
        self.out3 = Outlayer(nf, num_classes, bias)

    def forward(self, x):
        x = self.drop_i(F.relu(x))
        x = self.conv1(x)
        x = self.conv2(x)
        bb1, clas1 = self.out1(x)
        x = self.conv3(x)
        bb2, clas2 = self.out2(x)
        x = self.conv4(x)
        bb3, clas3 = self.out3(x)
        return [torch.cat([bb1, bb2, bb3], 1),
                torch.cat([clas1, clas2, clas3], 1)]

drops = [0.4, 0.2]
ssd_head_f = SSDHead(num_classes, nf, -4., drop_i=drops[0], drop_h=drops[1])
```

SSD aggregates the predictions of three detectors of different sizes: 4x4, 2x2, and 1x1. Since every default box now has k variants, each detector's output is k times larger than before. Judging from the earlier training results, the model wasn't regularized enough, so I also raise the dropout probabilities.

```
lr = 1e-2
learn.fit(lr, 1, cycle_len=10, use_clr=(20, 10))
learn.save('multi')

epoch      trn_loss   val_loss
    0      87.026507  75.858966
    1      68.657919  62.675859
    2      58.815842  78.257847
    3      53.675965  54.85459
    4      49.656684  53.707109
    5      46.777794  53.003534
    6      44.20865   51.358076
    7      41.394307  51.515281
    8      38.741202  50.559135
    9      36.69472   50.12559
```

![](https://upload-images.jianshu.io/upload_images/13575947-e6caf5a5f83f01e5.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

You can see the bounding boxes are larger than before, which is exactly what we hoped for. But the wine bottle on the table never got boxed. Why? The adjusted default boxes are larger overall than before, and since the bottle is small, its overlap stays below 0.5 and it cannot be localized. The most effective fix is to lower the overlap threshold, for example to 0.4.

## NMS

The last stage of the SSD model is NMS (non-maximum suppression), which prunes bounding boxes that overlap each other beyond a Jaccard overlap threshold, keeping only the best of each cluster. For testing I use an overlap threshold of 0.4 and keep the top 50 bounding boxes.
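The notebook uses a standard NMS routine. Below is a minimal sketch of the greedy algorithm, reusing the `jaccard()` defined earlier on corner-format boxes; the function name `nms` and its exact signature are mine:

```
def nms(boxes, scores, thresh=0.4, top_k=50):
    "Greedily keep the highest-scoring boxes, dropping ones that overlap a kept box."
    keep, idxs = [], scores.argsort(descending=True)
    while len(idxs) > 0 and len(keep) < top_k:
        i = idxs[0]
        keep.append(i.item())
        if len(idxs) == 1: break
        ious = jaccard(boxes[i][None], boxes[idxs[1:]])[0]
        idxs = idxs[1:][ious <= thresh]   # suppress boxes that overlap the winner
    return keep
```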
![](https://upload-images.jianshu.io/upload_images/13575947-8fa43060bac29343.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The result is not what we want: only 1 of the 4 objects was detected. Why is that?

```
x, y = next(iter(md.trn_dl))
yp = predict_batch(learn.model, x)
ssd_loss(yp, y, True)

loc loss: 3.65, conf loss: 28.08
tensor(31.7384, device='cuda:0', grad_fn=)
```

The culprit is the oversized conf_loss: classification accuracy is low. Looking at the network, localization and classification are independent only in the final layer; every other layer is shared. In other words, if classification accuracy is low, localization accuracy won't be much better. In fact, localization depends on classification: recognize first, then localize.

## Focal Loss / [Paper](https://arxiv.org/abs/1708.02002)

![](https://upload-images.jianshu.io/upload_images/13575947-ff420a35c83e81e8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The formula shows that focal loss is a scaled version of cross entropy: $(1 - p_t)^\gamma$ is the modulating factor, controlled by the hyperparameter $\gamma$. In object detection, focal loss performs far better than BCE. The intuition: by scaling the loss up or down, it turns ambivalent predictions into decisive ones. When gamma == 0, focal loss reduces to cross entropy (CE), the blue curve: even at a probability of 0.6, the loss is still >= 0.5, as if the model were saying, "I'm 60% sure this isn't class B. Hmm, I should keep optimizing my parameters; I can do better." When gamma == 2, at the same probability of 0.6 the loss is close to 0, as if it said, "I'm 60% sure this isn't class B, and with my years of case experience, it definitely isn't. Not the most confident call, but case closed. Let's put our energy into the cases where accuracy is still poor."

Focal loss down-weights well-classified examples, shrinking their loss values and hence their parameter updates, and leaves more of the optimization headroom to samples with low predicted probability, improving the model as a whole.

```
class FocalLoss(BCELoss):
    def get_weight(self, x, t):
        alpha, gamma = 0.25, 1
        p = x.sigmoid()
        pt = p*t + (1-p)*(1-t)
        w = alpha*t + (1-alpha)*(1-t)
        return w * (1-pt).pow(gamma)

bce_loss_f = FocalLoss(num_classes)
lr = 1e-2
learn.fit(lr, 1, cycle_len=10, use_clr=(20, 10))
learn.save('focal_loss')

epoch      trn_loss   val_loss
    0      17.30767   18.866698
    1      15.211579  13.772004
    2      13.563804  13.015255
    3      12.589626  12.785115
    4      11.926406  12.28807
    5      11.515744  11.814605
    6      11.109133  11.686357
    7      10.664063  11.424233
    8      10.285392  11.338397
    9      9.935587   11.185435
```

![](https://upload-images.jianshu.io/upload_images/13575947-5b6dfc604ba62249.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As expected: although the detector's confidence on the main object dropped (from 0.77 to 0.5), detection of the other objects improved. Apart from the wine bottle (for the reason analyzed earlier), the other three objects are all detected correctly.

## END

SSD is like a photographer with no talent but great diligence: every shot follows the same routine. Frame the scene, center the lens on the viewfinder, click the shutter. Yet he is also remarkable, tirelessly trying every shooting angle and every viewfinder frame. That concludes my walkthrough of the SSD algorithm. It can be a brain-bending journey, so you'll want to study the [code](https://github.com/alexshuang/pascal-voc-pytorch) alongside the paper.