# Mastering Object Detection (SSD Edition)

![](https://upload-images.jianshu.io/upload_images/13575947-08e4cd04dd185415.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

After working through [Mastering Object Detection (SSD Edition) (Part 1)](https://www.jianshu.com/p/8d894605bb06), you should already understand the basic principles and limitations of object detection. This article walks through how to implement an SSD object detection model. Fair warning: the material is fairly brain-bending, and it assumes you have mastered [Part 1](https://www.jianshu.com/p/8d894605bb06), though I will do my best to explain everything in plain language. Github: [https://github.com/alexshuang/pascal-voc-pytorch](https://github.com/alexshuang/pascal-voc-pytorch).

## SSD / [Paper](https://arxiv.org/abs/1512.02325) / [Notebook](https://github.com/alexshuang/pascal-voc-pytorch/blob/master/pascal_voc2012_ssd.ipynb)

SSD's full name, Single Shot MultiBox Detector, reveals the essence of the algorithm. "Single Shot" means detection happens in a single forward pass of the network. Each "Box" in "MultiBox" is like the viewfinder we use when taking a photo: it attends only to what is inside the frame and ignores everything outside it. Create "Multi" boxes, run single-object detection inside each box, and aggregate the per-box results: that is multi-object detection. In other words, SSD slices the image into N patches, runs independent single-object detection on each patch, and finally merges the per-patch results.

![Figure 1: SSD arch](https://upload-images.jianshu.io/upload_images/13575947-47dbeb80969ad6ac.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The way SSD slices the image is convolution, or rather, the receptive field. As the architecture diagram shows, SSD's top layers (the extra feature layers) are convolutional. Suppose one of them outputs a tensor of shape [64, 25, 4, 4]: in that 4x4 feature map, every grid cell represents a region of the original image. Put differently, if you tile the whole image with a 4x4 grid, each grid cell of the feature map corresponds to one tile.

![Figure 2: 4x4 grid cells](https://upload-images.jianshu.io/upload_images/13575947-4eebfe8d0f62cc91.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The grid in Figure 2 is the MultiBox, and each box's single-object detection result is stored along the channel dimension of the convolution output. For the tensor [64, 25, 4, 4]: 25 = bounding box + class probabilities (with an extra "Background" class) = 4 + 21.

As Figure 1 shows, SSD keeps reducing the number of grid cells through pooling layers (or strided convolutions), e.g. 4x4 -> 2x2 -> 1x1, and aggregates all the results, so that boxes of different sizes can anchor objects of different sizes.

## Classification

Continuing the approach from Part 1, I decompose multi-object detection into two independent operations: classification and localization (location). Unlike the single-object model, the multi-object detection model ultimately uses sigmoid() rather than softmax() to produce the class probabilities. To check the classification model's accuracy, I keep every class whose probability exceeds a threshold (0.4). As you can see, the classification model works (lowering the threshold reveals more classes).

![](https://upload-images.jianshu.io/upload_images/13575947-ec8c0b3fa708ba48.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

## Ground Truth (Location)

Ground truth refers to an image's annotations: classification labels, bounding boxes, segmentation masks, and so on.

```
i = 9
bb = y[0][i].view(-1, 4)
clas = y[1][i]
bb, clas, bb.shape, clas.shape

(tensor([[  0.,   0.,   0.,   0.],
         [  0.,   0.,   0.,   0.],
         [  0.,   0.,   0.,   0.],
         [105.,   4., 161.,  28.],
         [ 70.,   0., 149.,  66.],
         [ 50.,  24., 185., 129.],
         [ 19.,  60., 223., 222.]], device='cuda:0'),
 tensor([ 0,  0,  0,  4, 14, 14, 14], device='cuda:0'),
 torch.Size([7, 4]),
 torch.Size([7]))
```

Since the number of ground-truth objects differs from sample to sample, PyTorch pads the y matrix with zeros to keep the mini-batch tensor rectangular. Before using the data, you therefore have to drop the ground-truth rows whose bounding boxes are all zeros.
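Here is a minimal sketch of that de-padding step. It mirrors the `bb_gt[:, 2] > 0` check used later in `_ssd_loss()`; the helper name `get_y` is mine, not the notebook's:

```
def get_y(bb, clas):
    "Drop the all-zero rows that padding added to the ground truth."
    bb = bb.view(-1, 4)
    keep = bb[:, 2] > 0  # padded rows are all zeros, so their 3rd coordinate is 0
    return bb[keep], clas[keep]

bb, clas = get_y(y[0][9], y[1][9])  # for the sample above: 7 rows -> 4 real boxes
```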
```
i = 9
fig, ax = plt.subplots(figsize=(6, 4))
ax.imshow(ima[i])
draw_gt(ax, y[0][i].view(-1, 4), y[1][i], num_classes=len(labels))
ax.axis('off')
```

![](https://upload-images.jianshu.io/upload_images/13575947-6cb5940151cbb60e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

---

## SSD Network Part 1

```
def conv_layer(nin, nf, stride=2, drop=0.1):
    return nn.Sequential(
        nn.Conv2d(nin, nf, 3, stride, 1, bias=False),
        nn.ReLU(),
        nn.BatchNorm2d(nf),
        nn.Dropout(drop)
    )

class Outlayer(nn.Module):
    def __init__(self, nf, num_classes, bias):
        super().__init__()
        self.clas_conv = nn.Conv2d(nf, num_classes + 1, 3, 1, 1)
        self.bb_conv = nn.Conv2d(nf, 4, 3, 1, 1)
        self.clas_conv.bias.data.zero_().add_(bias)

    def flatten(self, x):
        bs, nf, w, h = x.size()
        x = x.permute(0, 2, 3, 1).contiguous()
        return x.view(bs, -1, nf)

    def forward(self, x):
        return [self.flatten(self.bb_conv(x)), self.flatten(self.clas_conv(x))]

class SSDHead(nn.Module):
    def __init__(self, num_classes, nf, bias, drop_i=0.25):
        super().__init__()
        self.conv1 = conv_layer(512, nf, stride=1)
        self.conv2 = conv_layer(nf, nf)
        self.drop_i = nn.Dropout(drop_i)
        self.out = Outlayer(nf, num_classes, bias=bias)

    def forward(self, x):
        x = self.drop_i(F.relu(x))
        x = self.conv1(x)
        x = self.conv2(x)
        return self.out(x)

ssd_head_f = SSDHead(num_classes, nf, bias=-3.)
```

My backbone is Resnet34, whose final output is 7x7x512. After the stride-2 conv2(), we get the 4x4 feature map shown in Figure 2. Outlayer() then produces two outputs with 4 and 21 channels respectively: the former is related to the bounding box, the latter holds the class probabilities. The clas_conv bias is initialized to -3 because otherwise the model's total loss starts out too large; training can still push the loss down, but the model never reaches the expected quality. Initializing the bias solves this.

> Why "related to the bounding box" rather than the bounding box itself?
> As [Part 1](https://www.jianshu.com/p/8d894605bb06) mentioned, image-recognition models like Resnet are not good at generating spatial data, so SSD does not output bounding boxes directly. Instead it outputs each bounding box's offset relative to a default box, a predefined bounding box like the grid cells in Figure 2.

## Default Box

The default boxes are the "MultiBox": SSD's viewfinder frames, i.e. the grid cells in Figure 2. Each consists of [center x, center y, width, height].

```
cells = 4
width = 1 / cells
cx = np.repeat(np.linspace(width / 2, 1 - (width / 2), cells), cells)
cy = np.tile(np.linspace(width / 2, 1 - (width / 2), cells), cells)
w = h = np.array([width] * cells**2)
def_box = T(np.stack([cx, cy, w, h], 1))
def_box

tensor([[0.1250, 0.1250, 0.2500, 0.2500],
        [0.1250, 0.3750, 0.2500, 0.2500],
        [0.1250, 0.6250, 0.2500, 0.2500],
        [0.1250, 0.8750, 0.2500, 0.2500],
        [0.3750, 0.1250, 0.2500, 0.2500],
        [0.3750, 0.3750, 0.2500, 0.2500],
        [0.3750, 0.6250, 0.2500, 0.2500],
        [0.3750, 0.8750, 0.2500, 0.2500],
        [0.6250, 0.1250, 0.2500, 0.2500],
        [0.6250, 0.3750, 0.2500, 0.2500],
        [0.6250, 0.6250, 0.2500, 0.2500],
        [0.6250, 0.8750, 0.2500, 0.2500],
        [0.8750, 0.1250, 0.2500, 0.2500],
        [0.8750, 0.3750, 0.2500, 0.2500],
        [0.8750, 0.6250, 0.2500, 0.2500],
        [0.8750, 0.8750, 0.2500, 0.2500]], device='cuda:0')
```
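The code in the sections below also relies on two helpers whose definitions live in the notebook: `def_box_to_bb()`, which converts center/size boxes to corner format, and `def_box_bb`, the corner-format version of `def_box`. A minimal sketch of what they plausibly look like (the exact axis conventions are the notebook's; this sketch assumes [x1, y1, x2, y2] corners):

```
def def_box_to_bb(center, wh):
    "Convert [cx, cy, w, h] boxes to corner format [x1, y1, x2, y2]."
    return torch.cat([center - wh / 2, center + wh / 2], 1)

def_box_bb = def_box_to_bb(def_box[:, :2], def_box[:, 2:])
```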
> Did you notice that many grid cells in Figure 2 are classified as background? Is the model wrong?
> Figure 2 shows the result of matching default boxes against the ground truth. Since there is no default box of a size similar to the ground truth, the best-fitting default box has to be chosen anyway, and because the two differ so much in size, the matches look wrong even though they are the best available.

## Jaccard Index

Default boxes and ground truth are matched to each other via the Jaccard index. jaccard() computes the overlap (intersection over union) between every default box and every ground truth; this IoU is the Jaccard index. Default boxes whose overlap > 0.5 are treated as matches, and the matching tells us which ground truth each default box corresponds to.

![](https://upload-images.jianshu.io/upload_images/13575947-73fcb1fd4e5659c2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
def box_size(box): return (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])

def intersection(gt, def_box):
    left_top = torch.max(gt[:, None, :2], def_box[None, :, :2])
    right_bottom = torch.min(gt[:, None, 2:], def_box[None, :, 2:])
    wh = torch.clamp(right_bottom - left_top, min=0)
    return wh[:, :, 0] * wh[:, :, 1]

def jaccard(gt, def_box):
    inter = intersection(gt, def_box)
    union = box_size(gt).unsqueeze(1) + box_size(def_box).unsqueeze(0) - inter
    return inter / union

overlap = jaccard(bb, def_box_bb * sz)
gt_best_overlap, gt_db_idx = overlap.max(1)
db_best_overlap, db_gt_idx = overlap.max(0)
db_best_overlap[gt_db_idx] = 1.1
is_obj = db_best_overlap > 0.5
pos_idxs = np.nonzero(is_obj)[:, 0]
neg_idxs = np.nonzero(1 - is_obj)[:, 0]
db_clas = T([num_classes] * len(db_best_overlap))
db_clas[pos_idxs] = clas[db_gt_idx[pos_idxs]]
db_best_overlap, db_clas
```

db_gt_idx is the id of the ground truth matched to each default box. db_best_overlap is the largest Jaccard overlap each default box achieves. Even a ground truth's best default box does not necessarily satisfy the > 0.5 requirement (see Figure 2), so we proactively raise the overlap of every ground truth's best default box to 1.1. db_clas is the class that this matching assigns to each default box.

## More Default Boxes

Remember SSD's architecture? As the extra feature layers go deeper, the feature map's grid cells grow larger, from 4x4 -> 2x2 -> 1x1, which lets them match objects of more sizes. On top of that, SSD also uses different aspect ratios to create default boxes of the same size but different shapes:

![](https://upload-images.jianshu.io/upload_images/13575947-2cac943e12e99d62.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As the figure shows, default boxes come in three broad shapes: wider than tall, taller than wide, and square. So the aspect ratios I use are [(1., 1.), (1., 0.5), (0.5, 1.)], and each shape gets the scale factors [0.7, 1., 1.3].

```
cells = np.array([4, 2, 1])
center_offsets = 1 / cells / 2
aspect_ratios = [(1., 1.), (1., .5), (.5, 1.)]
zooms = [0.7, 1., 1.3]
scales = [(o * i, o * j) for o in zooms for i, j in aspect_ratios]
k = len(scales)
k, scales

(9,
 [(0.7, 0.7),
  (0.7, 0.35),
  (0.35, 0.7),
  (1.0, 1.0),
  (1.0, 0.5),
  (0.5, 1.0),
  (1.3, 1.3),
  (1.3, 0.65),
  (0.65, 1.3)])
```

k is the number of shape variants each default box spawns from the aspect ratios and scales. If a default box is a camera, k is the number of professional lenses that come with it: different shooting scenes call for different lenses.

![Figure 3: (4x4 + 2x2 + 1x1) * k grid cells](https://upload-images.jianshu.io/upload_images/13575947-425120d2063c003c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As you can see, Figure 3 is much more precise than Figure 2, though of course it also has far more default boxes: (4x4 + 2x2 + 1x1) * k.
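The notebook assembles the full default-box set from `cells` and `scales`. Here is a hedged sketch of that construction; the real ordering has to match how `Outlayer.flatten()` lays out the network's predictions (the k shape variants of a cell kept contiguous), and the notebook may differ in details:

```
boxes = []
for c in cells:                          # 4, 2, 1
    width = 1 / c
    centers = np.linspace(width / 2, 1 - width / 2, c)
    cx, cy = np.repeat(centers, c), np.tile(centers, c)
    for i in range(c * c):               # keep each cell's k variants contiguous
        for sw, sh in scales:
            boxes.append([cx[i], cy[i], width * sw, width * sh])
def_box = T(np.array(boxes))             # (4*4 + 2*2 + 1*1) * 9 = 189 boxes
```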
## Loss Function

SSD's loss function is similar to the one from [Part 1](https://www.jianshu.com/p/8d894605bb06): compute the bounding-box loss (loc loss) and the classification loss (conf loss) separately; their sum is the final loss:

![loss function.png](https://upload-images.jianshu.io/upload_images/13575947-629062249c081da0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The loc loss is the L1 loss between the ground truth and the default boxes after correction by the bounding-box offsets (the SSD model's output). The conf loss is a binary cross entropy.

```
class BCELoss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def get_weight(self, x, t): return None

    def forward(self, x, t):
        x = x[:, :-1]
        one_hot_t = torch.eye(self.num_classes + 1)[t.data.cpu()]
        t = V(one_hot_t[:, :-1].contiguous())
        w = self.get_weight(x, t)
        return F.binary_cross_entropy_with_logits(x, t, w, size_average=False) / self.num_classes

bce_loss_f = BCELoss(num_classes)

def loc_loss(preds, targs):
    return (preds - targs).abs().mean()

def conf_loss(preds, targs):
    return bce_loss_f(preds, targs)
```

BCELoss drops the background class from the predictions because the db_clas built in _ssd_loss() contains a background class that does not belong to the dataset's label set. As for dividing conf_loss by self.num_classes: if binary cross entropy reduces the loss with sum, conf_loss comes out too large; if it reduces with mean, it comes out too small. Either way, training suffers. So, just like the bias initialization earlier, we actively rescale the loss, here by dividing by 20.
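To see why dropping the last column makes sigmoid + BCE handle background gracefully, here is a tiny worked example (made-up class ids, with num_classes = 20 so that index 20 is background):

```
t = torch.LongTensor([4, 14, 20])  # two real classes and one background box
one_hot = torch.eye(21)[t]         # [3, 21] one-hot targets
one_hot[:, :-1]                    # [3, 20]: the background row becomes all zeros,
                                   # i.e. "none of the 20 real classes" -- exactly
                                   # what the sigmoid outputs should learn to predict
```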
```
def offset_to_bb(off, db_bb):
    off = F.tanh(off)
    center = (off[:, :2] / 2) * db_bb[:, 2:] + db_bb[:, :2]
    wh = ((off[:, 2:] / 2) + 1) * db_bb[:, 2:]
    return def_box_to_bb(center, wh)

def _ssd_loss(db_offset, clas, bb_gt, clas_gt):
    bb = offset_to_bb(db_offset, def_box)
    bb_gt = bb_gt.view(-1, 4) / sz
    idxs = np.nonzero(bb_gt[:, 2] > 0)[:, 0]
    bb_gt, clas_gt = bb_gt[idxs], clas_gt[idxs]
    overlap = jaccard(bb_gt, def_box_bb)
    gt_best_overlap, gt_db_idx = overlap.max(1)
    db_best_overlap, db_gt_idx = overlap.max(0)
    db_best_overlap[gt_db_idx] = 1.1
    for i, o in enumerate(gt_db_idx): db_gt_idx[o] = i
    is_obj = db_best_overlap >= 0.5
    pos_idxs = np.nonzero(is_obj)[:, 0]
    neg_idxs = np.nonzero(1 - is_obj.data)[:, 0]
    db_clas = clas_gt[db_gt_idx]
    db_clas[neg_idxs] = len(labels)
    db_bb = bb_gt[db_gt_idx]
    return (loc_loss(bb[pos_idxs], db_bb[pos_idxs]), bce_loss_f(clas, db_clas))

def ssd_loss(preds, targs, print_loss=False):
    # alpha = 1.
    loc_loss, conf_loss = 0., 0.
    for i, (db_offset, clas, bb_gt, clas_gt) in enumerate(zip(*preds, *targs)):
        losses = _ssd_loss(db_offset, clas, bb_gt, clas_gt)
        loc_loss += losses[0]  # * alpha
        conf_loss += losses[1]
    if print_loss:
        print(f'loc loss: {loc_loss:.2f}, conf loss: {conf_loss:.2f}')
    return loc_loss + conf_loss
```

In _ssd_loss(), offset_to_bb() corrects the default boxes according to the predicted bounding-box offsets. The offset values act as scale factors on the default box: they not only shift its position but also change its width and height. Much of the code in _ssd_loss() was explained earlier; its purpose is to rebuild the ground truth on a per-default-box basis, because what we are predicting is the class inside each default box.

## Train 4x4

We have finally reached the training stage. To keep debugging simple, we first train only the 4x4-grid model, using the network defined in "SSD Network Part 1".

```
lr = 1e-2
learn.fit(lr, 1, cycle_len=8, use_clr=(20, 5))
learn.save('16')

epoch      trn_loss   val_loss
    0      33.574218  34.117771
    1      30.093091  29.408577
    2      27.206728  27.568285
    3      25.348878  26.957813
    4      23.976828  26.765239
    5      22.80882   26.695604
    6      21.532631  26.688388
    7      20.018111  26.610572
```

![](https://upload-images.jianshu.io/upload_images/13575947-f47c49ab24ef7447.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The test results show that the default boxes are no longer uniformly arranged, and their sizes now differ slightly. They look messy, but each one is just offset from its original position, as the box numbering shows. Overall, the model's predictions are more accurate than the static default boxes. Next, let's add more default boxes.

## SSD Network Part 2

```
class Outlayer(nn.Module):
    def __init__(self, nf, num_classes, bias):
        super().__init__()
        self.clas_conv = nn.Conv2d(nf, (num_classes + 1) * k, 3, 1, 1)
        self.bb_conv = nn.Conv2d(nf, 4 * k, 3, 1, 1)
        self.clas_conv.bias.data.zero_().add_(bias)

    def flatten(self, x):
        bs, nf, w, h = x.size()
        x = x.permute(0, 2, 3, 1).contiguous()
        return x.view(bs, -1, nf // k)

    def forward(self, x):
        return [self.flatten(self.bb_conv(x)), self.flatten(self.clas_conv(x))]

class SSDHead(nn.Module):
    def __init__(self, num_classes, nf, bias, drop_i=0.25, drop_h=0.1):
        super().__init__()
        self.conv1 = conv_layer(512, nf, stride=1, drop=drop_h)
        self.conv2 = conv_layer(nf, nf, drop=drop_h)  # 4x4
        self.conv3 = conv_layer(nf, nf, drop=drop_h)  # 2x2
        self.conv4 = conv_layer(nf, nf, drop=drop_h)  # 1x1
        self.drop_i = nn.Dropout(drop_i)
        self.out1 = Outlayer(nf, num_classes, bias)
        self.out2 = Outlayer(nf, num_classes, bias)
        self.out3 = Outlayer(nf, num_classes, bias)

    def forward(self, x):
        x = self.drop_i(F.relu(x))
        x = self.conv1(x)
        x = self.conv2(x)
        bb1, clas1 = self.out1(x)
        x = self.conv3(x)
        bb2, clas2 = self.out2(x)
        x = self.conv4(x)
        bb3, clas3 = self.out3(x)
        return [torch.cat([bb1, bb2, bb3], 1),
                torch.cat([clas1, clas2, clas3], 1)]

drops = [0.4, 0.2]
ssd_head_f = SSDHead(num_classes, nf, -4., drop_i=drops[0], drop_h=drops[1])
```

SSD aggregates the predictions of three detectors of different sizes: 4x4, 2x2, and 1x1. Since every default box now has k variants, each detector's output is k times larger than before. Judging from the earlier training results, the model wasn't regularized enough, so I also raise the dropout probabilities.

```
lr = 1e-2
learn.fit(lr, 1, cycle_len=10, use_clr=(20, 10))
learn.save('multi')

epoch      trn_loss   val_loss
    0      87.026507  75.858966
    1      68.657919  62.675859
    2      58.815842  78.257847
    3      53.675965  54.85459
    4      49.656684  53.707109
    5      46.777794  53.003534
    6      44.20865   51.358076
    7      41.394307  51.515281
    8      38.741202  50.559135
    9      36.69472   50.12559
```

![](https://upload-images.jianshu.io/upload_images/13575947-e6caf5a5f83f01e5.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

You can see the bounding boxes are larger than before, which is exactly what we hoped for. But the wine bottle on the table never got boxed. Why? The adjusted default boxes are larger overall than before, and since the bottle is small, its overlap stays below 0.5 and it cannot be localized. The most effective fix is to lower the overlap threshold, for example to 0.4.

## NMS

The last stage of the SSD model is NMS (non-maximum suppression), which prunes bounding boxes that overlap each other beyond a Jaccard overlap threshold, keeping only the best of each cluster. For testing I use an overlap threshold of 0.4 and keep the top 50 bounding boxes.
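The notebook uses a standard NMS routine. Below is a minimal sketch of the greedy algorithm, reusing the `jaccard()` defined earlier on corner-format boxes; the function name `nms` and its exact signature are mine:

```
def nms(boxes, scores, thresh=0.4, top_k=50):
    "Greedily keep the highest-scoring boxes, dropping ones that overlap a kept box."
    keep, idxs = [], scores.argsort(descending=True)
    while len(idxs) > 0 and len(keep) < top_k:
        i = idxs[0]
        keep.append(i.item())
        if len(idxs) == 1: break
        ious = jaccard(boxes[i][None], boxes[idxs[1:]])[0]
        idxs = idxs[1:][ious <= thresh]   # suppress boxes that overlap the winner
    return keep
```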
![](https://upload-images.jianshu.io/upload_images/13575947-8fa43060bac29343.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The result is not what we want: only 1 of the 4 objects was detected. Why is that?

```
x, y = next(iter(md.trn_dl))
yp = predict_batch(learn.model, x)
ssd_loss(yp, y, True)

loc loss: 3.65, conf loss: 28.08
tensor(31.7384, device='cuda:0', grad_fn=)
```

The culprit is the oversized conf_loss: classification accuracy is low. Looking at the network, localization and classification are independent only in the final layer; every other layer is shared. In other words, if classification accuracy is low, localization accuracy won't be much better. In fact, localization depends on classification: recognize first, then localize.

## Focal Loss / [Paper](https://arxiv.org/abs/1708.02002)

![](https://upload-images.jianshu.io/upload_images/13575947-ff420a35c83e81e8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The formula shows that focal loss is a scaled version of cross entropy: $(1 - p_t)^\gamma$ is the modulating factor, controlled by the hyperparameter $\gamma$. In object detection, focal loss performs far better than BCE. The intuition: by scaling the loss up or down, it turns ambivalent predictions into decisive ones. When gamma == 0, focal loss reduces to cross entropy (CE), the blue curve: even at a probability of 0.6, the loss is still >= 0.5, as if the model were saying, "I'm 60% sure this isn't class B. Hmm, I should keep optimizing my parameters; I can do better." When gamma == 2, at the same probability of 0.6 the loss is close to 0, as if it said, "I'm 60% sure this isn't class B, and with my years of case experience, it definitely isn't. Not the most confident call, but case closed. Let's put our energy into the cases where accuracy is still poor."

Focal loss down-weights well-classified examples, shrinking their loss values and hence their parameter updates, and leaves more of the optimization headroom to samples with low predicted probability, improving the model as a whole.

```
class FocalLoss(BCELoss):
    def get_weight(self, x, t):
        alpha, gamma = 0.25, 1
        p = x.sigmoid()
        pt = p*t + (1-p)*(1-t)
        w = alpha*t + (1-alpha)*(1-t)
        return w * (1-pt).pow(gamma)

bce_loss_f = FocalLoss(num_classes)
lr = 1e-2
learn.fit(lr, 1, cycle_len=10, use_clr=(20, 10))
learn.save('focal_loss')

epoch      trn_loss   val_loss
    0      17.30767   18.866698
    1      15.211579  13.772004
    2      13.563804  13.015255
    3      12.589626  12.785115
    4      11.926406  12.28807
    5      11.515744  11.814605
    6      11.109133  11.686357
    7      10.664063  11.424233
    8      10.285392  11.338397
    9      9.935587   11.185435
```

![](https://upload-images.jianshu.io/upload_images/13575947-5b6dfc604ba62249.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As expected: although the detector's confidence on the main object dropped (from 0.77 to 0.5), detection of the other objects improved. Apart from the wine bottle (for the reason analyzed earlier), the other three objects are all detected correctly.

## END

SSD is like a photographer with no talent but great diligence: every shot follows the same routine. Frame the scene, center the lens on the viewfinder, click the shutter. Yet he is also remarkable, tirelessly trying every shooting angle and every viewfinder frame. That concludes my walkthrough of the SSD algorithm. It can be a brain-bending journey, so you'll want to study the [code](https://github.com/alexshuang/pascal-voc-pytorch) alongside the paper.