├── .gitignore ├── README.md ├── dataloader ├── __init__.py ├── dataaugmentor.py ├── dataset.py ├── en_decoder.py ├── getdataloader.py └── utils.py ├── evaluations ├── __init__.py └── voc_eval.py ├── imgs ├── DSOD.png ├── DSSD.png ├── DeformableConv.png ├── DenseBox.png ├── DetectorNet.png ├── ESSD.png ├── Extension_module.png ├── FCN.png ├── FCN_in_test.png ├── FPN.png ├── FSSD.png ├── FaceBoxes.png ├── Fast_R-CNN.png ├── Faster_R-CNN.png ├── FocalLoss.png ├── Instance_segmentation.png ├── Light-Head.png ├── MTCNN.png ├── MaskX.png ├── MaskX_show.png ├── Mask_R-CNN.png ├── R-CNN.png ├── R-FCN.png ├── RFB_module.png ├── ROIAlign.png ├── RetinaNet.png ├── SPP-net.png ├── SSD.png ├── SSD_model.png ├── YOLO.png ├── YOLO9000.png ├── YOLO_Bbox.png ├── YOLO_loss.png ├── YOLOv2.png ├── fc2Conv.png ├── focal_loss.png ├── inference_YOLO.png ├── offset_MaxPooling.png ├── position-sensitive_RoI_pooling.png ├── receptive_field.png └── skip_layers.png ├── models ├── __init__.py ├── loss │ ├── __init__.py │ └── focal_loss.py └── retinaNet │ ├── __init__.py │ ├── fpn.py │ ├── get_state_dict.py │ └── retina.py ├── retina.md └── train_test ├── eval.py └── train_retinanet.py /.gitignore: -------------------------------------------------------------------------------- 1 | output/ 2 | 3 | .idea/ 4 | __pycache__/ 5 | *.pyc 6 | 7 | # data or images 8 | [Dd]ataset/ 9 | [Dd]atasets/ 10 | images/ 11 | 12 | # preTrained models 13 | preTrainedModels/ 14 | 15 | # build 16 | build/ 17 | *.so 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Detector 2 | 使用PyTorch实现了经典的深度学习检测算法: 3 | * [OverFeat](#overfeat)(2013.12) 4 | * [R-CNN](#r-cnn)(2013.11) 5 | * [SPP-net](#spp-net)(2014.6) 6 | * [Fast R-CNN](#fast)(2015.4) 7 | * [Faster R-CNN](#faster)(2015.6) 8 | * [FCN](#fcn)(2014.11) 9 | * [R-FCN](#r-fcn)(2016.5) 10 | * [FPN](#fpn)(2016.12) 11 | * [Mask R-CNN](#mask)(2017.3) 12 | * [Mask^X R-CNN](#maskx)(2017.11) 13 | * [DetectorNet](#detectornet)(2013) 14 | * [DenseBox](#densebox)(2015.9) 15 | * [MTCNN](#mtcnn)(2016.4) 16 | * [FaceBoxes](#faceboxes)(2017.8) 17 | * [YOLO](#yolo)(2015.6) 18 | * [YOLOv2](#yolov2)(2016.12) 19 | * [YOLOv3](#yolov3)(2018.3) 20 | * [SSD](#ssd)(2015.12) 21 | * [DSSD](#dssd)(2017.1) 22 | * [FSSD](#fssd)(2017.12) 23 | * [ESSD](#essd)(2018.1) 24 | * [RFBNet](#rfbnet)(2017.11) 25 | * [DeformableConvNets](#deformableconvnets)(2017.3) 26 | * [DSOD](#dsod)(2017.8) 27 | * [**RetinaNet**](#retinanet)(2017.8) 28 | * [Light-Head R-CNN](#light-head)(2017.11) 29 | 30 | ------ 31 | ## Requisites: 32 | * anaconda 33 | * pytorch-0.3.0 34 | * torchvision 35 | * visdom 36 | 37 | ------ 38 | ## 经典的传统目标检测算法 39 | * Haar + AdaBoost 40 | * 参考论文1:Rapid Object Detection using a Boosted Cascade of Simple Features, 41 | Viola & Jones, 2001 42 | * 参考论文2:Robust Real-Time Face Detection, Viola & Jones, 2002 43 | * 参考论文3:Informed Haar-Like Features Improve Pedestrian Detection, 44 | ShanShan Zhang等, 2014 45 | * LBP + AdaBoost 46 | * 参考论文1:Multiresolution gray-scale and rotation invariant 47 | texture classification with local binary patterns, Ojala等, 2002 48 | * 参考论文2:Learning Multi-scale Block Local Binary Patterns for Face Recognition, 49 | Shengcai Liao, 2007 50 | * 参考论文3:局部二值模式方法研究与展望, 宋克臣, 2013 51 | * HOG + SVM(Cascade) 52 | * 参考论文1:Histograms of Oriented Gradients for Human Detection, 53 | Dalal & Triggs, 2005 54 | * 参考论文2:Fast Human Detection Using a Cascade of Histograms of Oriented 55 | 
Gradients, Qiang Zhu等, 2006 56 | * ACF + AdaBoost 57 | * 参考论文1:Integral Channel Features, Piotr Dollar等, 2009 58 | * 参考论文2:Fast Feature Pyramids for Object Detection, Piotr Dollar等, 2014 59 | * 参考论文3:Local Decorrelation For Improved Detection, Piotr Dollar等, 2014 60 | * DPM 61 | * 参考论文1:A Discriminatively Trained, Multiscale, Deformable Part Model, 62 | Pedro等, 2008 63 | * 参考论文2:Object Detection with Discriminatively Trained Part Based Models, 64 | Pedro & ross等, 2010 65 | * 参考论文3:Visual Object Detection with Deformable Part Models, Pedro & ross等, 66 | 2013 67 | * 参考论文4:Deformable Part Models are Convolutoinal Neural Networks, 68 | ross等, 2015 69 | 本工程主要实现基于深度学习的检测算法,对传统算法感兴趣的同学可以阅读上面列出的论文,或相关博客。 70 | 71 | [返回顶部](#detector) 72 | 73 | ------ 74 | ## 前排膜拜大牛 75 | * Ross Girshick(rbg): [个人主页](http://www.rossgirshick.info/), 主要成就: 76 | * DPM 77 | * R-CNN 78 | * Fast R-CNN 79 | * Faster R-CNN 80 | * YOLO 81 | * Kaiming He(何恺明): [个人主页](http://kaiminghe.com/), 主要成就: 82 | * 2003年广东省理科高考状元 83 | * 图像去雾 84 | * ResNet 85 | * MSRA 初始化 86 | * Group 正则化 87 | * PReLU 88 | * SPPNet 89 | * Faster R-CNN 90 | * Mask R-CNN 91 | * Mask^X R-CNN 92 | * 炉石传说 93 | 94 | [返回顶部](#detector) 95 | 96 | ------ 97 | ## OverFeat 98 | [OverFeat](https://arxiv.org/abs/1312.6229) 99 | 通过一个卷积网络来同时进行分类,定位和检测三个计算机视觉任务。 100 | 101 | ### 基础知识 102 | * 卷积网络在小数据集上作用不明显 103 | * 卷积网络最大的优点是不用人工设计特征;最大的缺点是需要大量标注的数据。 104 | 105 | ### offset max-pooling 106 | ![offset_MaxPooling](./imgs/offset_MaxPooling.png) 107 | * Pooling时每次的起点不一样,Pooling后得到了3\*3\*C个特征图 108 | * 该操作可以作为最后一层Pooling的方法,移除了Poolig操作本应该带来的分辨率损失 109 | 110 | ### FCN in test 111 | ![FCN_in_test](./imgs/FCN_in_test.png) 112 | * 在测试时将全连接层替换成1\*1的卷积层 113 | * 允许测试时输入不同大小的图像,等价与传统的滑动窗口方法,滑动步长取决于Pooling的次数。 114 | 115 | ### 主要创新点 116 | * offset pooling 117 | * 测试时将全连接层替换成1\*1的卷积层 118 | * 测试时使用不同大小图像作为的输出(Multi-Scale) 119 | * 卷积层参数共享: 固定卷积层的参数,将分类层替换层回归层,用于定位和检测。 120 | 121 | [返回顶部](#detector) 122 | 123 | ------ 124 | ## R-CNN 125 | [R-CNN](https://arxiv.org/abs/1311.2524) 126 | 第一次将CNN应用到目标检测上,在目标检测领域取得了巨大突破。 127 | 128 | ### Object detection system overview 129 | ![R-CNN](./imgs/R-CNN.png) 130 | * 候选区域(Region proposals):使用传统的区域提取方法, 131 | 通过滑动不同宽高的窗口获得了2K个潜在的候选区域。 132 | * 使用CNN提取特征:将每个候选区域‘reSize’到固定大小,最终获得了4096维的特征。 133 | * 使用SVM进行分类:每类训练一个SVM进行分类。注,作者测试使用softmax时mAP下降了3.3。 134 | * 位置精修(Bounding-box regression, 边框回归):提升了3-4mAP. 
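The bounding-box regression step above learns, for each proposal, an offset towards its matched ground-truth box. Below is a minimal sketch of the target parameterization (the same form later reused by Fast/Faster R-CNN and by `dataloader/en_decoder.py` in this repo); the function name and the (cx, cy, w, h) box order are illustrative assumptions, not code from the paper:

```python
import torch

def bbox_regression_targets(proposals, gt_boxes):
    '''Regression targets (tx, ty, tw, th) for matched proposal / ground-truth pairs.

    Both inputs are [N, 4] tensors in (cx, cy, w, h) order.
    '''
    t_xy = (gt_boxes[:, :2] - proposals[:, :2]) / proposals[:, 2:]  # normalized center offset
    t_wh = torch.log(gt_boxes[:, 2:] / proposals[:, 2:])            # log-scale size change
    return torch.cat([t_xy, t_wh], 1)
```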
135 | 136 | ### 主要创新点 137 | * 将CNN应用于目标检测 138 | * 训练数据稀缺时,可以先从其他大的数据集进行预训练,然后在小数据集上进行微调(fine-tune) 139 | 140 | [返回顶部](#detector) 141 | 142 | ------ 143 | ## SPP-net 144 | [SPP-net](https://arxiv.org/abs/1406.4729) 145 | 利用空间金字塔池化,使得任意大小的特征图都能够转换成固定大小的特征向量。 146 | 从而解决了CNN的输入必须是固定尺寸的问题,实现了多尺度输入。 147 | 因此SPP-net只需对原图做一次卷积,节省了大量的计算时间,比[R-CNN](#r-cnn)有24~102倍的提速。 148 | 另外,SPP对分类性能也有帮助,获得了2014年imageNet挑战中检测的第二名和分类的第三名。 149 | 另外两个是VGG和GoogLeNet, 150 | 相关内容请参考[Classifier](https://github.com/mandeer/Classifier)工程。 151 | 152 | ### SPPNet structure 153 | ![SPP-net](./imgs/SPP-net.png) 154 | * 使用卷积网络提取特征:每幅图只做一次卷积,而不是每个候选区域做一次卷积运算。 155 | * 将候选区域映射到最后一层的feature map上,然后使用SPP得到固定长度的特征。 156 | * 使用SVM进行分类:同[R-CNN](#r-cnn) 157 | * 边框回归:同[R-CNN](#r-cnn) 158 | 159 | 160 | ### 主要创新点 161 | * 空间金字塔池化(spatial pyramid pooling, SPP):对每个bins使用全局最大值池化, 162 | 得到的特征仅于bins和feature map的个数有关,与feature map的尺寸无关。 163 | 从而解决了CNN的输入必须是固定尺寸的问题,实现了多尺度输入。 164 | * 多尺度输入的模型训练与测试方法:不同尺度输入的模型间参数共享。 165 | 166 | [返回顶部](#detector) 167 | 168 | ------ 169 | ## Fast 170 | [Fast R-CNN](https://arxiv.org/abs/1504.08083) 171 | 把类别判断和边框回归统一到了一个深度网络框架中,首次实现了end-to-end(proposal阶段除外)的训练。 172 | 173 | ### Fast R-CNN architecture 174 | ![Fast_R-CNN](./imgs/Fast_R-CNN.png) 175 | * 输入:整图及一系列候选区域 176 | * 使用卷积网络提取特征 177 | * RoI Pooling:为每个候选区域提取固定长度的特征。 178 | * 分类、边框回归 179 | 180 | ### 主要创新点 181 | * RoI pooling:仅有一层的[SPP](#spp-net)层,多尺度学习能提高一点点mAP,却成倍增加了计算量。 182 | * Fine-tuning方法--分层采样:解决了[R-CNN](#r-cnn)和[SPP-net](#spp-net)训练低效的问题。 183 | * Multi-task loss:Lcls & Lloc共享参数,mAP有约1%的提升。 184 | * Smooth_L1 Loss:比L1 loss更鲁棒,比L2 loss对离群点更不敏感。 185 | 186 | [返回顶部](#detector) 187 | 188 | ------ 189 | ## Faster 190 | [Faster R-CNN](https://arxiv.org/abs/1506.01497) 191 | 提出了RPN(Region Proposal Network), 终于将目标检测的四个基本步骤, 192 | 生成候选区域、特征提取、分类、边框回归统一到一个深度网络框架之中。 193 | Faster R-CNN的PyTorch代码可以参考 194 | [这里](https://github.com/chenyuntc/simple-faster-rcnn-pytorch) 195 | 196 | ### Faster R-CNN architecture 197 | ![Faster_R-CNN](./imgs/Faster_R-CNN.png) 198 | * 输入:整图 199 | * 通过RPN网络得到proposal boxes 200 | * 使用NMS(非最大值抑制)降低冗余 201 | * 检测class得分比较高的候选区域 202 | 203 | ### 主要创新点 204 | * Region Proposal Networks: 因为与Fast R-CNN共享特征,所以RPN几乎不消耗计算资源。 205 | 又因为RPN可以提高候选区域的质量,故提高了检出率。 206 | * 候选区域、锚点(Anchors): 多尺度锚点解决了待检测目标拥有不同尺度和宽高比例的问题。 207 | * RPN和Fast R-CNN共享特征的训练方法: 208 | * 从预训练模型W0开始,训练RPN,得到W1 209 | * 使用W1得到的候选区域及于训练模型W0,训练Fast R-CNN,得到W2 210 | * 使用W2,训练RPN,但固定前面的共享层,仅微调RPN独有的网络层,得到W3 211 | * 使用W3,训练Fast R-CNN,同样固定前面的共享层,仅训练Fast R-CNN独有的层,得到最终的W4 212 | * 重复上述过程得到的改进不大。 213 | 214 | [返回顶部](#detector) 215 | 216 | ------ 217 | ## FCN 218 | [FCN](https://arxiv.org/abs/1605.06211) 219 | 提出了一种end-to-end、pixels-to-pixels的语义分割(Semantic Segmentation)方法, 220 | 是将CNN结构应用到图像语义分割领域并取得突出结果的开山之作, 221 | 因而拿到了CVPR 2015年的best paper honorable mention. 
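The 卷积化(convolutionalization) idea described in the subsections below is easy to see in code: a fully connected classifier can be rewritten as an equivalent convolution, so the network accepts inputs of any size and emits a dense map of scores instead of a single prediction. A minimal sketch; the layer sizes (512-channel 7×7 features, 10 classes) are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 512-channel 7x7 feature map feeding a 10-way classifier.
fc = nn.Linear(512 * 7 * 7, 10)

# Equivalent convolution: reshape the FC weights into a 7x7 kernel, so the same
# computation now slides over feature maps of any spatial size.
conv = nn.Conv2d(512, 10, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(10, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 512, 14, 14)   # a larger input than the FC layer could accept
scores = conv(x)                  # [1, 10, 8, 8] dense map of class scores
```

Applied to every classifier layer, this is what lets FCN (and OverFeat at test time) make per-location predictions in a single forward pass.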
222 | 223 | ### FCN architecture 224 | ![FCN](./imgs/FCN.png) 225 | * 使用语义分割的ground truth作为监督信息,训练了一个端到端、点到点的网络。 226 | 227 | ### 卷积化(convolutionalization) 228 | ![fc2Conv](./imgs/fc2Conv.png) 229 | * 将全连接层替换成卷积层,因此FCN可以接受任意尺寸的输入图像从而进行密集预测。 230 | 231 | ### 跳跃结构(skip layers) 232 | ![skip_layers](./imgs/skip_layers.png) 233 | * 使用反卷积(转置卷积)和跳跃结构,融合深层粗略的全局信息和浅层精细的局部信息。 234 | * 全局信息解决的“是什么”,而局部信息解决的是“在哪里” 235 | 236 | ### 主要创新点 237 | * 卷积化 238 | * 使用反卷积进行上采样 239 | * 使用跳跃结构融合深层和浅层的特征 240 | 241 | [返回顶部](#detector) 242 | 243 | ------ 244 | ## R-FCN 245 | [R-FCN](https://arxiv.org/abs/1605.06409) 246 | 使用位置敏感得分图(position-sensitive score maps), 247 | 解决了图像分类(平移不变性)和物体检测(平移变换性)两者间的矛盾, 248 | 从而解决了[Faster R-CNN](#faster)中部分卷积层无法共享计算的问题。 249 | 250 | ### R-FCN architecture 251 | ![R-FCN](./imgs/R-FCN.png) 252 | * 在ROI层之后,没有可学习的层,从而加快了训练和测试的速度 253 | 254 | ### position-sensitive RoI pooling 255 | ![pooling](./imgs/position-sensitive_RoI_pooling.png) 256 | 257 | ### 主要创新点 258 | * 位置敏感得分图 259 | * 使用1\*1的卷积得到K^2(C+1)维的位置敏感得分图 260 | * 位置敏感的ROI Pooling:具体的pooling细节参考上图 261 | * 投票(vote, 均值)得到C+1维的向量 262 | * softmax or bbox regression(4K^2) 263 | 264 | [返回顶部](#detector) 265 | 266 | ------ 267 | ## FPN 268 | [FPN](https://arxiv.org/abs/1612.03144) 269 | 提出了一种简单的在卷积网络内部构建特征金字塔的框架,即使卷积网络对目标的尺度变化有很强的鲁棒性, 270 | 特征金字塔仍然可以显著的改进原始网络的性能。 271 | 272 | 273 | ### FPN architecture 274 | ![FPN](./imgs/FPN.png) 275 | * Bottom-up pathway: 骨干网络,以卷积和降采样的方式提取特征 276 | * Top-down pathway: 上采样深层粗粒度特征,提高深层特征的分辨率 277 | * lateral connections: 融合浅层特征和深层特征 278 | 279 | ### 主要创新点 280 | * 特征金字塔:低层的特征语义信息比较少,但是目标位置准确; 281 | 高层的特征语义信息比较丰富,但是目标位置比较粗略。 282 | 二者联合得到了在不同分辨率都拥有丰富语义特征的特征金字塔。 283 | 284 | [返回顶部](#detector) 285 | 286 | ------ 287 | ## Mask 288 | [Mask R-CNN](https://arxiv.org/abs/1703.06870) 289 | 通过在[Faster R-CNN](#faster)基础上添加了一个用于预测目标掩模的新分支(mask branch), 290 | 在没有增加太多计算量,且没有使用各种trick的前提下,在COCO的一系列挑战任务 291 | (instance segmentation, object detection & person keypoint detection)中 292 | **都**取得了领先的结果。 293 | 作者开源了caffe2的[Mask R-CNN代码](https://github.com/facebookresearch/Detectron) 294 | 295 | ### 什么是实例分割 296 | ![Instance_segmentation](./imgs/Instance_segmentation.png) 297 | 298 | ### Mask R-CNN 框架 299 | ![Mask_R-CNN](./imgs/Mask_R-CNN.png) 300 | * 在Faster R-CNN的第二级上添加了与class和bbox并行的mask分支。 301 | * multi-task loss: L = Lcls + Lbox + Lmask 302 | 303 | ### ROIAlign 304 | ![ROIAlign](./imgs/ROIAlign.png) 305 | * 对feature map进行线性插值后再使用Pooling, 306 | ROIPooling的量化操作(rounding)会使mask与实际物体位置有一个微小的偏移(8 pixel) 307 | 308 | ### 主要创新点 309 | * mask分支:mask任务对分类和检测性能有帮助。 310 | * [ROIAlign](#roialign): ROI校准,解决了mask的偏移问题。同时对检测性能也有提升。 311 | * Lmask: 逐像素 sigmoid 的平均值,每类单独产生一个mask,依靠class分支获取类别标签。 312 | 将掩模预测和分类预测拆解,没有引入类间竞争,从而大幅提高了性能。 313 | 314 | [返回顶部](#detector) 315 | 316 | ------ 317 | ## MaskX 318 | [Learning to Segment Every Thing](https://arxiv.org/abs/1711.10370) 319 | 是指使用只有部分类别标注了mask label(但所有类别都标注了bbox label)的数据, 320 | 训练出可以分割所有类别(包括没有mask标注的类)的模型。 321 | 利用迁移学习的思想,通过在[Mask R-CNN](#mask)的原架构上添加了一个 322 | 权重传递函数(weight transfer function)实现了这一目标。 323 | 324 | ### 分割示例 325 | ![MaskX_show](./imgs/MaskX_show.png) 326 | * 图中绿框表示有mask标注的类,红框表示只有bbox标注的类 327 | * 可以很方便的从mask转换成bbox。反过来呢,提取的BBox特征是否对mask也有帮助? 
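The first direction really is trivial: a tight box is just the extent of the mask's foreground pixels. A minimal sketch, assuming a binary [H, W] mask tensor:

```python
import torch

def mask_to_bbox(mask):
    '''Tight (xmin, ymin, xmax, ymax) box around the foreground of a binary [H, W] mask.'''
    idx = mask.nonzero()                  # [N, 2] of (y, x) foreground coordinates
    ys, xs = idx[:, 0], idx[:, 1]
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The reverse direction — whether box features can help the mask head — is exactly what the weight transfer function below addresses.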
328 | 329 | ### Mask^X R-CNN method 330 | ![MaskX](./imgs/MaskX.png) 331 | * 设A类有mask和bbox的标注,B类仅有bbox的标注 332 | * 使用A和B共同训练标准的目标检测(注意,A和B的训练需要是同质的) 333 | * 仅使用A训练mask和权重传递函数 334 | * 在推理时,权重传递函数用于预测每个类别的实例分割参数, 335 | 从而使模型能够分割所有目标的类别。 336 | 337 | ### 主要创新点 338 | * 开创了了一个令人兴奋的新的大规模实例分割的研究方向。 339 | * 权重传递函数:链接了bbox和mask,将box head的特征迁移到mask head中, 340 | 这样对于缺乏mask ground-truth的类别,只要有box ground-truth,依然可以进行有效分割。 341 | * 结合MLP和FCN:FCN更多的关注于细节,而MLP可以提取全局(主要)特征。 342 | 343 | [返回顶部](#detector) 344 | 345 | ------ 346 | ## DetectorNet 347 | [DetectorNet](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf) 348 | 将目标检测看做是一个回归问题,证明了基于DNN的目标掩码回归也可以捕捉到强烈的几何信息。 349 | 350 | ### DetectorNet框架 351 | ![DetectorNet](./imgs/DetectorNet.png) 352 | * 输出是full, left, right, top, bottom共5个mask, 其中后四个是半框, 图1仅显示了其中3个。 353 | * scale1: multi-scale + 滑动窗口(子窗口) 354 | * scale2: 精细化调整, bboxs放大1.2倍后,再跑一遍 355 | 356 | ### 主要创新点 357 | * 目标函数添加正则化约束: 解决正负标签不均衡问题。 358 | * 5个Mask: full, left, right, top, bottom 359 | * 解决目标重叠问题 360 | * multi-scale + 精细化调整(refinement): 361 | * 解决由于mask比较小, 无法精确定位的问题。 362 | 363 | [返回顶部](#detector) 364 | 365 | ------ 366 | ## DenseBox 367 | [DenseBox](https://arxiv.org/abs/1509.04874) 368 | 使用全卷积网络实现了end-to-end的目标检测。 369 | 370 | ### DenseBox 架构 371 | ![DenseBox](./imgs/DenseBox.png) 372 | * 输入: m\*n\*c的图像, 输出: m/4 \* n/4 \* 5的feature map; 373 | (5维分别表示与4条边的距离, 以及置信度) 374 | * 全卷积, 常规卷积网络之后执行上采样 375 | * 测试时使用图像金字塔输入 376 | 377 | ### 主要创新点 378 | * end-to-end的FCN实现目标检测 379 | * 仅使用包含目标(背景足够)的图像块进行训练(240的图像块,人脸占中间的50个像素) 380 | * 5通道输出, 每个像素都表示了一个对象 381 | * 多尺度特征融合 382 | * 样本均衡 383 | * 关键点检测任务有助于检测性能的提升 384 | 385 | [返回顶部](#detector) 386 | 387 | ------ 388 | ## MTCNN 389 | [MTCNN](https://arxiv.org/abs/1604.02878v1) 390 | 使用级联的CNN, 实现了实时(CPU)的人脸检测的人脸关键点的回归。 391 | 392 | ### MTCNN 级联架构 393 | ![MTCNN](./imgs/MTCNN.png) 394 | * 图像金字塔输入 395 | * PNet(Proposal, FCN, 滑动窗口): 浅层CNN, 快速生成候选窗口 396 | * RNet(Refine): 略复杂的CNN, 快速过滤候选窗口 397 | * ONet(Output): 强大的CNN, 输出bbox和关键点 398 | 399 | ### 主要创新点 400 | * 实时的人脸检测及关键点回归方案 401 | * 关键点检测有助于人脸检测的性能 402 | * Multi-source training 403 | * Online Hard sample mining 404 | 405 | [返回顶部](#detector) 406 | 407 | ------ 408 | ## FaceBoxes 409 | [FaceBoxes](https://arxiv.org/abs/1708.05234) 410 | 是另一种可以在CPU上做到实时的人脸检测算法, 且该算法的运行速度与人脸个数无关。 411 | 412 | ### FaceBoxes 网络结构 413 | ![FaceBoxes](./imgs/FaceBoxes.png) 414 | * RDCL: 快速降低feature map的大小, 以加快CNN前向运行加速 415 | * MSCL: 通过Inception模块和multi-scale feature maps获得不同大小的感受野 416 | 417 | ### 主要创新点 418 | * RDCL(Rapidly Digested ConvolutionalLayers): 419 | * Shrinking the spatial size of input: large stride sizes 420 | * Choosing suitable kernel size 421 | * Reducing the number of output channels: C.ReLU 422 | * MSCL(Multiple Scale Convolutional Layers): 423 | * Multi-scale design along the dimension of network depth: 424 | multi-scale feature maps 425 | * Multi-scale design along the dimension of network width: 426 | inception module 427 | * Anchor densification strategy: 428 | * 增加小人脸的采样密度 429 | 430 | [返回顶部](#detector) 431 | 432 | ------ 433 | ## YOLO 434 | [YOLO](https://arxiv.org/abs/1506.02640) 435 | 将目标检测任务看作目标区域预测和类别预测的回归问题, 436 | 采用单个神经网络直接预测目标边界和类别概率,实现端到端的目标检测, 437 | 是第一个基于CNN的实时通用目标检测系统。 438 | 439 | ### You Only Look Once 440 | ![YOLO](./imgs/YOLO.png) 441 | * 图像被分割成SxS个网格 442 | * 每个网格单元预测B(=2)个边界框 443 | * 这些预测被编码为S×S×(Bx5+C)的张量 444 | 445 | ### YOLO 损失函数 446 | ![YOLO_loss](./imgs/YOLO_loss.png) 447 | * lambda coord=5, lambda noobj=0.5: 增加了边界框坐标预测损失,减少不包含目标边界框的置信度预测损失 448 | * 边界框宽度和高度的平方根: 大盒子小偏差的重要性不如小盒子小偏差的重要性 449 
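A quick numeric check of the square-root trick above (the numbers are made up for illustration): the same 5-pixel width error costs far less on a large box than on a small one, whereas a plain squared error would charge both 25.

```python
import math

def sqrt_width_term(w_pred, w_true):
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2

print(sqrt_width_term(105, 100))   # large box, 5 px off -> ~0.06
print(sqrt_width_term(15, 10))     # small box, 5 px off -> ~0.51
```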
| 450 | ### 主要创新点 451 | * end-to-end的实时通用目标检测系统: 将目标检测的流程统一为单个神经网络(one stage) 452 | * 隐式编码了上下文信息: 假阳性概率比Fast-RCNN低,可以提升Fast-RCNN 453 | * DackNet24网络架构 454 | * 目标检测通常需要细粒度的视觉信息,将输入分辨率从224×224变为448×448 455 | * 优化了平方和误差(Sum-squared error) 456 | 457 | ### 优缺点 458 | * 优点: 459 | * 速度快: Fast YOLO超过150fps 460 | * 背景误检(假阳性)的概率比Fast-RCNN低 461 | * 缺点: 462 | * 检出率相对较低,特别是(密集)小目标。 463 | * 容易产生定位错误 464 | 465 | [返回顶部](#detector) 466 | 467 | ------ 468 | ## YOLOv2 469 | [YOLOv2](https://arxiv.org/abs/1612.08242) 470 | 是对[YOLO](#yolo)的改进,在保持原有速度的同时提升了精度。 471 | 同时,作者还提出了一种目标分类与检测的联合训练方法, 472 | 同时在COCO和ImageNet数据集中进行训练得到**YOLO9000**, 473 | 实现了9000多类物体的实时检测。 474 | 475 | ### The path from YOLO to YOLOv2 476 | ![YOLOv2](./imgs/YOLOv2.png) 477 | 478 | ### WordTree 479 | ![YOLO9000](./imgs/YOLO9000.png) 480 | * 分层树: 在WordNet中, 大多数同义词只有一个路径, 481 | 因此首先把这条路径中的词全部都加到分层树中。 482 | 接着迭代地检查剩下的名词, 并尽可能少的把他们添加到分层树上, 483 | 添加的原则是取最短路径加入到树中。 484 | * 为什么没有采用多标签模型(multi-label model)?? 485 | 486 | ### 主要创新点 487 | * 对YOLO的一系列分析和改进: 参见 The path from YOLO to YOLOv2 488 | * Darknet-19 489 | * 预训练(224x224)后,使用高分辨率图像(448×448)对模型进行fine-tune 490 | * 每个author输出125维(5\*(5+20), VOC) 491 | * 可以根据需要调整检测准确率和检测速度 492 | * WordTree 493 | * 联合训练分类和检测: 494 | * 使用目标检测数据训练模型学习定位和检测目标; 再使用分类数据去扩展模型对多类别的识别能力。 495 | * 对于仅有类别信息的图像, 只会反向传播分类部分的损失 496 | * 使用WordTree可以把多个数据集整合在一起 497 | * WordTree 与 分层分类 与 multiple softmax 498 | * 计算某一结点的绝对概率,需要对这一结点到根节点的整条路径的所有概率进行相乘 499 | 500 | [返回顶部](#detector) 501 | 502 | ------ 503 | ## YOLOv3 504 | [YOLOv3](https://pjreddie.com/publications/) 505 | 对[YOLO](#yolo)又做了一些更新, 使其变的更好。 506 | 507 | ### 性能对比 508 | ![inference_YOLO](./imgs/inference_YOLO.png) 509 | 510 | ### Bounding boxes 511 | ![YOLO_Bbox](./imgs/YOLO_Bbox.png) 512 | 513 | ### The Deal 514 | * Bounding Box Prediction: 515 | * dimension clusters as anchor boxes 516 | * 每个bbox通过逻辑回归预测一个是否存在目标的得分 517 | * Class Prediction 518 | * using multi-label classification(binary cross-entropy loss) for each box 519 | * Predictions Across Scales 520 | * 3 different scales similar to [FPN](#fpn) 521 | * NxNx(3x(4+1+80)) per scale with COCO 522 | * 使用聚类得到9个bbox priors 523 | * Feature Extractor 524 | * Darknet-53 525 | 526 | ### 不足 527 | * AP50时表现非常好, 但是, 当IOU的阈值增加时, 效果不如[RetinaNet](#retinanet) 528 | * 大目标检测APl, 性能变差了 529 | 530 | [返回顶部](#detector) 531 | 532 | ------ 533 | ## SSD 534 | [SSD](https://arxiv.org/abs/1512.02325) 535 | 是另一个常用的基于CNN的实时通用目标检测系统, 且其速度快过[YOLO](#yolo), 536 | 精度与[Faster R-CNN](#faster)持平。 537 | 538 | ### SSD framework 539 | ![SSD](./imgs/SSD.png) 540 | * 默认框: 541 | * nulti-scale feature maps 542 | * each location 543 | * different aspect ratios 544 | * model loss = localization loss(Smooth L1) + confidence loss(Softmax) 545 | 546 | ### SSD model 547 | ![SSD_model](./imgs/SSD_model.png) 548 | * base network + auxiliary structure 549 | * Multi-scale feature maps for detection 550 | * (c+4)kmn outputs for a mxn feature map 551 | 552 | ### 主要创新点 553 | * 使用小卷积滤波器来预测特征图上固定的一组默框的类别分数和位置偏移 554 | * 使用不同尺度的特征图和不同的宽高比来检测不同大小和形状的目标 555 | * 速度快: 去掉了挑选候选框和之后的特征(或像素)重采样 556 | * Convolutional predictors for detection: YOLO使用的是全连接层 557 | * 每一个目标至少有一个默认框: 不同于MultiBox(每一个目标只有一个默认框) 558 | * Hard negative mining: 正负样本比为1:3 559 | 560 | ### 不足 561 | * 小目标检出率比较低 562 | * SSD有较小的定位误差, 但是易混淆相似类别的对象 563 | 564 | [返回顶部](#detector) 565 | 566 | ------ 567 | ## DSSD 568 | [DSSD](https://arxiv.org/abs/1701.06659) 569 | 通过使用反卷积增加了大量的上下文信息, 提高了上下文相关联目标的检出率, 570 | 且改善了原始[SSD](#ssd)对小目标检测效果不好的问题。 571 | 572 | ### DSSD vs. 
SSD 573 | ![DSSD](./imgs/DSSD.png) 574 | * 基准网络从 VGG 变成 Residual-101 575 | * 添加了Prediction module 和 Deconvolutional module 576 | 577 | ### 主要创新点 578 | * 更好的基准网络: Residual-101 579 | * Prediction module: 改善子任务的子网络 580 | * Deconvolution Module: 与[FPN](#fpn)的head略有不同 581 | * 通过改写卷积层的weight和bias去除Batch Norm操作 582 | 583 | [返回顶部](#detector) 584 | 585 | ------ 586 | ## FSSD 587 | [FSSD](https://arxiv.org/abs/1712.00960) 588 | 通过融合不同尺度的特征,在速度损失很少的情况下,大幅提升了性能。 589 | 590 | ### FSSD vs. SSD 591 | ![FSSD](./imgs/FSSD.png) 592 | * 使用双线性插值(bilinear interpolation)进行上采样 593 | * 使用串联(concatenation)的方式合并不同的特征图 594 | * 使用融合后的特征图构建特征金字塔 595 | 596 | ### 主要创新点 597 | * 新的特征融合框架: 不同与[DSSD](#dssd)和[FPN](#fpn) 598 | * 小目标检测优于SSD 599 | * 降低了检测出多个或部分物体(multi-part of one object)的概率 600 | 601 | [返回顶部](#detector) 602 | 603 | ------ 604 | ## ESSD 605 | [ESSD](https://arxiv.org/abs/1801.05918) 606 | 提出了一种新的不同尺度特征融合的方法,在尽量小损失速度的前提下,提升[SSD](#ssd)的精度。 607 | 608 | ### ESSD framework 609 | ![ESSD](./imgs/ESSD.png) 610 | * 仅前三个预测层使用了Extension module 611 | * Conv4_3, Conv7, Conv8_2 and Conv9_2 can receive gradient backpropagation from multiple layers 612 | ### Extension module 613 | ![Extension module](./imgs/Extension_module.png) 614 | 615 | ### 主要创新点 616 | * Extension module 617 | * 加权平均深度(Weighted average depth): 预测层的加权平均深度间的差异不应太大 618 | 619 | [返回顶部](#detector) 620 | 621 | ------ 622 | ## RFBNet 623 | [RFBNet](https://arxiv.org/abs/1711.07767) 624 | 借鉴人类视觉的感受野结构(Receptive Fields, RFs), 提出了RF Block (RFB) module, 625 | 然后将RFB module集成进了[SSD](#ssd)结构。 626 | 作者提供了[源码](https://github.com/ruinmessi/RFBNet) 627 | 628 | ### RFB module 629 | ![RFB_module](./imgs/RFB_module.png) 630 | * multiple branches with different kernels 631 | * 扩张卷积(空洞卷积) 632 | 633 | ### 主要创新点 634 | * RFB module 635 | * RFB Net: 嵌入了RFB模块的SSD 636 | 637 | 638 | [返回顶部](#detector) 639 | 640 | ------ 641 | ## DeformableConvNets 642 | [Deformable ConvNets](https://arxiv.org/abs/1703.06211) 643 | 提出了可变形卷积, 大大增强了CNN的几何变换建模能力。 644 | 证明了在CNN中学习密集的空间变换是可行和有效的。 645 | 646 | ### Deformable Conv 647 | ![DeformableConv](./imgs/DeformableConv.png) 648 | * 增加模块中的空间采样位置以及额外的偏移量,并且从目标任务中学习偏移量,且不需要额外的监督 649 | * 可变形卷积和与普通卷积有相同的输入和输出,可以很容易地进行替换 650 | * 可变形卷积能很容易地通过标准反向传播进行端到端的训练 651 | 652 | ### 可变形卷积的感受野 653 | ![receptive_field](./imgs/receptive_field.png) 654 | * 可变形卷积可以根据目标的尺寸和形状进行自适应调整 655 | * 增强了对非刚性物体的表达能力 656 | 657 | ### 主要创新点 658 | * deformable convolution 659 | * deformable RoI pooling 660 | 661 | [返回顶部](#detector) 662 | 663 | ------ 664 | ## DSOD 665 | [DSOD](https://arxiv.org/abs/1708.01241) 666 | 是首个从零开始学习并且获得高精度的目标检测算法。 667 | 668 | ### DSOD vs. 
SSD 669 | ![DSOD](./imgs/DSOD.png) 670 | * 骨干网络(backbone sub-network): DenseNets的变体 671 | * 前端子网(front-end sub-network): Dense Prediction Structure 672 | 673 | ### 使用预训练模型 674 | * 优点 675 | * 有许多公开发布的先进模型 676 | * 重用已训练好的模型更方便且节省训练时间 677 | * 缓解目标检测任务标注数据较少的问题 678 | * 缺点 679 | * 网络结构设计不够灵活 680 | * 学习偏差(Learning bias): 分类和检测任务之间的损失函数和类别分布都不相同 681 | * 域不匹配(Domain mismatch): 深度图像, 医学图像, 多光谱图像等 682 | 683 | ### 从0开始训练目标检测器的原则 684 | * Proposal-free: Roi Pooling阻碍了梯度的传播 685 | * Deep Supervision: dense layer-wise connection 686 | * Dense Prediction Structure: 687 | * Learning Half and Reusing Half 688 | 689 | [返回顶部](#detector) 690 | 691 | ------ 692 | ## RetinaNet 693 | [RetinaNet](https://arxiv.org/abs/1708.02002) 694 | 提出了Focal Loss, 降低分类清晰的样本损失的权重, 695 | 从而解决了one-stage检测器中正负样本失衡的问题。 696 | 本工程实现的RetinaNet主要参考了 697 | [这里](https://github.com/kuangliu/pytorch-retinanet) 698 | 699 | ### Focal Loss 700 | ![FocalLoss](./imgs/FocalLoss.png) 701 | * 动态缩放的交叉熵损失函数: 随着正确分类置信度的增加,函数中的比例因子逐渐缩减至零 702 | * 自动地减小简单样本的影响, 并快速地将模型的注意力集中在困难样本上 703 | * Focal Loss函数的确切形式并不重要 704 | * 这个曲线看起来不太好, 应该有其他比较好的函数表示?? 705 | 706 | ### RetinaNet 网络结构 707 | ![RetinaNet](./imgs/RetinaNet.png) 708 | * FPN Backbone 709 | * 使用密集的 Anchors: 9 anchors per level 710 | * 分类子网和框回归子网共享结构, 但却使用各自的参数 711 | 712 | ### 主要创新点 713 | * one-stage目标检测率低的很大一部分原因来自类别不平衡 714 | * Focal Loss: 解决了one-stage目标检测的类别不平衡问题 715 | * RetinaNet: 速度与精度 716 | 717 | [返回顶部](#detector) 718 | 719 | ------ 720 | ## Light-Head 721 | [Light-Head R-CNN](https://arxiv.org/abs/1711.07264) 722 | 通过精心的设计,在**速度**和精度上都超过了One-Stage Object Detector, 723 | 捍卫了Two-Stage Object Detector。 724 | 725 | ### Light-Head R-CNN 框架 726 | ![Light-Head](./imgs/Light-Head.png) 727 | * large separable convolution: k = 7 - 15 728 | * “thin” feature maps before RoI warping 729 | * a single fully-connected layer for prediction 730 | 731 | ### 主要创新点 732 | * Two-Stage Object Detector速度慢的原因 733 | * heavy-head: head 会运行很多次 734 | * single-stage detector也有缺点: 需要对每个anchor进行分类 735 | * Two-Stage Object Detector加速方法 736 | * 把ROI Pooling的feature map变得特别薄 737 | * 预测部分使用一个全连接层(Faster R-CNN是两个) 738 | 739 | [返回顶部](#detector) 740 | 741 | ------ 742 | 更多有关目标检测的论文,请参考 743 | [这里](https://handong1587.github.io/deep_learning/2015/10/09/object-detection.html) 744 | 想要查看VOC2012的排行榜请点击 745 | [这里](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4) 746 | 747 | 748 | 749 | -------------------------------------------------------------------------------- /dataloader/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | from .dataset import DataSet 4 | from .getdataloader import get_data_loader 5 | from .en_decoder import RetinaBoxCoder -------------------------------------------------------------------------------- /dataloader/dataaugmentor.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import random 4 | from PIL import Image 5 | import torch 6 | 7 | 8 | class DataAugmentor(object): 9 | def __init__(self, imgSize=640): 10 | if isinstance(imgSize, (int, float)): 11 | self.imgW = imgSize 12 | self.imgH = imgSize 13 | else: 14 | self.imgW = imgSize[0] 15 | self.imgH = imgSize[1] 16 | 17 | def pad(self, img): 18 | '''Pad image with zeros to the specified size. 19 | 20 | Args: 21 | img: (PIL.Image) image to be padded. 22 | 23 | Returns: 24 | img: (PIL.Image) padded image. 
25 | 26 | Reference: 27 | `tf.image.pad_to_bounding_box` 28 | ''' 29 | w, h = img.size 30 | canvas = Image.new('RGB', (self.imgW, self.imgH)) 31 | canvas.paste(img, (0, 0)) # paste on the left-up corner 32 | return canvas 33 | 34 | def random_flip(self, img, boxes): 35 | '''Randomly flip PIL image. 36 | 37 | If boxes is not None, flip boxes accordingly. 38 | 39 | Args: 40 | img: (PIL.Image) image to be flipped. 41 | boxes: (tensor) object boxes, sized [#obj,4]. 42 | 43 | Returns: 44 | img: (PIL.Image) randomly flipped image. 45 | boxes: (tensor) randomly flipped boxes. 46 | ''' 47 | if random.random() < 0.5: 48 | img = img.transpose(Image.FLIP_LEFT_RIGHT) 49 | w = img.width 50 | if boxes is not None: 51 | xmin = w - boxes[:, 2] 52 | xmax = w - boxes[:, 0] 53 | boxes[:, 0] = xmin 54 | boxes[:, 2] = xmax 55 | return img, boxes 56 | 57 | def resize(self, img, boxes, max_size=1000, random_interpolation=False): 58 | '''Resize the input PIL image to given size. 59 | 60 | If boxes is not None, resize boxes accordingly. 61 | 62 | Args: 63 | img: (PIL.Image) image to be resized. 64 | boxes: (tensor) object boxes, sized [#obj,4]. 65 | size: (tuple or int) 66 | - if is tuple, resize image to the size. 67 | - if is int, resize the shorter side to the size while maintaining the aspect ratio. 68 | max_size: (int) when size is int, limit the image longer size to max_size. 69 | This is essential to limit the usage of GPU memory. 70 | random_interpolation: (bool) randomly choose a resize interpolation method. 71 | 72 | Returns: 73 | img: (PIL.Image) resized image. 74 | boxes: (tensor) resized boxes. 75 | 76 | Example: 77 | >> img, boxes = resize(img, boxes, 600) # resize shorter side to 600 78 | >> img, boxes = resize(img, boxes, (500,600)) # resize image size to (500,600) 79 | >> img, _ = resize(img, None, (500,600)) # resize image only 80 | ''' 81 | w, h = img.size 82 | sw = float(self.imgW) / w 83 | sh = float(self.imgH) / h 84 | 85 | method = random.choice([ 86 | Image.BOX, 87 | Image.NEAREST, 88 | Image.HAMMING, 89 | Image.BICUBIC, 90 | Image.LANCZOS, 91 | Image.BILINEAR]) if random_interpolation else Image.BILINEAR 92 | img = img.resize((self.imgW, self.imgH), method) 93 | if boxes is not None: 94 | boxes = boxes * torch.FloatTensor([sw, sh, sw, sh]) 95 | return img, boxes 96 | -------------------------------------------------------------------------------- /dataloader/dataset.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | from __future__ import print_function 4 | 5 | import os 6 | import sys 7 | import random 8 | 9 | import torch 10 | import torch.utils.data as data 11 | import torchvision.transforms as transforms 12 | 13 | import numpy as np 14 | from PIL import Image 15 | 16 | 17 | class Label(object): 18 | def __init__(self): 19 | self.imgName = '' 20 | self.bboxes = [] 21 | self.labels = [] 22 | 23 | 24 | class DataSet(data.Dataset): 25 | ''' Load image, labels, boxes from a list file. 26 | The list file is like: 27 | a.jpg xmin ymin xmax ymax label xmin ymin xmax ymax label ... 28 | ''' 29 | def __init__(self, root, list_file, transform=None): 30 | ''' 31 | Args: 32 | root: (str) ditectory to images. 33 | list_file: (str/[str]) path to index file. 34 | transform: (function) image/box transforms. 35 | ''' 36 | self.root = root 37 | self.transform = transform 38 | self.dataes = [] 39 | 40 | if isinstance(list_file, list): 41 | # Cat multiple list files together. 42 | # This is especially useful for voc07/voc12 combination. 
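            # Note: the concatenation below shells out to `cat`, so it assumes a POSIX shell and a writable /tmp.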
43 | tmp_file = '/tmp/listfile.txt' 44 | os.system('cat %s > %s' % (' '.join(list_file), tmp_file)) 45 | list_file = tmp_file 46 | 47 | with open(list_file) as file: 48 | lines = file.readlines() 49 | self.num_imgs = len(lines) 50 | 51 | for line in lines: 52 | data = Label() 53 | splited = line.strip().split() 54 | data.imgName = splited[0] 55 | num_boxes = (len(splited) - 1) // 5 56 | for i in range(num_boxes): 57 | xmin = splited[1+5*i] 58 | ymin = splited[2+5*i] 59 | xmax = splited[3+5*i] 60 | ymax = splited[4+5*i] 61 | c = splited[5+5*i] 62 | data.bboxes.append([float(xmin),float(ymin),float(xmax),float(ymax)]) 63 | data.labels.append(int(c)) 64 | self.dataes.append(data) 65 | 66 | def __getitem__(self, idx): 67 | ''' Load image. 68 | 69 | Args: 70 | idx: (int) image index. 71 | 72 | Returns: 73 | img: (tensor) image tensor. 74 | boxes: (tensor) bounding box targets. 75 | labels: (tensor) class label targets. 76 | ''' 77 | # Load image and boxes. 78 | data = self.dataes[idx] 79 | img = Image.open(os.path.join(self.root, data.imgName)).convert('RGB') 80 | 81 | boxes = torch.from_numpy(np.array(data.bboxes, dtype=np.float32)) 82 | labels = torch.from_numpy(np.array(data.labels, dtype=np.int64)) 83 | if self.transform: 84 | img, boxes, labels = self.transform(img, boxes, labels) 85 | return img, boxes, labels 86 | 87 | def __len__(self): 88 | return self.num_imgs 89 | 90 | 91 | if __name__ == '__main__': 92 | import cv2 93 | from PIL import ImageDraw 94 | root = '../datasets/voc/VOC2012/JPEGImages' 95 | list_file = '../datasets/voc/voc12_trainval.txt' 96 | dataset = DataSet(root, list_file) 97 | 98 | num = len(dataset) 99 | print('num: ', num) 100 | for i in range(num): 101 | img, boxes, labels = dataset[i] 102 | 103 | # image = transforms.ToPILImage(img) 104 | image = img 105 | imageDraw = ImageDraw.Draw(image) 106 | num_obj, _ = boxes.shape 107 | for j in range(num_obj): 108 | imageDraw.rectangle([boxes[j][0], boxes[j][1], boxes[j][2], boxes[j][3]], outline='red') 109 | image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR) 110 | cv2.imshow("OpenCV", image) 111 | cv2.waitKey(1000) -------------------------------------------------------------------------------- /dataloader/en_decoder.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | '''Encode object boxes and labels.''' 4 | import math 5 | import torch 6 | 7 | from dataloader.utils import meshgrid 8 | from dataloader.utils import box_iou, box_nms, change_box_order 9 | 10 | 11 | class RetinaBoxCoder(object): 12 | def __init__(self, imgSize=640): 13 | if isinstance(imgSize, (int, float)): 14 | self.imgW = imgSize 15 | self.imgH = imgSize 16 | else: 17 | self.imgW = imgSize[0] 18 | self.imgH = imgSize[1] 19 | self.anchor_areas = (32*32., 64*64., 128*128., 256*256., 512*512.) # p3 -> p7 20 | self.aspect_ratios = (1/2., 1/1., 2/1.) 21 | self.scale_ratios = (1., pow(2, 1/3.), pow(2, 2/3.)) 22 | self.anchor_boxes = self._get_anchor_boxes(imgSize=torch.FloatTensor([self.imgW, self.imgH])) # [x,y,x,y] 23 | 24 | def _get_anchor_wh(self): 25 | '''Compute anchor width and height for each feature map. 26 | 27 | Returns: 28 | anchor_wh: (tensor) anchor wh, sized [#fm, #anchors_per_cell, 2]. 
29 | ''' 30 | anchor_wh = [] 31 | for s in self.anchor_areas: 32 | for ar in self.aspect_ratios: # w/h = ar 33 | h = math.sqrt(s/ar) 34 | w = ar * h 35 | for sr in self.scale_ratios: # scale 36 | anchor_h = h*sr 37 | anchor_w = w*sr 38 | anchor_wh.append([anchor_w, anchor_h]) 39 | num_fms = len(self.anchor_areas) 40 | return torch.Tensor(anchor_wh).view(num_fms, -1, 2) 41 | 42 | def _get_anchor_boxes(self, imgSize): 43 | '''Compute anchor boxes for each feature map. 44 | 45 | Args: 46 | imgSize: (tensor) model input size of (w,h). 47 | 48 | Returns: 49 | boxes: (list) anchor boxes for each feature map. Each of size [#anchors,4], 50 | where #anchors = fmw * fmh * #anchors_per_cell 51 | ''' 52 | num_fms = len(self.anchor_areas) 53 | anchor_wh = self._get_anchor_wh() 54 | fm_sizes = [(imgSize/pow(2.0, i+3)).ceil() for i in range(num_fms)] # p3 -> p7 feature map sizes 55 | 56 | boxes = [] 57 | for i in range(num_fms): 58 | fm_size = fm_sizes[i] 59 | grid_size = imgSize / fm_size 60 | fm_w, fm_h = int(fm_size[0]), int(fm_size[1]) 61 | xy = meshgrid(fm_w, fm_h) + 0.5 # [fm_h*fm_w, 2] 62 | xy = (xy*grid_size).view(fm_h, fm_w, 1, 2).expand(fm_h, fm_w, 9, 2) 63 | wh = anchor_wh[i].view(1,1,9,2).expand(fm_h,fm_w,9,2) 64 | box = torch.cat([xy-wh/2.0, xy+wh/2.0], 3) # [x,y,x,y] 65 | boxes.append(box.view(-1,4)) 66 | return torch.cat(boxes, 0) 67 | 68 | def encode(self, boxes, labels): 69 | '''Encode target bounding boxes and class labels. 70 | 71 | We obey the Faster RCNN box coder: 72 | tx = (x - anchor_x) / anchor_w 73 | ty = (y - anchor_y) / anchor_h 74 | tw = log(w / anchor_w) 75 | th = log(h / anchor_h) 76 | 77 | Args: 78 | boxes: (tensor) bounding boxes of (xmin,ymin,xmax,ymax), sized [#obj, 4]. 79 | labels: (tensor) object class labels, sized [#obj,]. 80 | 81 | Returns: 82 | loc_targets: (tensor) encoded bounding boxes, sized [#anchors,4]. 83 | cls_targets: (tensor) encoded class labels, sized [#anchors,]. 84 | ''' 85 | anchor_boxes = self.anchor_boxes 86 | ious = box_iou(anchor_boxes, boxes) 87 | max_ious, max_ids = ious.max(1) # 每个默认窗口最大iou对应的boxes 88 | boxes = boxes[max_ids] 89 | 90 | boxes = change_box_order(boxes, 'xyxy2xywh') 91 | anchor_boxes = change_box_order(anchor_boxes, 'xyxy2xywh') 92 | 93 | loc_xy = (boxes[:, :2] - anchor_boxes[:, :2]) / anchor_boxes[:, 2:] 94 | loc_wh = torch.log(boxes[:, 2:] / anchor_boxes[:, 2:]) 95 | loc_targets = torch.cat([loc_xy, loc_wh], 1) 96 | cls_targets = 1 + labels[max_ids] 97 | 98 | cls_targets[max_ious<0.5] = 0 99 | ignore = (max_ious>0.3) & (max_ious<0.5) # ignore ious between [0.3,0.5] 100 | cls_targets[ignore] = -1 # mark ignored to -1 101 | return loc_targets, cls_targets 102 | 103 | def decode(self, loc_preds, cls_preds, input_size): 104 | '''Decode outputs back to bouding box locations and class labels. 105 | 106 | Args: 107 | loc_preds: (tensor) predicted locations, sized [#anchors, 4]. 108 | cls_preds: (tensor) predicted class labels, sized [#anchors, #classes]. 109 | input_size: (tuple) model input size of (w,h). 110 | 111 | Returns: 112 | boxes: (tensor) decode box locations, sized [#obj,4]. 113 | labels: (tensor) class labels for each box, sized [#obj,]. 
114 | ''' 115 | CLS_THRESH = 0.5 116 | NMS_THRESH = 0.5 117 | 118 | input_size = torch.FloatTensor(input_size) 119 | anchor_boxes = self._get_anchor_boxes(input_size) # xywh 120 | 121 | loc_xy = loc_preds[:, :2] 122 | loc_wh = loc_preds[:, 2:] 123 | 124 | xy = loc_xy * anchor_boxes[:, 2:] + anchor_boxes[:, :2] 125 | wh = loc_wh.exp() * anchor_boxes[:, 2:] 126 | boxes = torch.cat([xy-wh/2, xy+wh/2], 1) # [#anchors,4] 127 | 128 | score, labels = cls_preds.sigmoid().max(1) # [#anchors,] 129 | ids = score > CLS_THRESH 130 | ids = ids.nonzero().squeeze() # [#obj,] 131 | if (len(ids) == 0): 132 | max = torch.max(score) 133 | ids = score >= max 134 | ids = ids.nonzero().squeeze() 135 | keep = box_nms(boxes[ids], score[ids], threshold=NMS_THRESH) 136 | return boxes[ids][keep], labels[ids][keep], score[ids][keep] 137 | -------------------------------------------------------------------------------- /dataloader/getdataloader.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import torch 4 | import torchvision.transforms as transforms 5 | 6 | from dataloader.dataset import DataSet 7 | from dataloader.en_decoder import RetinaBoxCoder 8 | from dataloader.dataaugmentor import DataAugmentor 9 | 10 | 11 | box_coder = RetinaBoxCoder(imgSize=640) 12 | dataugmentor = DataAugmentor(imgSize=640) 13 | 14 | def transform_train(img, boxes, labels): 15 | img, boxes = dataugmentor.random_flip(img, boxes) 16 | img, boxes = dataugmentor.resize(img, boxes) 17 | img = dataugmentor.pad(img) 18 | img = transforms.Compose([ 19 | transforms.ToTensor(), 20 | transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)) 21 | ])(img) 22 | boxes, labels = box_coder.encode(boxes, labels) 23 | return img, boxes, labels 24 | 25 | 26 | def transform_test(img, boxes, labels): 27 | img, boxes = dataugmentor.resize(img, boxes) 28 | img = dataugmentor.pad(img) 29 | img = transforms.Compose([ 30 | transforms.ToTensor(), 31 | transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)) 32 | ])(img) 33 | boxes, labels = box_coder.encode(boxes, labels) 34 | return img, boxes, labels 35 | 36 | 37 | def get_data_loader(conf): 38 | trainset = DataSet(root=conf.train_root, 39 | list_file=conf.train_label_file, 40 | transform=transform_train) 41 | 42 | testset = DataSet(root=conf.test_root, 43 | list_file=conf.test_label_file, 44 | transform=transform_test) 45 | 46 | trainLoader = torch.utils.data.DataLoader(trainset, batch_size=conf.batch_size, shuffle=True, num_workers=conf.n_workers) 47 | testLoader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False, num_workers=conf.n_workers) 48 | 49 | return trainLoader, testLoader 50 | 51 | 52 | if __name__ == '__main__': 53 | import argparse 54 | parser = argparse.ArgumentParser(description='getDataLoader test') 55 | 56 | config = parser.parse_args() 57 | config.train_root = '../datasets/voc/VOC2007/JPEGImages' 58 | config.train_label_file = '../datasets/voc/voc07_trainval.txt' 59 | config.test_root = '../datasets/voc/VOC2007/JPEGImages' 60 | config.test_label_file = '../datasets/voc//voc07_test.txt' 61 | config.batch_size = 1 62 | config.n_workers = 4 63 | 64 | detransforms = transforms.Compose([ 65 | transforms.Normalize((-0.485/0.229, -0.456/0.224, -0.406/0.225), (1/0.229, 1/0.224, 1/0.225)), 66 | transforms.ToPILImage(), 67 | ]) 68 | 69 | args = vars(config) 70 | print('------------ Options -------------') 71 | for key, value in sorted(args.items()): 72 | print('%16.16s: %16.16s' % (str(key), 
str(value))) 73 | print('-------------- End ----------------') 74 | 75 | trainLoader, testLoader = get_data_loader(config) 76 | print('train samples num: ', len(trainLoader), ' test samples num: ', len(testLoader)) 77 | 78 | import cv2 79 | import numpy as np 80 | for ii, (img, boxes, labels) in enumerate(trainLoader): 81 | img = detransforms(img[0]) 82 | W, H = img.size 83 | 84 | image = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR) 85 | cv2.imshow("OpenCV", image) 86 | cv2.waitKey(1000) 87 | print(boxes.shape) 88 | print(labels.shape) 89 | print(ii) 90 | -------------------------------------------------------------------------------- /dataloader/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def change_box_order(boxes, order): 5 | '''Change box order between (xmin,ymin,xmax,ymax) and (xcenter,ycenter,width,height). 6 | 7 | Args: 8 | boxes: (tensor) bounding boxes, sized [N,4]. 9 | order: (str) either 'xyxy2xywh' or 'xywh2xyxy'. 10 | 11 | Returns: 12 | (tensor) converted bounding boxes, sized [N,4]. 13 | ''' 14 | assert order in ['xyxy2xywh','xywh2xyxy'] 15 | a = boxes[:,:2] 16 | b = boxes[:,2:] 17 | if order == 'xyxy2xywh': 18 | return torch.cat([(a+b)/2,b-a], 1) 19 | return torch.cat([a-b/2,a+b/2], 1) 20 | 21 | 22 | def box_clamp(boxes, xmin, ymin, xmax, ymax): 23 | '''Clamp boxes. 24 | 25 | Args: 26 | boxes: (tensor) bounding boxes of (xmin,ymin,xmax,ymax), sized [N,4]. 27 | xmin: (number) min value of x. 28 | ymin: (number) min value of y. 29 | xmax: (number) max value of x. 30 | ymax: (number) max value of y. 31 | 32 | Returns: 33 | (tensor) clamped boxes. 34 | ''' 35 | boxes[:,0].clamp_(min=xmin, max=xmax) 36 | boxes[:,1].clamp_(min=ymin, max=ymax) 37 | boxes[:,2].clamp_(min=xmin, max=xmax) 38 | boxes[:,3].clamp_(min=ymin, max=ymax) 39 | return boxes 40 | 41 | 42 | def box_select(boxes, xmin, ymin, xmax, ymax): 43 | '''Select boxes in range (xmin,ymin,xmax,ymax). 44 | 45 | Args: 46 | boxes: (tensor) bounding boxes of (xmin,ymin,xmax,ymax), sized [N,4]. 47 | xmin: (number) min value of x. 48 | ymin: (number) min value of y. 49 | xmax: (number) max value of x. 50 | ymax: (number) max value of y. 51 | 52 | Returns: 53 | (tensor) selected boxes, sized [M,4]. 54 | (tensor) selected mask, sized [N,]. 55 | ''' 56 | mask = (boxes[:,0]>=xmin) & (boxes[:,1]>=ymin) \ 57 | & (boxes[:,2]<=xmax) & (boxes[:,3]<=ymax) 58 | boxes = boxes[mask,:] 59 | return boxes, mask 60 | 61 | 62 | def box_iou(box1, box2): 63 | '''Compute the intersection over union of two set of boxes. 64 | 65 | The box order must be (xmin, ymin, xmax, ymax). 66 | 67 | Args: 68 | box1: (tensor) bounding boxes, sized [N,4]. 69 | box2: (tensor) bounding boxes, sized [M,4]. 70 | 71 | Return: 72 | (tensor) iou, sized [N,M]. 73 | 74 | Reference: 75 | https://github.com/chainer/chainercv/blob/master/chainercv/utils/bbox/bbox_iou.py 76 | ''' 77 | N = box1.size(0) 78 | M = box2.size(0) 79 | 80 | lt = torch.max(box1[:,None,:2], box2[:,:2]) # [N,M,2] 81 | rb = torch.min(box1[:,None,2:], box2[:,2:]) # [N,M,2] 82 | 83 | wh = (rb-lt).clamp(min=0) # [N,M,2] 84 | inter = wh[:,:,0] * wh[:,:,1] # [N,M] 85 | 86 | area1 = (box1[:,2]-box1[:,0]) * (box1[:,3]-box1[:,1]) # [N,] 87 | area2 = (box2[:,2]-box2[:,0]) * (box2[:,3]-box2[:,1]) # [M,] 88 | iou = inter / (area1[:,None] + area2 - inter) 89 | return iou 90 | 91 | 92 | def box_nms(bboxes, scores, threshold=0.5): 93 | '''Non maximum suppression. 94 | 95 | Args: 96 | bboxes: (tensor) bounding boxes, sized [N,4]. 
97 | scores: (tensor) confidence scores, sized [N,]. 98 | threshold: (float) overlap threshold. 99 | 100 | Returns: 101 | keep: (tensor) selected indices. 102 | 103 | Reference: 104 | https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/py_cpu_nms.py 105 | ''' 106 | x1 = bboxes[:,0] 107 | y1 = bboxes[:,1] 108 | x2 = bboxes[:,2] 109 | y2 = bboxes[:,3] 110 | 111 | areas = (x2-x1) * (y2-y1) 112 | _, order = scores.sort(0, descending=True) 113 | 114 | keep = [] 115 | while order.numel() > 0: 116 | i = order[0] 117 | keep.append(i) 118 | 119 | if order.numel() == 1: 120 | break 121 | 122 | xx1 = x1[order[1:]].clamp(min=x1[i]) 123 | yy1 = y1[order[1:]].clamp(min=y1[i]) 124 | xx2 = x2[order[1:]].clamp(max=x2[i]) 125 | yy2 = y2[order[1:]].clamp(max=y2[i]) 126 | 127 | w = (xx2-xx1).clamp(min=0) 128 | h = (yy2-yy1).clamp(min=0) 129 | inter = w * h 130 | 131 | overlap = inter / (areas[i] + areas[order[1:]] - inter) 132 | ids = (overlap<=threshold).nonzero().squeeze() 133 | if ids.numel() == 0: 134 | break 135 | order = order[ids+1] 136 | return torch.LongTensor(keep) 137 | 138 | 139 | def meshgrid(x, y, row_major=True): 140 | '''Return meshgrid in range x & y. 141 | 142 | Args: 143 | x: (int) first dim range. 144 | y: (int) second dim range. 145 | row_major: (bool) row major or column major. 146 | 147 | Returns: 148 | (tensor) meshgrid, sized [x*y,2] 149 | 150 | Example: 151 | >> meshgrid(3,2) 152 | 0 0 153 | 1 0 154 | 2 0 155 | 0 1 156 | 1 1 157 | 2 1 158 | [torch.FloatTensor of size 6x2] 159 | 160 | >> meshgrid(3,2,row_major=False) 161 | 0 0 162 | 0 1 163 | 0 2 164 | 1 0 165 | 1 1 166 | 1 2 167 | [torch.FloatTensor of size 6x2] 168 | ''' 169 | a = torch.arange(0,x) 170 | b = torch.arange(0,y) 171 | xx = a.repeat(y).view(-1,1) 172 | yy = b.view(-1,1).repeat(1,x).view(-1,1) 173 | return torch.cat([xx,yy],1) if row_major else torch.cat([yy,xx],1) 174 | -------------------------------------------------------------------------------- /evaluations/__init__.py: -------------------------------------------------------------------------------- 1 | from evaluations.voc_eval import voc_eval 2 | -------------------------------------------------------------------------------- /evaluations/voc_eval.py: -------------------------------------------------------------------------------- 1 | '''Compute PASCAL_VOC MAP. 
2 | 3 | Reference: 4 | https://github.com/chainer/chainercv/blob/master/chainercv/evaluations/eval_detection_voc.py 5 | ''' 6 | from __future__ import division 7 | 8 | import six 9 | import itertools 10 | import numpy as np 11 | 12 | from collections import defaultdict 13 | 14 | 15 | def voc_eval(pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels, 16 | gt_difficults=None, iou_thresh=0.5, use_07_metric=True): 17 | '''Wrap VOC evaluation for PyTorch.''' 18 | pred_bboxes = [xy2yx(b).numpy() for b in pred_bboxes] 19 | pred_labels = [label.numpy() for label in pred_labels] 20 | pred_scores = [score.numpy() for score in pred_scores] 21 | gt_bboxes = [xy2yx(b).numpy() for b in gt_bboxes] 22 | gt_labels = [label.numpy() for label in gt_labels] 23 | return eval_detection_voc( 24 | pred_bboxes, pred_labels, pred_scores, gt_bboxes, 25 | gt_labels, gt_difficults, iou_thresh, use_07_metric) 26 | 27 | def xy2yx(boxes): 28 | '''Convert box (xmin,ymin,xmax,ymax) to (ymin,xmin,ymax,xmax).''' 29 | c0 = boxes[:,0].clone() 30 | c2 = boxes[:,2].clone() 31 | boxes[:,0] = boxes[:,1] 32 | boxes[:,1] = c0 33 | boxes[:,2] = boxes[:,3] 34 | boxes[:,3] = c2 35 | return boxes 36 | 37 | def bbox_iou(bbox_a, bbox_b): 38 | '''Calculate the Intersection of Unions (IoUs) between bounding boxes. 39 | 40 | Args: 41 | bbox_a (array): An array whose shape is :math:`(N, 4)`. 42 | :math:`N` is the number of bounding boxes. 43 | The dtype should be :obj:`numpy.float32`. 44 | bbox_b (array): An array similar to :obj:`bbox_a`, 45 | whose shape is :math:`(K, 4)`. 46 | The dtype should be :obj:`numpy.float32`. 47 | 48 | Returns: 49 | array: 50 | An array whose shape is :math:`(N, K)`. \ 51 | An element at index :math:`(n, k)` contains IoUs between \ 52 | :math:`n` th bounding box in :obj:`bbox_a` and :math:`k` th bounding \ 53 | box in :obj:`bbox_b`. 54 | ''' 55 | # top left 56 | tl = np.maximum(bbox_a[:, None, :2], bbox_b[:, :2]) 57 | # bottom right 58 | br = np.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:]) 59 | 60 | area_i = np.prod(br - tl, axis=2) * (tl < br).all(axis=2) 61 | area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1) 62 | area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1) 63 | return area_i / (area_a[:, None] + area_b - area_i) 64 | 65 | def eval_detection_voc( 66 | pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels, 67 | gt_difficults=None, 68 | iou_thresh=0.5, use_07_metric=False): 69 | """Calculate average precisions based on evaluation code of PASCAL VOC. 70 | 71 | This function evaluates predicted bounding boxes obtained from a dataset 72 | which has :math:`N` images by using average precision for each class. 73 | The code is based on the evaluation code used in PASCAL VOC Challenge. 74 | 75 | Args: 76 | pred_bboxes (iterable of numpy.ndarray): An iterable of :math:`N` 77 | sets of bounding boxes. 78 | Its index corresponds to an index for the base dataset. 79 | Each element of :obj:`pred_bboxes` is a set of coordinates 80 | of bounding boxes. This is an array whose shape is :math:`(R, 4)`, 81 | where :math:`R` corresponds 82 | to the number of bounding boxes, which may vary among boxes. 83 | The second axis corresponds to 84 | :math:`y_{min}, x_{min}, y_{max}, x_{max}` of a bounding box. 85 | pred_labels (iterable of numpy.ndarray): An iterable of labels. 86 | Similar to :obj:`pred_bboxes`, its index corresponds to an 87 | index for the base dataset. Its length is :math:`N`. 88 | pred_scores (iterable of numpy.ndarray): An iterable of confidence 89 | scores for predicted bounding boxes. 
Similar to :obj:`pred_bboxes`, 90 | its index corresponds to an index for the base dataset. 91 | Its length is :math:`N`. 92 | gt_bboxes (iterable of numpy.ndarray): An iterable of ground truth 93 | bounding boxes 94 | whose length is :math:`N`. An element of :obj:`gt_bboxes` is a 95 | bounding box whose shape is :math:`(R, 4)`. Note that the number of 96 | bounding boxes in each image does not need to be same as the number 97 | of corresponding predicted boxes. 98 | gt_labels (iterable of numpy.ndarray): An iterable of ground truth 99 | labels which are organized similarly to :obj:`gt_bboxes`. 100 | gt_difficults (iterable of numpy.ndarray): An iterable of boolean 101 | arrays which is organized similarly to :obj:`gt_bboxes`. 102 | This tells whether the 103 | corresponding ground truth bounding box is difficult or not. 104 | By default, this is :obj:`None`. In that case, this function 105 | considers all bounding boxes to be not difficult. 106 | iou_thresh (float): A prediction is correct if its Intersection over 107 | Union with the ground truth is above this value. 108 | use_07_metric (bool): Whether to use PASCAL VOC 2007 evaluation metric 109 | for calculating average precision. The default value is 110 | :obj:`False`. 111 | 112 | Returns: 113 | dict: 114 | 115 | The keys, value-types and the description of the values are listed 116 | below. 117 | 118 | * **ap** (*numpy.ndarray*): An array of average precisions. \ 119 | The :math:`l`-th value corresponds to the average precision \ 120 | for class :math:`l`. If class :math:`l` does not exist in \ 121 | either :obj:`pred_labels` or :obj:`gt_labels`, the corresponding \ 122 | value is set to :obj:`numpy.nan`. 123 | * **map** (*float*): The average of Average Precisions over classes. 124 | 125 | """ 126 | 127 | prec, rec = calc_detection_voc_prec_rec( 128 | pred_bboxes, pred_labels, pred_scores, 129 | gt_bboxes, gt_labels, gt_difficults, 130 | iou_thresh=iou_thresh) 131 | 132 | ap = calc_detection_voc_ap(prec, rec, use_07_metric=use_07_metric) 133 | 134 | return {'ap': ap, 'map': np.nanmean(ap)} 135 | 136 | 137 | def calc_detection_voc_prec_rec( 138 | pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels, 139 | gt_difficults=None, 140 | iou_thresh=0.5): 141 | """Calculate precision and recall based on evaluation code of PASCAL VOC. 142 | 143 | This function calculates precision and recall of 144 | predicted bounding boxes obtained from a dataset which has :math:`N` 145 | images. 146 | The code is based on the evaluation code used in PASCAL VOC Challenge. 147 | 148 | Args: 149 | pred_bboxes (iterable of numpy.ndarray): An iterable of :math:`N` 150 | sets of bounding boxes. 151 | Its index corresponds to an index for the base dataset. 152 | Each element of :obj:`pred_bboxes` is a set of coordinates 153 | of bounding boxes. This is an array whose shape is :math:`(R, 4)`, 154 | where :math:`R` corresponds 155 | to the number of bounding boxes, which may vary among boxes. 156 | The second axis corresponds to 157 | :math:`y_{min}, x_{min}, y_{max}, x_{max}` of a bounding box. 158 | pred_labels (iterable of numpy.ndarray): An iterable of labels. 159 | Similar to :obj:`pred_bboxes`, its index corresponds to an 160 | index for the base dataset. Its length is :math:`N`. 161 | pred_scores (iterable of numpy.ndarray): An iterable of confidence 162 | scores for predicted bounding boxes. Similar to :obj:`pred_bboxes`, 163 | its index corresponds to an index for the base dataset. 164 | Its length is :math:`N`. 
165 | gt_bboxes (iterable of numpy.ndarray): An iterable of ground truth 166 | bounding boxes 167 | whose length is :math:`N`. An element of :obj:`gt_bboxes` is a 168 | bounding box whose shape is :math:`(R, 4)`. Note that the number of 169 | bounding boxes in each image does not need to be same as the number 170 | of corresponding predicted boxes. 171 | gt_labels (iterable of numpy.ndarray): An iterable of ground truth 172 | labels which are organized similarly to :obj:`gt_bboxes`. 173 | gt_difficults (iterable of numpy.ndarray): An iterable of boolean 174 | arrays which is organized similarly to :obj:`gt_bboxes`. 175 | This tells whether the 176 | corresponding ground truth bounding box is difficult or not. 177 | By default, this is :obj:`None`. In that case, this function 178 | considers all bounding boxes to be not difficult. 179 | iou_thresh (float): A prediction is correct if its Intersection over 180 | Union with the ground truth is above this value.. 181 | 182 | Returns: 183 | tuple of two lists: 184 | This function returns two lists: :obj:`prec` and :obj:`rec`. 185 | 186 | * :obj:`prec`: A list of arrays. :obj:`prec[l]` is precision \ 187 | for class :math:`l`. If class :math:`l` does not exist in \ 188 | either :obj:`pred_labels` or :obj:`gt_labels`, :obj:`prec[l]` is \ 189 | set to :obj:`None`. 190 | * :obj:`rec`: A list of arrays. :obj:`rec[l]` is recall \ 191 | for class :math:`l`. If class :math:`l` that is not marked as \ 192 | difficult does not exist in \ 193 | :obj:`gt_labels`, :obj:`rec[l]` is \ 194 | set to :obj:`None`. 195 | 196 | """ 197 | 198 | pred_bboxes = iter(pred_bboxes) 199 | pred_labels = iter(pred_labels) 200 | pred_scores = iter(pred_scores) 201 | gt_bboxes = iter(gt_bboxes) 202 | gt_labels = iter(gt_labels) 203 | if gt_difficults is None: 204 | gt_difficults = itertools.repeat(None) 205 | else: 206 | gt_difficults = iter(gt_difficults) 207 | 208 | n_pos = defaultdict(int) 209 | score = defaultdict(list) 210 | match = defaultdict(list) 211 | 212 | for pred_bbox, pred_label, pred_score, gt_bbox, gt_label, gt_difficult in \ 213 | six.moves.zip( 214 | pred_bboxes, pred_labels, pred_scores, 215 | gt_bboxes, gt_labels, gt_difficults): 216 | 217 | if gt_difficult is None: 218 | gt_difficult = np.zeros(gt_bbox.shape[0], dtype=bool) 219 | 220 | for l in np.unique(np.concatenate((pred_label, gt_label)).astype(int)): 221 | pred_mask_l = pred_label == l 222 | pred_bbox_l = pred_bbox[pred_mask_l] 223 | pred_score_l = pred_score[pred_mask_l] 224 | # sort by score 225 | order = pred_score_l.argsort()[::-1] 226 | pred_bbox_l = pred_bbox_l[order] 227 | pred_score_l = pred_score_l[order] 228 | 229 | gt_mask_l = gt_label == l 230 | gt_bbox_l = gt_bbox[gt_mask_l] 231 | gt_difficult_l = np.array(gt_difficult)[gt_mask_l] 232 | 233 | n_pos[l] += np.logical_not(gt_difficult_l).sum() 234 | score[l].extend(pred_score_l) 235 | 236 | if len(pred_bbox_l) == 0: 237 | continue 238 | if len(gt_bbox_l) == 0: 239 | match[l].extend((0,) * pred_bbox_l.shape[0]) 240 | continue 241 | 242 | # VOC evaluation follows integer typed bounding boxes. 
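            # The +1 below makes xmax/ymax inclusive pixel coordinates, so widths/heights
            # become (max - min + 1) as in the original devkit.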
243 | pred_bbox_l = pred_bbox_l.copy() 244 | pred_bbox_l[:, 2:] += 1 245 | gt_bbox_l = gt_bbox_l.copy() 246 | gt_bbox_l[:, 2:] += 1 247 | 248 | iou = bbox_iou(pred_bbox_l, gt_bbox_l) 249 | gt_index = iou.argmax(axis=1) 250 | # set -1 if there is no matching ground truth 251 | gt_index[iou.max(axis=1) < iou_thresh] = -1 252 | del iou 253 | 254 | selec = np.zeros(gt_bbox_l.shape[0], dtype=bool) 255 | for gt_idx in gt_index: 256 | if gt_idx >= 0: 257 | if gt_difficult_l[gt_idx]: 258 | match[l].append(-1) 259 | else: 260 | if not selec[gt_idx]: 261 | match[l].append(1) 262 | else: 263 | match[l].append(0) 264 | selec[gt_idx] = True 265 | else: 266 | match[l].append(0) 267 | 268 | for iter_ in ( 269 | pred_bboxes, pred_labels, pred_scores, 270 | gt_bboxes, gt_labels, gt_difficults): 271 | if next(iter_, None) is not None: 272 | raise ValueError('Length of input iterables need to be same.') 273 | 274 | n_fg_class = max(n_pos.keys()) + 1 275 | prec = [None] * n_fg_class 276 | rec = [None] * n_fg_class 277 | 278 | for l in n_pos.keys(): 279 | score_l = np.array(score[l]) 280 | match_l = np.array(match[l], dtype=np.int8) 281 | 282 | order = score_l.argsort()[::-1] 283 | match_l = match_l[order] 284 | 285 | tp = np.cumsum(match_l == 1) 286 | fp = np.cumsum(match_l == 0) 287 | 288 | # If an element of fp + tp is 0, 289 | # the corresponding element of prec[l] is nan. 290 | prec[l] = tp / (fp + tp) 291 | # If n_pos[l] is 0, rec[l] is None. 292 | if n_pos[l] > 0: 293 | rec[l] = tp / n_pos[l] 294 | 295 | return prec, rec 296 | 297 | 298 | def calc_detection_voc_ap(prec, rec, use_07_metric=False): 299 | """Calculate average precisions based on evaluation code of PASCAL VOC. 300 | 301 | This function calculates average precisions 302 | from given precisions and recalls. 303 | The code is based on the evaluation code used in PASCAL VOC Challenge. 304 | 305 | Args: 306 | prec (list of numpy.array): A list of arrays. 307 | :obj:`prec[l]` indicates precision for class :math:`l`. 308 | If :obj:`prec[l]` is :obj:`None`, this function returns 309 | :obj:`numpy.nan` for class :math:`l`. 310 | rec (list of numpy.array): A list of arrays. 311 | :obj:`rec[l]` indicates recall for class :math:`l`. 312 | If :obj:`rec[l]` is :obj:`None`, this function returns 313 | :obj:`numpy.nan` for class :math:`l`. 314 | use_07_metric (bool): Whether to use PASCAL VOC 2007 evaluation metric 315 | for calculating average precision. The default value is 316 | :obj:`False`. 317 | 318 | Returns: 319 | ~numpy.ndarray: 320 | This function returns an array of average precisions. 321 | The :math:`l`-th value corresponds to the average precision 322 | for class :math:`l`. If :obj:`prec[l]` or :obj:`rec[l]` is 323 | :obj:`None`, the corresponding value is set to :obj:`numpy.nan`. 
324 | 325 | """ 326 | 327 | n_fg_class = len(prec) 328 | ap = np.empty(n_fg_class) 329 | for l in six.moves.range(n_fg_class): 330 | if prec[l] is None or rec[l] is None: 331 | ap[l] = np.nan 332 | continue 333 | 334 | if use_07_metric: 335 | # 11 point metric 336 | ap[l] = 0 337 | for t in np.arange(0., 1.1, 0.1): 338 | if np.sum(rec[l] >= t) == 0: 339 | p = 0 340 | else: 341 | p = np.max(np.nan_to_num(prec[l])[rec[l] >= t]) 342 | ap[l] += p / 11 343 | else: 344 | # correct AP calculation 345 | # first append sentinel values at the end 346 | mpre = np.concatenate(([0], np.nan_to_num(prec[l]), [0])) 347 | mrec = np.concatenate(([0], rec[l], [1])) 348 | 349 | mpre = np.maximum.accumulate(mpre[::-1])[::-1] 350 | 351 | # to calculate area under PR curve, look for points 352 | # where X axis (recall) changes value 353 | i = np.where(mrec[1:] != mrec[:-1])[0] 354 | 355 | # and sum (\Delta recall) * prec 356 | ap[l] = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]) 357 | 358 | return ap 359 | -------------------------------------------------------------------------------- /imgs/DSOD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/DSOD.png -------------------------------------------------------------------------------- /imgs/DSSD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/DSSD.png -------------------------------------------------------------------------------- /imgs/DeformableConv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/DeformableConv.png -------------------------------------------------------------------------------- /imgs/DenseBox.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/DenseBox.png -------------------------------------------------------------------------------- /imgs/DetectorNet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/DetectorNet.png -------------------------------------------------------------------------------- /imgs/ESSD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/ESSD.png -------------------------------------------------------------------------------- /imgs/Extension_module.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Extension_module.png -------------------------------------------------------------------------------- /imgs/FCN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FCN.png -------------------------------------------------------------------------------- /imgs/FCN_in_test.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FCN_in_test.png -------------------------------------------------------------------------------- /imgs/FPN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FPN.png -------------------------------------------------------------------------------- /imgs/FSSD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FSSD.png -------------------------------------------------------------------------------- /imgs/FaceBoxes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FaceBoxes.png -------------------------------------------------------------------------------- /imgs/Fast_R-CNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Fast_R-CNN.png -------------------------------------------------------------------------------- /imgs/Faster_R-CNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Faster_R-CNN.png -------------------------------------------------------------------------------- /imgs/FocalLoss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/FocalLoss.png -------------------------------------------------------------------------------- /imgs/Instance_segmentation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Instance_segmentation.png -------------------------------------------------------------------------------- /imgs/Light-Head.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Light-Head.png -------------------------------------------------------------------------------- /imgs/MTCNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/MTCNN.png -------------------------------------------------------------------------------- /imgs/MaskX.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/MaskX.png -------------------------------------------------------------------------------- /imgs/MaskX_show.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/MaskX_show.png -------------------------------------------------------------------------------- /imgs/Mask_R-CNN.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/Mask_R-CNN.png -------------------------------------------------------------------------------- /imgs/R-CNN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/R-CNN.png -------------------------------------------------------------------------------- /imgs/R-FCN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/R-FCN.png -------------------------------------------------------------------------------- /imgs/RFB_module.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/RFB_module.png -------------------------------------------------------------------------------- /imgs/ROIAlign.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/ROIAlign.png -------------------------------------------------------------------------------- /imgs/RetinaNet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/RetinaNet.png -------------------------------------------------------------------------------- /imgs/SPP-net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/SPP-net.png -------------------------------------------------------------------------------- /imgs/SSD.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/SSD.png -------------------------------------------------------------------------------- /imgs/SSD_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/SSD_model.png -------------------------------------------------------------------------------- /imgs/YOLO.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/YOLO.png -------------------------------------------------------------------------------- /imgs/YOLO9000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/YOLO9000.png -------------------------------------------------------------------------------- /imgs/YOLO_Bbox.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/YOLO_Bbox.png -------------------------------------------------------------------------------- /imgs/YOLO_loss.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/YOLO_loss.png -------------------------------------------------------------------------------- /imgs/YOLOv2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/YOLOv2.png -------------------------------------------------------------------------------- /imgs/fc2Conv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/fc2Conv.png -------------------------------------------------------------------------------- /imgs/focal_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/focal_loss.png -------------------------------------------------------------------------------- /imgs/inference_YOLO.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/inference_YOLO.png -------------------------------------------------------------------------------- /imgs/offset_MaxPooling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/offset_MaxPooling.png -------------------------------------------------------------------------------- /imgs/position-sensitive_RoI_pooling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/position-sensitive_RoI_pooling.png -------------------------------------------------------------------------------- /imgs/receptive_field.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/receptive_field.png -------------------------------------------------------------------------------- /imgs/skip_layers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mandeer/Detector/4f5a94afcc12e413717431b61a717d8300b66ff8/imgs/skip_layers.png -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | from .retinaNet import RetinaNet -------------------------------------------------------------------------------- /models/loss/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | from models.loss.focal_loss import FocalLoss 4 | -------------------------------------------------------------------------------- /models/loss/focal_loss.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | from torch.autograd import Variable 7 | 8 | 9 | def one_hot_embedding(labels, num_classes): 10 | '''Embedding labels to one-hot form. 
11 | 12 | Args: 13 | labels: (LongTensor) class labels, sized [N,]. 14 | num_classes: (int) number of classes. 15 | 16 | Returns: 17 | (tensor) encoded labels, sized [N,#classes]. 18 | ''' 19 | y = torch.eye(num_classes) # [D,D] 20 | return y[labels] # [N,D] 21 | 22 | 23 | class FocalLoss(nn.Module): 24 | def __init__(self, num_classes=20): 25 | super(FocalLoss, self).__init__() 26 | self.num_classes = num_classes 27 | self.iteration = 0 28 | 29 | def focal_loss(self, x, y, change_alpha): 30 | '''Focal loss. 31 | Args: 32 | x: (tensor) sized [N,D]. 33 | y: (tensor) sized [N,]. 34 | Return: 35 | (tensor) focal loss. 36 | ''' 37 | alpha = max(0.25, 1 - pow(10, self.iteration // 1000) / 10000) 38 | gamma = 2 39 | if change_alpha and alpha > 0.25: 40 | self.iteration += 1 41 | print('iteration: ', self.iteration, ' alpha: ', alpha) 42 | 43 | t = one_hot_embedding(y.data.cpu(), 1+self.num_classes) # [N,21] 44 | t = t[:,1:] # exclude background 45 | t = Variable(t).cuda() # [N,20] 46 | 47 | p = x.sigmoid() 48 | delta = p*(1-t) + (1-p)*t # delta = 1-p if t > 0 else p 49 | at = alpha*t + (1-alpha)*(1-t) # at = alpha if t > 0 else 1-alpha 50 | w = at * delta.pow(gamma) 51 | return F.binary_cross_entropy_with_logits(x, t, w, size_average=False) 52 | 53 | def focal_loss_alt(self, x, y, change_alpha): 54 | '''Focal loss alternative. 55 | Args: 56 | x: (tensor) sized [N,D]. 57 | y: (tensor) sized [N,]. 58 | Return: 59 | (tensor) focal loss. 60 | ''' 61 | alpha = max(0.25, 1 - pow(10, self.iteration // 1000) / 10000) 62 | beta = 1 63 | gamma = 2 64 | if change_alpha and alpha > 0.25: 65 | self.iteration += 1 66 | print('iteration: ', self.iteration, ' alpha: ', alpha) 67 | 68 | 69 | t = one_hot_embedding(y.data.cpu(), 1+self.num_classes) 70 | t = t[:,1:] 71 | t = Variable(t).cuda() 72 | 73 | xt = x*(gamma*t-beta) # xt = x if t > 0 else -x 74 | pt = (gamma*xt+beta).sigmoid() 75 | 76 | at = alpha*t + (1-alpha)*(1-t) 77 | loss = -at*pt.log() / gamma 78 | return loss.sum() 79 | 80 | def forward(self, loc_preds, loc_targets, cls_preds, cls_targets, change_alpha=True): 81 | '''Compute loss between (loc_preds, loc_targets) and (cls_preds, cls_targets). 82 | Args: 83 | loc_preds: (tensor) predicted locations, sized [batch_size, #anchors, 4]. 84 | loc_targets: (tensor) encoded target locations, sized [batch_size, #anchors, 4]. 85 | cls_preds: (tensor) predicted class confidences, sized [batch_size, #anchors, #classes]. 86 | cls_targets: (tensor) encoded target labels, sized [batch_size, #anchors]. 87 | loss: 88 | (tensor) loss = SmoothL1Loss(loc_preds, loc_targets) + FocalLoss(cls_preds, cls_targets). 
89 | ''' 90 | batch_size, num_boxes = cls_targets.size() 91 | pos = cls_targets > 0 # [N,#anchors] 92 | num_pos = pos.data.long().sum() 93 | 94 | ################################################################ 95 | # loc_loss = SmoothL1Loss(pos_loc_preds, pos_loc_targets) 96 | ################################################################ 97 | mask = pos.unsqueeze(2).expand_as(loc_preds) # [N,#anchors,4] 98 | masked_loc_preds = loc_preds[mask].view(-1,4) # [#pos,4] 99 | masked_loc_targets = loc_targets[mask].view(-1,4) # [#pos,4] 100 | loc_loss = F.smooth_l1_loss(masked_loc_preds, masked_loc_targets, size_average=False) 101 | 102 | ################################################################ 103 | # cls_loss = FocalLoss(cls_preds, cls_targets) 104 | ################################################################ 105 | pos_neg = cls_targets > -1 # exclude ignored anchors 106 | mask = pos_neg.unsqueeze(2).expand_as(cls_preds) 107 | masked_cls_preds = cls_preds[mask].view(-1,self.num_classes) 108 | cls_loss = self.focal_loss(masked_cls_preds, cls_targets[pos_neg], change_alpha) 109 | 110 | # print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_pos), end=' | ') 111 | loss = (loc_loss + cls_loss)/(num_pos + 0.001) 112 | return loss 113 | 114 | 115 | if __name__ == '__main__': 116 | import numpy as np 117 | import matplotlib.pyplot as plt 118 | delta = np.arange(0, 1.01, 0.01) 119 | gamma = np.array([0, 0.5, 1, 2, 5]) 120 | 121 | a1 = plt.subplot(1, 2, 1) 122 | plt.title('The weight for Cross-Entropy loss') 123 | plt.xlabel('delta') 124 | plt.ylabel('weight') 125 | a2 = plt.subplot(1, 2, 2) 126 | plt.title('Focal Loss') 127 | plt.xlabel('delta') 128 | plt.ylabel('loss') 129 | for i in range(len(gamma)): 130 | weight = np.power(delta, gamma[i]) 131 | a1.plot(delta, weight, label='gamma: ' + str(gamma[i])) 132 | loss = -1 * weight * np.log(1-delta) 133 | a2.plot(delta, loss, label='gamma: ' + str(gamma[i])) 134 | 135 | a1.legend() 136 | a2.legend() 137 | plt.show() 138 | -------------------------------------------------------------------------------- /models/retinaNet/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | from .retina import RetinaNet -------------------------------------------------------------------------------- /models/retinaNet/fpn.py: -------------------------------------------------------------------------------- 1 | '''FPN in PyTorch.''' 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | 6 | 7 | class Bottleneck(nn.Module): 8 | expansion = 4 9 | 10 | def __init__(self, in_planes, planes, stride=1): 11 | super(Bottleneck, self).__init__() 12 | self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False) 13 | self.bn1 = nn.BatchNorm2d(planes) 14 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False) 15 | self.bn2 = nn.BatchNorm2d(planes) 16 | self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False) 17 | self.bn3 = nn.BatchNorm2d(self.expansion*planes) 18 | 19 | self.downsample = nn.Sequential() 20 | if stride != 1 or in_planes != self.expansion*planes: 21 | self.downsample = nn.Sequential( 22 | nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False), 23 | nn.BatchNorm2d(self.expansion*planes) 24 | ) 25 | 26 | def forward(self, x): 27 | out = F.relu(self.bn1(self.conv1(x))) 28 | out = F.relu(self.bn2(self.conv2(out))) 29 | out = 
self.bn3(self.conv3(out)) 30 | out += self.downsample(x) 31 | out = F.relu(out) 32 | return out 33 | 34 | 35 | class FPN(nn.Module): 36 | def __init__(self, block, num_blocks): 37 | super(FPN, self).__init__() 38 | self.in_planes = 64 39 | 40 | self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False) 41 | self.bn1 = nn.BatchNorm2d(64) 42 | 43 | # Bottom-up layers 44 | self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1) 45 | self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2) 46 | self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2) 47 | self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2) 48 | self.conv6 = nn.Conv2d(2048, 256, kernel_size=3, stride=2, padding=1) 49 | self.conv7 = nn.Conv2d( 256, 256, kernel_size=3, stride=2, padding=1) 50 | 51 | # Top-down layers 52 | self.toplayer = nn.Conv2d(2048, 256, kernel_size=1, stride=1, padding=0) 53 | 54 | # Lateral layers 55 | self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0) 56 | self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0) 57 | 58 | # Smooth layers 59 | self.smooth1 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1) 60 | self.smooth2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1) 61 | 62 | def _make_layer(self, block, planes, num_blocks, stride): 63 | strides = [stride] + [1]*(num_blocks-1) 64 | layers = [] 65 | for stride in strides: 66 | layers.append(block(self.in_planes, planes, stride)) 67 | self.in_planes = planes * block.expansion 68 | return nn.Sequential(*layers) 69 | 70 | def _upsample_add(self, x, y): 71 | '''Upsample and add two feature maps. 72 | 73 | Args: 74 | x: (Variable) top feature map to be upsampled. 75 | y: (Variable) lateral feature map. 76 | 77 | Returns: 78 | (Variable) added feature map. 79 | 80 | Note that in PyTorch, when the input size is odd, the upsampled feature map 81 | with `F.upsample(..., scale_factor=2, mode='nearest')` 82 | may not be equal to the lateral feature map size. 83 | 84 | e.g. 85 | original input size: [N,_,15,15] -> 86 | conv2d feature map size: [N,_,8,8] -> 87 | upsampled feature map size: [N,_,16,16] 88 | 89 | So we choose bilinear upsample which supports arbitrary output sizes.
90 | ''' 91 | _,_,H,W = y.size() 92 | return F.upsample(x, size=(H,W), mode='bilinear') + y 93 | 94 | def forward(self, x): 95 | # Bottom-up 96 | c1 = F.relu(self.bn1(self.conv1(x))) 97 | c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1) 98 | c2 = self.layer1(c1) 99 | c3 = self.layer2(c2) 100 | c4 = self.layer3(c3) 101 | c5 = self.layer4(c4) 102 | p6 = self.conv6(c5) 103 | p7 = self.conv7(F.relu(p6)) 104 | # Top-down 105 | p5 = self.toplayer(c5) 106 | p4 = self._upsample_add(p5, self.latlayer1(c4)) 107 | p4 = self.smooth1(p4) 108 | p3 = self._upsample_add(p4, self.latlayer2(c3)) 109 | p3 = self.smooth2(p3) 110 | return p3, p4, p5, p6, p7 111 | 112 | 113 | def FPN50(): 114 | return FPN(Bottleneck, [3,4,6,3]) 115 | 116 | def FPN101(): 117 | return FPN(Bottleneck, [3,4,23,3]) 118 | 119 | 120 | if __name__ == '__main__': 121 | from torch.autograd import Variable 122 | def test(): 123 | net = FPN50() 124 | fms = net(Variable(torch.randn(1, 3, 640, 640))) 125 | print(net) 126 | for fm in fms: 127 | print(fm.size()) 128 | 129 | test() 130 | -------------------------------------------------------------------------------- /models/retinaNet/get_state_dict.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import torch 4 | 5 | from models.retinaNet.retina import RetinaNet 6 | 7 | 8 | model_dir = '../preTrainedModels/resnet' 9 | params = torch.load(os.path.join(model_dir, 'resnet50-19c8e357.pth')) 10 | 11 | net = RetinaNet(num_classes=21) 12 | net.fpn.load_state_dict(params, strict=False) 13 | 14 | torch.nn.init.constant(net.cls_head[-1].bias, -math.log((1-0.01)/0.01))  # prior bias b = -log((1-pi)/pi), pi = 0.01, so the initial foreground probability is ~0.01 (RetinaNet paper, Sec. 4.1) 15 | torch.save(net.state_dict(), os.path.join(model_dir, 'retinanet_resnet50_voc.pth')) 16 | -------------------------------------------------------------------------------- /models/retinaNet/retina.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from models.retinaNet.fpn import FPN50 5 | 6 | 7 | class RetinaNet(nn.Module): 8 | num_anchors = 9 9 | 10 | def __init__(self, num_classes): 11 | super(RetinaNet, self).__init__() 12 | self.fpn = FPN50() 13 | self.num_classes = num_classes 14 | self.loc_head = self._make_head(self.num_anchors*4) 15 | self.cls_head = self._make_head(self.num_anchors*self.num_classes) 16 | 17 | def forward(self, x): 18 | loc_preds = [] 19 | cls_preds = [] 20 | fms = self.fpn(x) 21 | for fm in fms: 22 | loc_pred = self.loc_head(fm) 23 | cls_pred = self.cls_head(fm) 24 | loc_pred = loc_pred.permute(0,2,3,1).contiguous().view(x.size(0),-1,4) # [N, 9*4,H,W] -> [N,H,W, 9*4] -> [N,H*W*9, 4] 25 | cls_pred = cls_pred.permute(0,2,3,1).contiguous().view(x.size(0),-1,self.num_classes) # [N,9*NC,H,W] -> [N,H,W,9*NC] -> [N,H*W*9,NC] 26 | loc_preds.append(loc_pred) 27 | cls_preds.append(cls_pred) 28 | return torch.cat(loc_preds, 1), torch.cat(cls_preds, 1) 29 | 30 | def _make_head(self, out_planes): 31 | layers = [] 32 | for _ in range(4): 33 | layers.append(nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)) 34 | layers.append(nn.ReLU(True)) 35 | layers.append(nn.Conv2d(256, out_planes, kernel_size=3, stride=1, padding=1)) 36 | return nn.Sequential(*layers) 37 | 38 | 39 | if __name__ == '__main__': 40 | from torch.autograd import Variable 41 | def test(): 42 | net = RetinaNet(21) 43 | loc_preds, cls_preds = net(Variable(torch.randn(1, 3, 640, 640))) 44 | print(net) 45 | print(loc_preds.size(), cls_preds.size()) 46 | 47 | test() 48 | 
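A quick cross-check of the shapes above (a back-of-the-envelope sketch, not part of the repository code, assuming the 640x640 input used by the test() helpers): FPN50 emits P3-P7 with strides 8, 16, 32, 64 and 128, and RetinaNet predicts 9 anchors per feature-map position, so the concatenated outputs cover 9 * (80^2 + 40^2 + 20^2 + 10^2 + 5^2) = 76725 anchors per image:

# sketch only: anchor count for a 640x640 input, derived from the FPN strides above
fm_sizes = [640 // s for s in (8, 16, 32, 64, 128)]   # [80, 40, 20, 10, 5]
num_anchors = 9 * sum(w * w for w in fm_sizes)        # 9 * 8525 = 76725
print(num_anchors)  # loc_preds: [N, 76725, 4], cls_preds: [N, 76725, num_classes]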
-------------------------------------------------------------------------------- /retina.md: -------------------------------------------------------------------------------- 1 | # RetinaNet-PyTorch 2 | 3 | ## References 4 | * [kuangliu/pytorch-retinanet](https://github.com/kuangliu/pytorch-retinanet) 5 | * [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) 6 | 7 | ## Dependencies 8 | * PyTorch-0.3 9 | 10 | ### Focal Loss 11 | ![focal_loss](./imgs/focal_loss.png) 12 | * Focal Loss = weight * Cross-Entropy loss 13 | * delta is the gap between the predicted value and the ground truth 14 | * When gamma = 0, Focal Loss reduces to the plain Cross-Entropy loss 15 | 16 | ### Divergence early in training 17 | * Setting the class of anchors with IoU < 0.5 to 0 (background) makes training diverge within the first iterations 18 | * With lr = 0.01, the alpha of Focal Loss has to be set to 0.9999 to keep training from diverging (there are far too many background anchors) 19 | -------------------------------------------------------------------------------- /train_test/eval.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torch.nn.functional as F 4 | import torchvision.transforms as transforms 5 | 6 | from torch.autograd import Variable 7 | from dataloader.dataaugmentor import DataAugmentor 8 | from dataloader.dataset import DataSet 9 | from evaluations.voc_eval import voc_eval 10 | from models.retinaNet import RetinaNet 11 | from dataloader.en_decoder import RetinaBoxCoder 12 | 13 | from PIL import Image 14 | 15 | 16 | print('Loading model..') 17 | net = RetinaNet(num_classes=21) 18 | checkpoint = torch.load('../output/ckpt.pth') 19 | net.load_state_dict(checkpoint['net']) 20 | net.cuda() 21 | net.eval() 22 | box_coder = RetinaBoxCoder(imgSize=640) 23 | dataugmentor = DataAugmentor(imgSize=640) 24 | 25 | print('Preparing dataset..') 26 | def transform(img, boxes, labels): 27 | img, boxes = dataugmentor.resize(img, boxes) 28 | img = transforms.Compose([ 29 | transforms.ToTensor(), 30 | transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)) 31 | ])(img) 32 | return img, boxes, labels 33 | 34 | dataset = DataSet(root='../datasets/voc/VOC2007/JPEGImages', 35 | list_file='../datasets/voc//voc07_test.txt', 36 | transform=transform) 37 | 38 | dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, num_workers=2) 39 | 40 | pred_boxes = [] 41 | pred_labels = [] 42 | pred_scores = [] 43 | gt_boxes = [] 44 | gt_labels = [] 45 | 46 | with open('../datasets/voc/voc07_test_difficult.txt') as f: 47 | gt_difficults = [] 48 | for line in f.readlines(): 49 | line = line.strip().split() 50 | d = [int(x) for x in line[1:]] 51 | gt_difficults.append(d) 52 | 53 | def eval(net, dataset): 54 | for i, (inputs, box_targets, label_targets) in enumerate(dataloader): 55 | print('%d/%d' % (i, len(dataloader))) 56 | gt_boxes.append(box_targets.squeeze(0)) 57 | gt_labels.append(label_targets.squeeze(0)) 58 | 59 | loc_preds, cls_preds = net(Variable(inputs.cuda(), volatile=True)) 60 | box_preds, label_preds, score_preds = box_coder.decode( 61 | loc_preds.cpu().data.squeeze(), 62 | cls_preds.cpu().data.squeeze(), 63 | input_size=[640.0, 640.0]) 64 | 65 | pred_boxes.append(box_preds) 66 | pred_labels.append(label_preds) 67 | pred_scores.append(score_preds) 68 | 69 | print(voc_eval( 70 | pred_boxes, pred_labels, pred_scores, 71 | gt_boxes, gt_labels, gt_difficults, 72 | iou_thresh=0.5, use_07_metric=True)) 73 | 74 | eval(net, dataset) 75 | -------------------------------------------------------------------------------- /train_test/train_retinanet.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*-
2 | 3 | import os 4 | import random 5 | import argparse 6 | import torch 7 | import torch.optim as optim 8 | from torch.autograd import Variable 9 | import torch.optim.lr_scheduler as lr_scheduler 10 | from dataloader import get_data_loader 11 | from models import RetinaNet 12 | from models.loss import FocalLoss 13 | 14 | 15 | class Solver(object): 16 | def __init__(self, config, model, trainLoader, testLoader): 17 | self.model = model 18 | self.trainLoader = trainLoader 19 | self.testLoader = testLoader 20 | self.n_classes = config.n_classes 21 | self.use_cuda = config.use_cuda 22 | 23 | self.optimizer = optim.SGD(self.model.parameters(), lr=config.lr, momentum=0.9, weight_decay=1e-4) 24 | self.criterion = FocalLoss(num_classes=self.n_classes) 25 | self.lr_scheduler = lr_scheduler.MultiStepLR(self.optimizer, milestones=[6, 9], gamma=0.1) 26 | if self.use_cuda: 27 | self.model = self.model.cuda() 28 | self.criterion = self.criterion.cuda() 29 | 30 | self.n_epochs = config.n_epochs 31 | self.log_step = config.log_step 32 | self.out_path = config.out_path 33 | self.best_loss = float('inf') 34 | 35 | def train(self, epoch): 36 | print('\nEpoch: %d' % epoch) 37 | self.model.train() 38 | self.lr_scheduler.step() 39 | train_loss = 0 40 | for batch_idx, (inputs, loc_targets, cls_targets) in enumerate(self.trainLoader): 41 | # wrap in Variable first so the CPU path also works (PyTorch 0.3) 42 | inputs = Variable(inputs.cuda() if self.use_cuda else inputs) 43 | loc_targets = Variable(loc_targets.cuda() if self.use_cuda else loc_targets) 44 | cls_targets = Variable(cls_targets.cuda() if self.use_cuda else cls_targets) 45 | 46 | self.optimizer.zero_grad() 47 | loc_preds, cls_preds = self.model(inputs) 48 | loss = self.criterion(loc_preds, loc_targets, cls_preds, cls_targets, change_alpha=True) 49 | loss.backward() 50 | self.optimizer.step() 51 | 52 | train_loss += float(loss.data[0]) 53 | print('train_loss: %.3f | avg_loss: %.3f [%d/%d]' 54 | % (loss.data[0], train_loss / (batch_idx + 1), batch_idx + 1, len(self.trainLoader))) 55 | 56 | def test(self, epoch): 57 | print('\nTest') 58 | self.model.eval() 59 | test_loss = 0 60 | for batch_idx, (inputs, loc_targets, cls_targets) in enumerate(self.testLoader): 61 | # volatile: no gradients are needed during evaluation (PyTorch 0.3) 62 | inputs = Variable(inputs.cuda() if self.use_cuda else inputs, volatile=True) 63 | loc_targets = Variable(loc_targets.cuda() if self.use_cuda else loc_targets, volatile=True) 64 | cls_targets = Variable(cls_targets.cuda() if self.use_cuda else cls_targets, volatile=True) 65 | 66 | loc_preds, cls_preds = self.model(inputs) 67 | loss = self.criterion(loc_preds, loc_targets, cls_preds, cls_targets, change_alpha=False) 68 | test_loss += float(loss.data[0]) 69 | print('test_loss: %.3f | avg_loss: %.3f [%d/%d]' 70 | % (loss.data[0], test_loss / (batch_idx + 1), batch_idx + 1, len(self.testLoader))) 71 | 72 | # Save checkpoint 73 | test_loss /= len(self.testLoader) 74 | if test_loss < self.best_loss: 75 | print('Saving..') 76 | state = { 77 | 'net': self.model.state_dict(), 78 | 'loss': test_loss, 79 | 'epoch': epoch, 80 | } 81 | if not os.path.isdir(os.path.dirname(config.checkpoint)): 82 | os.mkdir(os.path.dirname(config.checkpoint)) 83 | torch.save(state, config.checkpoint) 84 | self.best_loss = test_loss 85 | 86 | def main(config): 87 | # use cuda ?
88 | if config.use_cuda: 89 | from torch.backends import cudnn 90 | cudnn.benchmark = True 91 | elif torch.cuda.is_available(): 92 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 93 | 94 | # seed 95 | if config.seed == 0: 96 | config.seed = random.randint(1, 10000) # fix seed 97 | print("Random Seed: ", config.seed) 98 | random.seed(config.seed) 99 | torch.manual_seed(config.seed) 100 | if config.use_cuda: 101 | torch.cuda.manual_seed_all(config.seed) 102 | 103 | # create directories if not exist 104 | if not os.path.exists(config.out_path): 105 | os.makedirs(config.out_path) 106 | 107 | # dataLoader 108 | trainLoader, testLoader = get_data_loader(config) 109 | print('train samples num: ', len(trainLoader), ' test samples num: ', len(testLoader)) 110 | 111 | # model net 112 | model = RetinaNet(num_classes=config.n_classes) 113 | print(model) 114 | if config.pretrained != '': 115 | model.load_state_dict(torch.load(config.pretrained)) 116 | print('load', config.pretrained) 117 | 118 | solver = Solver(config, model, trainLoader, testLoader) 119 | for epoch in range(config.n_epochs): 120 | solver.train(epoch) 121 | solver.test(epoch) 122 | 123 | 124 | if __name__ == '__main__': 125 | parser = argparse.ArgumentParser() 126 | 127 | parser.add_argument('--image-size', type=int, default=2) 128 | parser.add_argument('--n-epochs', type=int, default=12) 129 | parser.add_argument('--batch-size', type=int, default=4) 130 | parser.add_argument('--n-workers', type=int, default=4) 131 | parser.add_argument('--lr', type=float, default=0.01) 132 | parser.add_argument('--out-path', type=str, default='./output') 133 | parser.add_argument('--seed', type=int, default=666, help='random seed for all') 134 | parser.add_argument('--log-step', type=int, default=100) 135 | parser.add_argument('--use-cuda', type=bool, default=True, help='enables cuda') 136 | 137 | parser.add_argument('--train-root', type=str, default='../datasets/voc/VOC2007/JPEGImages') 138 | parser.add_argument('--train-label-file', type=str, default='../datasets/voc/voc07_trainval.txt') 139 | parser.add_argument('--test-root', type=str, default='../datasets/voc/VOC2007/JPEGImages') 140 | parser.add_argument('--test-label-file', type=str, default='../datasets/voc//voc07_test.txt') 141 | parser.add_argument('--n_classes', type=int, default=21) 142 | parser.add_argument('--mode', type=str, default='train', help='train, test') 143 | parser.add_argument('--model', type=str, default='RetinaNet', help='model') 144 | parser.add_argument('--pretrained', type=str, default='../preTrainedModels/retina/retinanet_resnet50_voc.pth') 145 | parser.add_argument('--checkpoint', type=str, default='../output/ckpt.pth') 146 | 147 | 148 | config = parser.parse_args() 149 | if config.use_cuda and not torch.cuda.is_available(): 150 | config.use_cuda = False 151 | print("WARNING: You have no CUDA device") 152 | 153 | args = vars(config) 154 | print('------------ Options -------------') 155 | for key, value in sorted(args.items()): 156 | print('%16.16s: %16.16s' % (str(key), str(value))) 157 | print('-------------- End ----------------') 158 | 159 | main(config) 160 | print('End!!') --------------------------------------------------------------------------------
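A closing note on the change_alpha flag that Solver.train passes to the criterion: FocalLoss in models/loss/focal_loss.py anneals its alpha from 0.9999 down to the paper's 0.25 as its internal iteration counter grows, which is what retina.md means by having to start with alpha = 0.9999 to keep training from diverging. A minimal sketch of that schedule (it simply re-evaluates the formula hard-coded in focal_loss.py):

# alpha schedule from models/loss/focal_loss.py, printed for a few iteration counts
for iteration in range(0, 6000, 1000):
    alpha = max(0.25, 1 - pow(10, iteration // 1000) / 10000)
    print(iteration, alpha)  # 0.9999, 0.999, 0.99, 0.9, 0.25, 0.25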