├── imgs
│   ├── MANet.png
│   ├── STMN.png
│   └── STSN.png
├── STSN.md
├── MANet.md
├── STMN.md
└── README.md

/imgs/MANet.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhanghengdev/awesome-video-object-detection/HEAD/imgs/MANet.png
--------------------------------------------------------------------------------
/imgs/STMN.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhanghengdev/awesome-video-object-detection/HEAD/imgs/STMN.png
--------------------------------------------------------------------------------
/imgs/STSN.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhanghengdev/awesome-video-object-detection/HEAD/imgs/STSN.png
--------------------------------------------------------------------------------
/STSN.md:
--------------------------------------------------------------------------------
# Object Detection in Video with Spatiotemporal Sampling Networks

## Architecture

![](imgs/STSN.png)

## Summary

Using [deformable convolutions](https://arxiv.org/abs/1703.06211) across space and time (instead of optical flow) to leverage temporal information for object detection in video, i.e., using **deformable convolutions** to sample relevant features from nearby frames (27 frames in total) and **temporal aggregation** (per-pixel weighted summation) to generate the final feature maps for the detection network ([R-FCN](https://arxiv.org/abs/1605.06409)).
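As a rough illustration of the per-pixel weighted summation, here is a minimal PyTorch sketch under assumed shapes (not the authors' implementation: the nearby-frame features are assumed to be already sampled by the deformable convolutions, and softmax-normalized cosine similarity to the reference frame stands in for the learned weighting):

```python
import torch
import torch.nn.functional as F

def aggregate(feats: torch.Tensor, ref_idx: int) -> torch.Tensor:
    """Per-pixel weighted summation of aligned feature maps.

    feats: (T, C, H, W) nearby-frame features already sampled/aligned to the
    reference frame; returns the aggregated (C, H, W) map.
    """
    ref = feats[ref_idx]                                       # (C, H, W)
    # Per-pixel similarity between each frame and the reference frame.
    sim = F.cosine_similarity(feats, ref.unsqueeze(0), dim=1)  # (T, H, W)
    weights = torch.softmax(sim, dim=0)                        # normalize over time
    return (weights.unsqueeze(1) * feats).sum(dim=0)           # (C, H, W)

# Example: 27 nearby frames of 256-channel maps, reference frame in the middle.
out = aggregate(torch.randn(27, 256, 38, 38), ref_idx=13)
```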
--------------------------------------------------------------------------------
/MANet.md:
--------------------------------------------------------------------------------
# Fully Motion-Aware Network for Video Object Detection

## Architecture

![](imgs/MANet.png)

## Summary

Similar to [FGFA](https://arxiv.org/abs/1703.10025), but in addition to pixel-level feature calibration and aggregation, [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) proposes a **motion pattern reasoning module** to dynamically combine (with learnable soft weights) **pixel-level** and **instance-level** calibration according to the motion (optical flow from [FlowNet](https://arxiv.org/abs/1504.06852)). **Instance-level calibration** is achieved by regressing relative movements $(\Delta x, \Delta y, \Delta w, \Delta h)$ from the optical flow estimate at the proposal positions of the reference frame. The final feature maps for the detection network ([R-FCN](https://arxiv.org/abs/1605.06409)) are the aggregation of the calibrated feature maps of nearby frames (13 frames in total). Pixel-level calibration brings larger improvements for non-rigid motion, while instance-level calibration is better for rigid motion and occlusion cases.
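A rough sketch of how the regressed relative movements could be applied to the reference proposals (a hypothetical helper using the standard R-CNN box-delta parameterization; the head that regresses the deltas from pooled optical-flow features is omitted). The shifted proposals would then be used to pool instance features from the nearby frame:

```python
import torch

def shift_proposals(boxes: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Apply regressed relative movements to reference-frame proposals.

    boxes: (N, 4) proposals as (x1, y1, x2, y2);
    deltas: (N, 4) regressed (dx, dy, dw, dh).
    """
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    # Shift centers proportionally to box size; rescale width and height.
    cx = cx + deltas[:, 0] * w
    cy = cy + deltas[:, 1] * h
    w = w * torch.exp(deltas[:, 2])
    h = h * torch.exp(deltas[:, 3])
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=1)
```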
--------------------------------------------------------------------------------
/STMN.md:
--------------------------------------------------------------------------------
# Video Object Detection with an Aligned Spatial-Temporal Memory

## Architecture

![](imgs/STMN.png)

## Summary

Proposing (1) a novel **Spatial-Temporal Memory module (STMM)**, used as the recurrent computation unit, to **model** long-term temporal appearance and motion dynamics; and (2) a novel **MatchTrans module** to **align** the Spatial-Temporal Memory (feature maps) across frames.

Let $F_t$ be the appearance feature of the current frame and $M^{\rightarrow}_{t-1}$ the feature of all previous frames; the **STMM updates $M^{\rightarrow}_{t}$ from the inputs $F_t$ and $M^{\rightarrow}_{t-1}$**. Two STMMs are used to obtain feature maps from both directions, and the final feature maps are the concatenation of $M^{\rightarrow}_{t}$ and $M^{\leftarrow}_{t}$. The **MatchTrans module** computes transformation coefficients $\Gamma$ for position $(x,y)$ from $M_{t-1}$ to $M^{'}_{t-1}$ (where $'$ means matched to $F_t$) by **measuring the similarity** between $F_t(x,y)$ and $F_{t-1}(x+i,y+j)$. The transformation coefficients are then used to synthesize $M^{'}_{t-1}$ by **interpolating** the corresponding $M_{t-1}$ feature vectors: $M^{'}_{t-1}(x,y)=\sum_{i,j \in \{-k,\dots,k\}} \Gamma_{x,y}(i,j) \cdot M_{t-1}(x+i,y+j)$. The sequence length is $T=7$ for training and $T=11$ for testing. During testing, the detection results of the STMN detector and the initial R-FCN detector are ensembled.
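A minimal PyTorch sketch of this MatchTrans computation (assumed shapes; dot-product similarity with a softmax normalization stands in for the paper's normalization; not the authors' code):

```python
import torch
import torch.nn.functional as F

def match_trans(f_t, f_prev, m_prev, k=2):
    """Align M_{t-1} to frame t following the formula above.

    f_t, f_prev: (C, H, W) appearance features F_t and F_{t-1};
    m_prev: (D, H, W) memory M_{t-1}; returns the aligned M'_{t-1}.
    """
    C, H, W = f_t.shape
    kk = (2 * k + 1) ** 2
    # Gather every (2k+1)x(2k+1) neighborhood of F_{t-1} and M_{t-1}.
    f_nb = F.unfold(f_prev.unsqueeze(0), 2 * k + 1, padding=k).view(C, kk, H * W)
    m_nb = F.unfold(m_prev.unsqueeze(0), 2 * k + 1, padding=k).view(-1, kk, H * W)
    # Gamma_{x,y}(i,j): normalized similarity of F_t(x,y) with F_{t-1}(x+i,y+j).
    sim = (f_t.view(C, 1, H * W) * f_nb).sum(dim=0)   # ((2k+1)^2, H*W)
    gamma = torch.softmax(sim, dim=0)
    # M'_{t-1}(x,y) = sum_{i,j} Gamma_{x,y}(i,j) * M_{t-1}(x+i, y+j)
    return (gamma.unsqueeze(0) * m_nb).sum(dim=1).view(-1, H, W)

# Example: align a 512-d memory using 256-d appearance features on a 38x38 grid.
m_aligned = match_trans(torch.randn(256, 38, 38), torch.randn(256, 38, 38),
                        torch.randn(512, 38, 38))
```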
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome Video-Object-Detection

![Intro](https://github.com/ZHANGHeng19931123/seq_nms_yolo/raw/master/doc/intro1.gif "Intro")

This is a list of awesome articles about **object detection from video**.

## Datasets

### ImageNet VID Challenge
- **Site**: http://image-net.org/challenges/LSVRC/2017/#vid
- **Kaggle**: https://www.kaggle.com/account/login?returnUrl=%2Fc%2Fimagenet-object-detection-from-video-challenge

### VisDrone Challenge
- **Site**: http://aiskyeye.com/

## Paper list

### 2016

#### Seq-NMS for Video Object Detection

[[Arxiv]](https://arxiv.org/abs/1602.08465)

- **Date**: Feb 2016
- **Motivation**: Smoothing the final bounding box predictions across time.
- **Summary**: Constructing a temporal graph from overlapping bounding box detections across adjacent frames, and using dynamic programming to select the bounding box sequences with the highest overall detection score (see the toy sketch below).
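To make the box-level idea concrete, here is a toy Python sketch of the dynamic-programming selection step (illustrative only, not the paper's implementation; the paper additionally rescores the selected sequence and suppresses its boxes before repeating):

```python
# Link boxes in adjacent frames when IoU >= 0.5, then find the linked box
# sequence with the highest total detection score via dynamic programming.
# Assumes every frame contains at least one detection.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def best_sequence(frames):
    """frames: list over time of [(box, score), ...]; returns the best box path."""
    # best[t][i]: (highest total score of a sequence ending at box i, backpointer)
    best = [[(score, None) for _, score in frames[0]]]
    for t in range(1, len(frames)):
        cur = []
        for box, score in frames[t]:
            # Candidate predecessors: previous-frame boxes overlapping this one.
            cands = [(best[t - 1][j][0], j)
                     for j, (prev_box, _) in enumerate(frames[t - 1])
                     if iou(box, prev_box) >= 0.5]
            prev_score, ptr = max(cands) if cands else (0.0, None)
            cur.append((prev_score + score, ptr))
        best.append(cur)
    # Backtrack from the highest-scoring end point over all frames.
    t = max(range(len(best)), key=lambda u: max(s for s, _ in best[u]))
    i = max(range(len(best[t])), key=lambda j: best[t][j][0])
    path = []
    while i is not None:
        path.append(frames[t][i][0])
        i = best[t][i][1]
        t -= 1
    return path[::-1]
```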
#### T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos

[[Arxiv]](https://arxiv.org/abs/1604.02532) [[Code]](https://github.com/myfavouritekk/T-CNN)

- **Date**: Apr 2016
- **Summary**: Using a video object detection pipeline that first predicts optical flow, then propagates image-level predictions according to the flow, and finally uses a tracking algorithm to select temporally consistent high-confidence detections.
- **Performance**: 73.8% mAP on ImageNet VID validation.

#### Object Detection from Video Tubelets with Convolutional Neural Networks

[[Arxiv]](https://arxiv.org/abs/1604.04053) [[Code]](https://github.com/myfavouritekk/vdetlib)

- **Date**: Apr 2016

#### Deep Feature Flow for Video Recognition

[[Arxiv]](https://arxiv.org/abs/1611.07715) [[Code]](https://github.com/msracver/Deep-Feature-Flow)

- **Date**: Nov 2016
- **Performance**: 73.0% mAP on ImageNet VID validation at 29 fps on a Titan X GPU.

### 2017

#### Object Detection in Videos with Tubelet Proposal Networks

[[Arxiv]](https://arxiv.org/abs/1702.06355)

- **Date**: Feb 2017

#### Flow-Guided Feature Aggregation for Video Object Detection

[[Arxiv]](https://arxiv.org/abs/1703.10025) [[Code]](https://github.com/msracver/Flow-Guided-Feature-Aggregation)

- **Date**: Mar 2017
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 76.3% mAP at 1.4 fps, or 78.4% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)) at 1.1 fps, on ImageNet VID validation on a Titan X GPU.

#### Detect to Track and Track to Detect

[[Arxiv]](https://arxiv.org/abs/1710.03958) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md) [[Code]](https://github.com/feichtenhofer/Detect-Track)

- **Date**: Oct 2017
- **Motivation**: Smoothing the final bounding box predictions across time.
- **Summary**: Proposing a ConvNet architecture that solves detection and tracking jointly, and applying a Viterbi algorithm to link the detections across time.
- **Performance**: 79.8% mAP on ImageNet VID validation.

#### Towards High Performance Video Object Detection

[[Arxiv]](https://arxiv.org/abs/1711.11577)

- **Date**: Nov 2017
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 78.6% mAP on ImageNet VID validation at 13 fps on a Titan X GPU.

#### Video Object Detection with an Aligned Spatial-Temporal Memory

[[Arxiv]](https://arxiv.org/abs/1712.06317) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/STMN.md) [[Code]](http://fanyix.cs.ucdavis.edu/project/stmn/project.html) [[Demo]](https://www.youtube.com/watch?v=Vs3LqY1s9GY)

- **Date**: Dec 2017
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 80.5% mAP on ImageNet VID validation.

### 2018

#### Object Detection in Videos by High Quality Object Linking

[[Arxiv]](https://arxiv.org/abs/1801.09823)

- **Date**: Jan 2018

#### Towards High Performance Video Object Detection for Mobiles

[[Arxiv]](https://arxiv.org/abs/1804.05830)

- **Date**: Apr 2018
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 60.2% mAP on ImageNet VID validation at 25.6 fps on mobiles.

#### Optimizing Video Object Detection via a Scale-Time Lattice

[[Arxiv]](https://arxiv.org/abs/1804.05472) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md) [[Code]](https://github.com/hellock/scale-time-lattice)

- **Date**: Apr 2018
- **Performance**: 79.4% mAP at 20 fps, or 79.0% at 62 fps, on ImageNet VID validation on a Titan X GPU.

#### Object Detection in Video with Spatiotemporal Sampling Networks

[[Arxiv]](https://arxiv.org/abs/1803.05549) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/STSN.md)

- **Date**: Mar 2018
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 78.9% mAP, or 80.4% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)), on ImageNet VID validation.

#### Fully Motion-Aware Network for Video Object Detection

[[Paper]](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/MANet.md)

- **Date**: Sep 2018
- **Motivation**: Producing powerful spatiotemporal features.
- **Performance**: 78.1% mAP, or 80.3% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)), on ImageNet VID validation.

#### Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

[[Arxiv]](https://arxiv.org/abs/1811.11167) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md)

- **Date**: Nov 2018
- **Motivation**: Smoothing the final bounding box predictions across time.
- **Performance**: 83.5% mAP with [FGFA](https://arxiv.org/abs/1703.10025) and [Deformable ConvNets v2](https://arxiv.org/abs/1811.11168) on ImageNet VID validation.

### 2019

#### AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling

[[arXiv]](https://arxiv.org/pdf/1902.02910.pdf)

- **Date**: Feb 2019
- **Motivation**: Adaptively rescaling the input image resolution to improve both accuracy and speed for video object detection.
- **Performance**: 75.5% mAP on ImageNet VID validation with multi-scale training at four scales (600, 480, 360, 240).

#### Improving Video Object Detection by Seq-Bbox Matching

[[pdf]](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0)

- **Date**: Feb 2019
- **Motivation**: Smoothing the final bounding box predictions across time (a box-level method).
- **Performance**: 80.9% mAP (offline detection) and 78.2% mAP (online detection), both at 38 fps on a Titan X GPU.

## Comparison table

| Paper | Date | Base detector | Backbone | Tracking? | Optical flow? | Online? | mAP (%) | FPS (Titan X) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [Seq-NMS](https://arxiv.org/abs/1602.08465) | Feb 2016 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | no | no | 76.8 | 2.3 |
| [T-CNN](https://arxiv.org/abs/1604.02532) | Apr 2016 | RCNN | DeepIDNet+CRAFT | yes | no | no | 73.8 | - |
| [DFF](https://arxiv.org/abs/1611.07715) | Nov 2016 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 73.0 | 29 |
| [TPN](https://arxiv.org/abs/1702.06355) | Feb 2017 | TPN | GoogLeNet | yes | no | no | 68.4 | - |
| [FGFA](https://arxiv.org/abs/1703.10025) | Mar 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 76.3 | 1.4 |
| [FGFA](https://arxiv.org/abs/1703.10025) + [Seq-NMS](https://arxiv.org/abs/1602.08465) | Mar 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | no | 78.4 | 1.14 |
| [D&T](https://arxiv.org/abs/1710.03958) | Oct 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) (15 anchors) | ResNet101 | yes | no | no | 79.8 | 7.09 |
| [STMN](https://arxiv.org/abs/1712.06317) | Dec 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | no | no | 80.5 | - |
| [Scale-time-lattice](https://arxiv.org/abs/1804.05472) | Apr 2018 | [Faster RCNN](https://arxiv.org/abs/1506.01497) (15 anchors) | ResNet101 | no | no | no | 79.6 | 20 |
| [Scale-time-lattice](https://arxiv.org/abs/1804.05472) | Apr 2018 | [Faster RCNN](https://arxiv.org/abs/1506.01497) (15 anchors) | ResNet101 | no | no | no | 79.0 | **62** |
| [SSN](https://arxiv.org/abs/1803.05549) (per-frame baseline for STSN) | Mar 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | yes | 76.0 | - |
| [STSN](https://arxiv.org/abs/1803.05549) | Mar 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | yes | 78.9 | - |
| [STSN](https://arxiv.org/abs/1803.05549) + [Seq-NMS](https://arxiv.org/abs/1602.08465) | Mar 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | no | 80.4 | - |
| [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) | Sep 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 78.1 | 5 |
| [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) + [Seq-NMS](https://arxiv.org/abs/1602.08465) | Sep 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | no | 80.3 | - |
| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | 78.1 | - |
| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167) + [DCNv2](https://arxiv.org/abs/1811.11168) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | 82.0 | - |
| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167) + [DCNv2](https://arxiv.org/abs/1811.11168) + [FGFA](https://arxiv.org/abs/1703.10025) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | **83.5** | - |
| [Seq-Bbox Matching](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0) | Feb 2019 | [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf) | darknet53 | no | no | no | 80.9 | 38 |
| [Seq-Bbox Matching](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0) | Feb 2019 | [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf) | darknet53 | no | no | yes | 78.2 | 38 |
--------------------------------------------------------------------------------