# Awesome-Temporal-Sentence-Grounding-in-Videos [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
A curated list of work on grounding natural language in videos and related areas. :-)

## Introduction

This area covers two main kinds of tasks:

- **Temporal Activity Localization by Language**: given a query describing an activity, find the start and end times of the corresponding action (event);
- **Spatio-temporal object referring by language**: given a query describing an object/person, find a sequence of consecutive bounding boxes in space and time (i.e., a tube).
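The two tasks differ mainly in their output. A minimal sketch of the input/output contract, assuming untrimmed videos and queries in seconds (type and field names are illustrative, not taken from any particular dataset or codebase):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TemporalSpan:
    """Output of temporal activity localization: *when* the event happens."""
    t_start: float  # seconds
    t_end: float    # seconds

@dataclass
class SpatioTemporalTube:
    """Output of spatio-temporal object referring: *where and when* it appears."""
    span: TemporalSpan
    boxes: List[Tuple[float, float, float, float]]  # one (x1, y1, x2, y2) per frame

# Both tasks share the same input, a video plus a sentence query, e.g.
# ("video_0001.mp4", "the person pours coffee into a cup") -> TemporalSpan(12.4, 18.9)
```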
## Format

Markdown format:

```markdown
- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
```

## Change Log

* 2019/12/16: Add CBP (AAAI 2020)

## Table of Contents

- [Papers](#papers)
  - [Survey](#survey)
  - [Before](#before) - [2017](#2017) - [2018](#2018) - [2019](#2019) - [2020](#2020)
- [Dataset](#dataset)
- [Benchmark Results](#benchmark-results)
- [Popular Implementations](#popular-implementations)
  - [PyTorch](#pytorch)
  - [TensorFlow](#tensorflow)
  - [Others](#others)

## Papers

### Survey

- None.

### Before

- [Grounded Language Learning from Video Described with Sentences](https://www.aclweb.org/anthology/P13-1006/) - H. Yu et al, `ACL 2013`.
- [Visual Semantic Search: Retrieving Videos via Complex Textual Queries]() - Dahua Lin et al, `CVPR 2014`.
- [Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework](https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9734) - R. Xu et al, `AAAI 2015`.
- [Unsupervised Alignment of Actions in Video with Text Descriptions](https://pdfs.semanticscholar.org/5893/7d427ff36e1470b18120245148355047e4ea.pdf) - Y. C. Song et al, `IJCAI 2016`.

### 2017

- [Localizing Moments in Video with Natural Language](https://arxiv.org/abs/1708.01641) - Lisa Anne Hendricks et al, `ICCV 2017`. [[code]]()
- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101) - Jiyang Gao et al, `ICCV 2017`. [[code]]()
- [Spatio-temporal Person Retrieval via Natural Language Queries](https://arxiv.org/abs/1704.07945) - M. Yamaguchi et al, `ICCV 2017`. [[code]]()
- [Attention-based Natural Language Person Retrieval]() - Tao Zhou et al, `CVPR 2017`.
- [Where to Play: Retrieval of Video Segments using Natural-Language Queries]() - S. Lee et al, `arxiv 2017`.

### 2018

- [Find and Focus: Retrieve and Localize Video Events with Natural Language Queries]() - Dian Shao et al, `ECCV 2018`.
- [Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos]() - B. Liu et al, `ECCV 2018`.
- [Temporally Grounding Natural Sentence in Video]() - J. Chen et al, `EMNLP 2018`.
- [Localizing Moments in Video with Temporal Language]() - Lisa Anne Hendricks et al, `EMNLP 2018`.
- [Object Referring in Videos with Language and Human Gaze](https://arxiv.org/abs/1801.01582) - A. B. Vasudevan et al, `CVPR 2018`. [[code]]()
- [Weakly Supervised Dense Event Captioning in Videos](https://arxiv.org/abs/1812.03849) - X. Duan et al, `NIPS 2018`.
- [Actor and Action Video Segmentation from a Sentence]() - Kirill Gavrilyuk et al, `CVPR 2018`.
- [Attentive Moment Retrieval in Videos](http://staff.ustc.edu.cn/~hexn/papers/sigir18-video-retrieval.pdf) - M. Liu et al, `SIGIR 2018`.

### 2019

- [Multilevel Language and Vision Integration for Text-to-Clip Retrieval]() - H. Xu et al, `AAAI 2019`. [[code]]()
- [Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos](https://arxiv.org/abs/1901.06829) - Dongliang He et al, `AAAI 2019`.
- [To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression](http://arxiv.org/abs/1804.07014) - Y. Yuan et al, `AAAI 2019`. [[code]](https://github.com/yytzsy/ABLR_code)
- [Semantic Proposal for Activity Localization in Videos via Sentence Query](http://yugangjiang.info/publication/19AAAI-actionlocalization.pdf) - S. Chen et al, `AAAI 2019`.
- [MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment](https://arxiv.org/abs/1812.00087) - Da Zhang et al, `CVPR 2019`.
- [Weakly Supervised Video Moment Retrieval From Text Queries]() - N. C. Mithun et al, `CVPR 2019`.
- [Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model]() - W. Wang et al, `CVPR 2019`.
- [Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos](https://arxiv.org/pdf/1910.14303.pdf) - Yitian Yuan et al, `NIPS 2019`. [[code]](https://github.com/yytzsy/SCDM)
- [WSLLN: Weakly Supervised Natural Language Localization Networks](https://arxiv.org/abs/1909.00239) - M. Gao et al, `EMNLP 2019`.
- [ExCL: Extractive Clip Localization Using Natural Language Descriptions](https://arxiv.org/abs/1904.02755) - S. Ghosh et al, `NAACL 2019`.
- [Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos](https://arxiv.org/abs/1906.02497) - Zhu Zhang et al, `SIGIR 2019`. [[code]](https://github.com/ikuinen/CMIN_moment_retrieval)
- [Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention](https://dl.acm.org/citation.cfm?id=3325019) - B. Jiang et al, `ICMR 2019`. [[code]](https://github.com/BonnieHuangxin/SLTA)
- [MAC: Mining Activity Concepts for Language-based Temporal Localization](https://arxiv.org/abs/1811.08925) - Runzhou Ge et al, `WACV 2019`. [[code]](https://github.com/runzhouge/MAC)
- [Temporal Localization of Moments in Video Collections with Natural Language](https://arxiv.org/abs/1907.12763v1) - V. Escorcia et al, `arxiv 2019`.
- [Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention](https://arxiv.org/abs/1908.07236) - C. R. Opazo et al, `arxiv 2019`.
- [Tripping through time: Efficient Localization of Activities in Videos](https://arxiv.org/abs/1904.09936) - Meera Hahn et al, `arxiv 2019`.
- [Related] [Localizing Unseen Activities in Video via Image Query](https://arxiv.org/abs/1906.12165) - Zhu Zhang et al, `IJCAI 2019`.

### 2020

- [Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction](https://arxiv.org/abs/1909.05010) - Jingwen Wang et al, `AAAI 2020`. [[code]](https://github.com/JaywongWang/CBP)
## Dataset

- [ActivityNet Captions](http://cs.stanford.edu/people/ranjaykrishna/densevid/)
- [Charades-STA]()
- [DiDeMo]()
- [TACoS](http://www.coli.uni-saarland.de/projects/smile/page.php?id=software)

## Benchmark Results
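All results below are "R@k IoU@m" scores: a query counts as correctly grounded if at least one of the model's top-k predicted segments overlaps the ground-truth segment with temporal IoU of at least m; the tables report these per-query scores averaged over the test set, in percent. The Method column abbreviates the approach type: proposal-based (PB), reinforcement learning (RL), and proposal-free (PF). A minimal sketch of the metric, assuming segments are (start, end) pairs in seconds (function names here are illustrative, not from any benchmark toolkit):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, gt, k, iou_thresh):
    """1.0 if any of the top-k ranked segments reaches the IoU threshold."""
    return float(any(temporal_iou(p, gt) >= iou_thresh for p in ranked_preds[:k]))

# Example: R@1 IoU@0.5 for one query. IoU((10, 20), (12, 21)) = 8/11 ≈ 0.73,
# so the top-ranked prediction counts as correct.
preds = [(10.0, 20.0), (35.0, 50.0)]  # ranked segment predictions
print(recall_at_k(preds, gt=(12.0, 21.0), k=1, iou_thresh=0.5))  # 1.0
```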
#### ActivityNet Captions

| | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MCN | 42.80 | 21.37 | 9.58 | - | - | - | - | - | PB |
| CTRL | 49.09 | 28.70 | 14.0 | - | - | - | - | - | PB |
| ACRN | 50.37 | 31.29 | 16.17 | - | - | - | - | - | PB |
| QSPN | - | 45.3 | 27.7 | 13.6 | - | 75.7 | 59.2 | 38.3 | PB |
| TGN | 70.06 | 45.51 | 28.47 | - | 79.10 | 57.32 | 44.20 | - | PB |
| SCDM | - | 54.80 | 36.75 | 19.86 | - | 77.29 | 64.99 | 41.53 | PB |
| CBP | - | 54.30 | 35.76 | 17.80 | - | 77.63 | 65.89 | 46.20 | PB |
| TripNet | - | 48.42 | 32.19 | 13.93 | - | - | - | - | RL |
| ABLR | 73.30 | 55.67 | 36.79 | - | - | - | - | - | RL |
| ExCL | - | 63.30 | 43.6 | 24.1 | - | - | - | - | PF |
| PFGA | 75.25 | 51.28 | 33.04 | 19.26 | - | - | - | - | PF |
| WSDEC-X (Weakly) | 62.7 | 42.0 | 23.3 | - | - | - | - | - | |
| WSLLN (Weakly) | 75.4 | 42.8 | 22.7 | - | - | - | - | - | |

#### Charades-STA

| | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CTRL | - | - | 23.63 | 8.89 | - | - | 58.92 | 29.52 | PB |
| ABLR | - | - | 24.36 | 9.01 | - | - | - | - | PB |
| SMRL | - | - | 24.36 | 11.17 | - | - | 61.25 | 32.08 | RL |
| ACL-K | - | - | 30.48 | 12.20 | - | - | 64.84 | 35.13 | PB |
| SAP | - | - | 27.42 | 13.36 | - | - | 66.37 | 38.15 | PB |
| QSPN | - | 54.7 | 35.6 | 15.8 | - | 95.8 | 79.4 | 45.4 | PB |
| MAN | - | - | 46.53 | 22.72 | - | - | 86.23 | 53.72 | PB |
| SCDM | - | - | 54.44 | 33.43 | - | - | 74.43 | 58.08 | PB |
| CBP | - | - | 36.80 | 18.87 | - | - | 70.94 | 50.19 | PB |
| TripNet | - | 51.33 | 36.61 | 14.50 | - | - | - | - | RL |
| ExCL | - | 65.1 | 44.1 | 23.3 | - | - | - | - | PF |
| PFGA | - | 67.53 | 52.02 | 33.74 | - | - | - | - | PF |

#### DiDeMo

| | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| TMN | 22.92 | - | - | - | 76.08 | - | - | - |
| MCN | 28.10 | - | - | - | 78.21 | - | - | - |
| TGN | 28.23 | - | - | - | 79.26 | - | - | - |
| MAN | 27.02 | - | - | - | 81.70 | - | - | - |
| WSLLN (Weakly) | 19.4 | - | - | - | 54.4 | - | - | - |

#### TACoS

| | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MCN | 2.62 | 1.64 | 1.25 | - | 2.88 | 1.82 | 1.01 | - | PB |
| CTRL | 24.32 | 18.32 | 13.30 | - | 48.73 | 36.69 | 25.42 | - | PB |
| TGN | 41.87 | 21.77 | 18.90 | - | 53.40 | 39.06 | 31.02 | - | PB |
| ACRN | 24.22 | 19.52 | 14.62 | - | 47.42 | 34.97 | 24.88 | - | PB |
| ACL-K | 31.64 | 24.17 | 20.01 | - | 57.85 | 42.15 | 30.66 | - | PB |
| SCDM | - | 26.11 | 21.17 | - | - | 40.16 | 32.18 | - | PB |
| CBP | - | 27.31 | 24.79 | 19.10 | - | 43.64 | 37.40 | 25.59 | PB |
| TripNet | - | 23.95 | 19.17 | 9.52 | - | - | - | - | RL |
| SMRL | 26.51 | 20.25 | 15.95 | - | 50.01 | 38.47 | 27.84 | - | RL |
| ABLR | 34.7 | 19.5 | 9.4 | - | - | - | - | - | RL |
| ExCL | - | 45.5 | 28.0 | 14.6 | - | - | - | - | PF |

## Popular Implementations

### PyTorch

- [ikuinen/CMIN_moment_retrieval](https://github.com/ikuinen/CMIN_moment_retrieval)

### TensorFlow

- [jiyanggao/TALL]()
- [runzhouge/MAC](https://github.com/runzhouge/MAC)
- [BonnieHuangxin/SLTA](https://github.com/BonnieHuangxin/SLTA)
- [yytzsy/ABLR_code](https://github.com/yytzsy/ABLR_code)
- [yytzsy/SCDM](https://github.com/yytzsy/SCDM)
- [JaywongWang/TGN](https://github.com/JaywongWang/TGN)
- [JaywongWang/CBP](https://github.com/JaywongWang/CBP)

### Others

- None.

## Licenses

[![CC0](http://i.creativecommons.org/p/zero/1.0/88x31.png)](http://creativecommons.org/publicdomain/zero/1.0/)

To the extent possible under law, [muketong](https://github.com/iworldtong) has waived all copyright and related or neighboring rights to this work.