# Deep Learning for Video Retrieval by Natural Language

Videos are everywhere. Video retrieval, i.e., finding videos that meet the information need of a specific user, is important for a wide range of applications including communication, education, entertainment, business, security, etc. Among the many ways of expressing an information need, natural-language text is the most intuitive way to start a retrieval process. Consider, for instance, finding video shots showing ``a person in front of a blackboard talking or writing in a classroom``. Such a query can easily be submitted to a video retrieval system, by typing or via speech recognition. Given a video as a sequence of frames and a query as a sequence of words, a fundamental problem in video retrieval by natural language is ***how to properly associate visual and linguistic information presented in sequential order***.

This page maintains an (incomplete) list of state-of-the-art open-source methods and datasets, with the [TRECVID](https://trecvid.nist.gov/) Ad-hoc Video Search (AVS) benchmark evaluation as the test bed.
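To make the matching problem concrete, here is a minimal common-space retrieval sketch in PyTorch: the video and the query are each encoded into a shared embedding space, and retrieval reduces to ranking videos by cosine similarity. This is only an illustration under simplifying assumptions (mean pooling on both sides; the `DualEncoder` class, dimensions, and toy data are hypothetical), not the released W2VV++ or Dual Encoding implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy two-branch model: project video and text into one common space."""
    def __init__(self, feat_dim=2048, vocab_size=10000, word_dim=300, embed_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.vis_proj = nn.Linear(feat_dim, embed_dim)   # video branch
        self.txt_proj = nn.Linear(word_dim, embed_dim)   # query branch

    def encode_video(self, frame_feats):
        # frame_feats: (n_frames, feat_dim) pre-extracted CNN frame features
        pooled = frame_feats.mean(dim=0)                 # mean-pool over time
        return F.normalize(self.vis_proj(pooled), dim=-1)

    def encode_text(self, word_ids):
        # word_ids: (n_words,) token indices of the query
        pooled = self.word_emb(word_ids).mean(dim=0)     # mean-pool over words
        return F.normalize(self.txt_proj(pooled), dim=-1)

model = DualEncoder()
videos = [torch.randn(30, 2048) for _ in range(5)]       # 5 toy videos, 30 frames each
query = torch.randint(0, 10000, (8,))                    # a toy 8-word query
q = model.encode_text(query)
scores = torch.stack([model.encode_video(v) @ q for v in videos])
ranking = scores.argsort(descending=True)                # best-matching videos first
```

In the open-source methods listed below, the mean-pooling encoders above are replaced by stronger ones (e.g., bag-of-words, word2vec, and GRU branches in W2VV++; mean-pooling, biGRU, and CNN levels in Dual Encoding), trained end-to-end with a ranking loss.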
## Open-source methods
* [W2VV, T-MM'18](https://github.com/danieljf24/w2vv)
* [W2VV++, ACMMM'19](https://github.com/li-xirong/w2vvpp)
* [Dual Encoding, CVPR'19](https://github.com/danieljf24/dual_encoding)

## Datasets

* [Datasets for AVS](https://github.com/li-xirong/avs)


## Leaderboard

All numbers are inferred average precision (infAP), the official TRECVID AVS performance metric; higher is better.

### TRECVID 2016 AVS

| Method | infAP |
|:-- | ---:|
| Dual Encoding (Dong et al. CVPR'19) | 0.159 |
| W2VV++ (Li et al. MM'19) | 0.151 |
| VSE++ (Faghri et al. BMVC'18, *produced by Li et al. MM'19*) | 0.123 |
| VideoStory (Habibian et al. T-PAMI'17) | 0.087 |
| Markatopoulou et al. ICMR'17 | 0.064 |
| Le et al. TRECVID'16 | 0.054 |
| Markatopoulou et al. TRECVID'16 | 0.051 |
| W2VV (Dong et al. T-MM'18, *produced by Li et al. MM'19*) | 0.050 |


### TRECVID 2017 AVS

| Method | infAP |
|:-- | ---:|
| W2VV++ (Li et al. MM'19) | 0.213 |
| Dual Encoding (Dong et al. CVPR'19) | 0.208 |
| Snoek et al. TRECVID'17 | 0.206 |
| Ueki et al. TRECVID'17 | 0.159 |
| VSE++ (Faghri et al. BMVC'18, *produced by Li et al. MM'19*) | 0.154 |
| VideoStory (Habibian et al. T-PAMI'17) | 0.150 |
| Nguyen et al. TRECVID'17 | 0.120 |
| W2VV (Dong et al. T-MM'18, *produced by Li et al. MM'19*) | 0.081 |

### TRECVID 2018 AVS

| Method | infAP |
|:-- | ---:|
| Dual Encoding (Dong et al. CVPR'19) | 0.126 |
| Li et al. TRECVID'18 | 0.121 |
| W2VV++ (Li et al. MM'19) | 0.106 |
| Huang et al. TRECVID'18 | 0.087 |
| Bastan et al. TRECVID'18 | 0.082 |
| VSE++ (Faghri et al. BMVC'18, *produced by Li et al. MM'19*) | 0.074 |

### TRECVID 2019 AVS

work in progress ...

## References

+ [Bastan et al. TRECVID'18] M. Bastan, X. Shi, J. Gu, Z. Heng, C. Zhuo, D. Sng, and A. Kot. NTU ROSE Lab at TRECVID 2018: Ad-hoc Video Search and Video to Text. TRECVID 2018
+ [Dong et al. CVPR'19] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang. Dual Encoding for Zero-Example Video Retrieval. CVPR 2019
+ [Dong et al. T-MM'18] J. Dong, X. Li, and C. Snoek. Predicting Visual Features from Text for Image and Video Caption Retrieval. T-MM 20, 12 (2018), 3377–3388
+ [Faghri et al. BMVC'18] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC 2018
+ [Habibian et al. T-PAMI'17] A. Habibian, T. Mensink, and C. G. M. Snoek. Video2vec Embeddings Recognize Events When Examples Are Scarce. T-PAMI 39, 10 (2017), 2089–2103
+ [Huang et al. TRECVID'18] P.-Y. Huang, J. Liang, V. Vaibhav, X. Chang, and A. Hauptmann. Informedia@TRECVID 2018: Ad-hoc Video Search with Discrete and Continuous Representations. TRECVID 2018
+ [Le et al. TRECVID'16] D.-D. Le, S. Phan, V.-T. Nguyen, B. Renoust, T. Nguyen, V.-N. Hoang, T. Ngo, M.-T. Tran, Y. Watanabe, M. Klinkigt, et al. NII-HITACHI-UIT at TRECVID 2016. TRECVID 2016
+ [Li et al. MM'19] X. Li, C. Xu, G. Yang, Z. Chen, and J. Dong. W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACMMM 2019
+ [Li et al. TRECVID'18] X. Li, J. Dong, C. Xu, J. Cao, X. Wang, and G. Yang. Renmin University of China and Zhejiang Gongshang University at TRECVID 2018: Deep Cross-Modal Embeddings for Video-Text Retrieval. TRECVID 2018
+ [Markatopoulou et al. ICMR'17] F. Markatopoulou, D. Galanopoulos, V. Mezaris, and I. Patras. Query and Keyframe Representations for Ad-hoc Video Search. ICMR 2017
+ [Markatopoulou et al. TRECVID'16] F. Markatopoulou, A. Moumtzidou, D. Galanopoulos, T. Mironidis, V. Kaltsa, A. Ioannidou, S. Symeonidis, K. Avgerinakis, S. Andreadis, et al. ITI-CERTH Participation in TRECVID 2016. TRECVID 2016
+ [Snoek et al. TRECVID'17] C. G. M. Snoek, X. Li, C. Xu, and D. C. Koelma. University of Amsterdam and Renmin University at TRECVID 2017: Searching Video, Detecting Events and Describing Video. TRECVID 2017