├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Zheng-1994

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Image-Text Matching Summary
Summary of Related Research on Image-Text Matching

- [Papers](#Papers): [Conference](#Conference), [Journal](#Journal)
- [Datasets](#Datasets): [Flickr30K](#Flickr30K), [MS-COCO](#MS-COCO)
- [Performance](#Performance)

## Papers

### Conference

#### 2023

- `[2023 CVPR]` **Learning Semantic Relationship among Instances for Image-Text Matching (HREM)**
*Zheren Fu, Zhendong Mao, Yan Song, Yongdong Zhang*
[[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Pan_Fine-Grained_Image-Text_Matching_by_Cross-Modal_Hard_Aligning_Network_CVPR_2023_paper.pdf)
[[code]](https://github.com/CrossmodalGroup/HREM)

- `[2023 CVPR]` **Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network (CHAN)**
*Zhengxin Pan, Fangyu Wu, Bailing Zhang*
[[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Pan_Fine-Grained_Image-Text_Matching_by_Cross-Modal_Hard_Aligning_Network_CVPR_2023_paper.pdf)
[[code]](https://github.com/ppanzx/CHAN)

- `[2023 CVPR]` **BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency (BiCro)**
*Shuo Yang, Zhaopan Xu, Kai Wang, Yang You, Hongxun Yao, Tongliang Liu, Min Xu*
[[paper]](https://arxiv.org/pdf/2303.12419)
[[code]](https://github.com/xu5zhao/BiCro)

- `[2023 CVPR]` **Improving Cross-Modal Retrieval with Set of Diverse Embeddings**
*Dongwon Kim, Namyup Kim, Suha Kwak*
[[paper]](https://arxiv.org/pdf/2211.16761)

- `[2023 SIGIR]` **Learnable Pillar-based Re-ranking for Image-Text Retrieval**
*Leigang Qu, Meng Liu, Wenjie Wang, Zhedong Zheng, Liqiang Nie, Tat-Seng Chua*
[[paper]](https://arxiv.org/pdf/2304.12570)

- `[2023 SIGIR]` **Rethinking Benchmarks for Cross-modal Image-text Retrieval**
*Weijing Chen, Linli Yao, Qin Jin*
[[paper]](https://arxiv.org/pdf/2304.10824)

- `[2023 WACV]` **Dissecting Deep Metric Learning Losses for Image-Text Retrieval**
*Hong Xuan, Xi (Stephen) Chen*
[[paper]](https://openaccess.thecvf.com/content/WACV2023/papers/Xuan_Dissecting_Deep_Metric_Learning_Losses_for_Image-Text_Retrieval_WACV_2023_paper.pdf)

- `[2023 WACV]` **Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval (CMSEI)**
*Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Joemon M. Jose*
[[paper]](https://arxiv.org/pdf/2210.08908)

- `[2023 WACV]` **More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching**
*Yuxiao Chen, Jianbo Yuan, Long Zhao, Tianlang Chen, Rui Luo, Larry Davis, Dimitris N. Metaxas*
[[paper]](https://arxiv.org/pdf/2105.09597)

#### 2022

- `[2022 ECCV]` **CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval (CODER)**
*Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li, Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang*
[[paper]](https://arxiv.org/pdf/2208.09843.pdf)

- `[2022 CVPR]` **Negative-Aware Attention Framework for Image-Text Matching (NAAF)**
*Kun Zhang, Zhendong Mao, Quan Wang, Yongdong Zhang*
[[paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhang_Negative-Aware_Attention_Framework_for_Image-Text_Matching_CVPR_2022_paper.pdf)
[[code]](https://github.com/CrossmodalGroup/NAAF)

- `[2022 AAAI]` **Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching (CMCAN)**
*Huatian Zhang, Zhendong Mao, Kun Zhang, Yongdong Zhang*
[[paper]](https://www.aaai.org/AAAI22Papers/AAAI-2029.ZhangH.pdf)
[[code]](https://github.com/CrossmodalGroup/CMCAN)

- `[2022 IJCAI]` **Multi-View Visual Semantic Embedding (MV-VSE)**
*Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Xijun Xue*
[[paper]](https://www.ijcai.org/proceedings/2022/0158.pdf)

- `[2022 IJCAI]` **Image-text Retrieval: A Survey on Recent Research and Development**
*Min Cao, Shiping Li, Juntao Li, Liqiang Nie, Min Zhang*
[[paper]](https://arxiv.org/pdf/2203.14713)

- `[2022 SIGIR]` **Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval**
*Jun Rao, Fei Wang, Liang Ding, Shuhan Qi, Yibing Zhan, Weifeng Liu, Dacheng Tao*
[[paper]](https://arxiv.org/pdf/2203.03853)
[[code]](https://github.com/WangFei-2019/Image-text-Retrieval)

#### 2021

- `[2021 ICCV]` **Wasserstein Coupled Graph Learning for Cross-Modal Retrieval (WCGL)**
*Yun Wang, Tong Zhang, Xueya Zhang, Zhen Cui, Yuge Huang, Pengcheng Shen, Shaoxin Li, Jian Yang*
[[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_Wasserstein_Coupled_Graph_Learning_for_Cross-Modal_Retrieval_ICCV_2021_paper.pdf)

- `[2021 CVPR]` **Discrete-continuous Action Space Policy Gradient-based Attention for Image-Text Matching**
*Shiyang Yan, Li Yu, Yuan Xie*
[[paper]](https://arxiv.org/abs/2104.10406)
[[code]](https://github.com/Shiyang-Yan/Discrete-continous-PG-for-Retrieval)

- `[2021 CVPR]` **Learning the Best Pooling Strategy for Visual Semantic Embedding (GPO)**
*Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang*
[[paper]](https://arxiv.org/pdf/2011.04305)
[[code]](https://github.com/woodfrog/vse_infty)

- `[2021 AAAI]` **Similarity Reasoning and Filtration for Image-Text Matching (SGRAF)**
*Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu*
[[paper]](https://arxiv.org/pdf/2101.01368)
[[code]](https://github.com/Paranioar/SGRAF)

#### 2020

- `[2020 CVPR]` **Graph Structured Network for Image-Text Matching (GSMN)**
*Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang*
[[paper]](http://openaccess.thecvf.com/content_CVPR_2020/papers/Liu_Graph_Structured_Network_for_Image-Text_Matching_CVPR_2020_paper.pdf)
[[code]](https://github.com/CrossmodalGroup/GSMN)

- `[2020 CVPR]` **IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval (IMRAM)**
*Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han*
[[paper]](https://arxiv.org/abs/2003.03772)
[[code]](https://github.com/HuiChen24/IMRAM)

- `[2020 CVPR]` **Context-Aware Attention Network for Image-Text Retrieval (CAAN)**
*Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li*
[[paper]](http://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Context-Aware_Attention_Network_for_Image-Text_Retrieval_CVPR_2020_paper.pdf)

- `[2020 CVPR]` **Multi-Modality Cross Attention Network for Image and Sentence Matching (MMCA)**
*Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu*
[[paper]](http://openaccess.thecvf.com/content_CVPR_2020/papers/Wei_Multi-Modality_Cross_Attention_Network_for_Image_and_Sentence_Matching_CVPR_2020_paper.pdf)

- `[2020 CVPR]` **Universal Weighting Metric Learning for Cross-Modal Matching**
*Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen*
[[paper]](http://openaccess.thecvf.com/content_CVPR_2020/papers/Wei_Universal_Weighting_Metric_Learning_for_Cross-Modal_Matching_CVPR_2020_paper.pdf)
[[code]](https://github.com/wayne980/PolyLoss)

- `[2020 ECCV]` **Consensus-Aware Visual-Semantic Embedding for Image-Text Matching (CVSE)**
*Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma*
[[paper]](https://arxiv.org/pdf/2007.08883)
[[code]](https://github.com/BruceW91/CVSE)

- `[2020 ECCV]` **Adaptive Offline Quintuplet Loss for Image-Text Matching (AOQ)**
*Tianlang Chen, Jiajun Deng, Jiebo Luo*
[[paper]](https://arxiv.org/abs/2003.03669)
[[code]](https://github.com/sunnychencool/AOQ)

#### 2019

- `[2019 ICCV]` **Visual Semantic Reasoning for Image-Text Matching (VSRN)**
*Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu*
[[paper]](https://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Visual_Semantic_Reasoning_for_Image-Text_Matching_ICCV_2019_paper.pdf)
[[code]](https://github.com/KunpengLi1994/VSRN)

- `[2019 ICCV]` **CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval (CAMP)**
*Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao*
[[paper]](https://arxiv.org/abs/1909.05506)
[[code]](https://github.com/ZihaoWang-CV/CAMP_iccv19)

- `[2019 ICCV]` **Saliency-Guided Attention Network for Image-Sentence Matching (SAN)**
*Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang*
[[paper]](https://arxiv.org/abs/1904.09471)
[[code]](https://github.com/HabbakukWang1103/SAN)

- `[2019 ICCV]` **Language-Agnostic Visual-Semantic Embeddings (LIWE)**
*Jonatas Wehrmann, Maurício Armani Lopes, Douglas Souza, Rodrigo Barros*
[[paper]](https://openaccess.thecvf.com/content_ICCV_2019/papers/Wehrmann_Language-Agnostic_Visual-Semantic_Embeddings_ICCV_2019_paper.pdf)
[[code]](https://github.com/jwehrmann/lavse)

- `[2019 CVPR]` **Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (PVSE)**
*Yale Song, Mohammad Soleymani*
[[paper]](https://arxiv.org/abs/1906.04402)
[[code]](https://github.com/yalesong/pvse)

- `[2019 ACM MM]` **Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching (BFAN)**
*Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, Yongdong Zhang*
[[paper]](https://arxiv.org/abs/1909.11416)
[[code]](https://github.com/CrossmodalGroup/BFAN)

- `[2019 IJCAI]` **Position Focused Attention Network for Image-Text Matching (PFAN)**
*Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan*
[[paper]](https://arxiv.org/pdf/1907.09748)
[[code]](https://github.com/HaoYang0123/Position-Focused-Attention-Network)

#### 2018

- `[2018 ECCV]` **Stacked Cross Attention for Image-Text Matching (SCAN)**
*Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He*
[[paper]](http://openaccess.thecvf.com/content_ECCV_2018/papers/Kuang-Huei_Lee_Stacked_Cross_Attention_ECCV_2018_paper.pdf)
[[code]](https://github.com/kuanghuei/SCAN)

- `[2018 BMVC]` **VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (VSE++)**
*Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler*
[[paper]](https://arxiv.org/pdf/1707.05612)
[[code]](https://github.com/fartashf/vsepp)

### Journal

#### 2023

- `[2023 TPAMI]` **Cross-Modal Retrieval with Partially Mismatched Pairs (RCL)**
*Peng Hu, Zhenyu Huang, Dezhong Peng, Xu Wang, Xi Peng*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10050111)
[[code]](https://github.com/penghu-cs/RCL)

- `[2023 TIP]` **Plug-and-Play Regulators for Image-Text Matching (RCAR)**
*Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu*
[[paper]](https://arxiv.org/pdf/2303.13371)
[[code]](https://github.com/Paranioar/RCAR)

- `[2023 TMM]` **Integrating Language Guidance into Image-Text Matching for Correcting False Negatives (LG)**
*Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Zhongtian Du*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10081045)
[[code]](https://github.com/AAA-Zheng/LG_ITM)

- `[2023 TMM]` **Inter-Intra Modal Representation Augmentation with DCT-Transformer Adversarial Network for Image-Text Matching (DTAN)**
*Chen Chen, Dan Wang, Bin Song, Hao Tan*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10041445)

#### 2022

- `[2022 TIP]` **Adaptive Latent Graph Representation Learning for Image-Text Matching**
*Mengxiao Tian, Xinxiao Wu, Yunde Jia*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9991857)

- `[2022 TMM]` **Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching (UARDA)**
*Kun Zhang, Zhendong Mao, Anan Liu, Yongdong Zhang*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9676463)

- `[2022 TCSVT]` **Hierarchical Feature Aggregation Based on Transformer for Image-Text Matching (HAT)**
*Xinfeng Dong, Huaxiang Zhang, Lei Zhu, Liqiang Nie, Li Liu*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9745936)

#### 2020

- `[2020 TOMM]` **Dual-path Convolutional Image-Text Embeddings with Instance Loss**
*Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, YiDong Shen*
[[paper]](https://arxiv.org/pdf/1711.05535)
[[code]](https://github.com/layumi/Image-Text-Embedding)

- `[2020 TNNLS]` **Cross-Modal Attention With Semantic Consistence for Image-Text Matching (CASC)**
*Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, Heng Tao Shen*
[[paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8994196)

## Datasets

### Flickr30K
`[2014 TACL]` **From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions**
*Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier*
[[paper]](https://aclanthology.org/Q14-1006.pdf)

### MS-COCO
`[2014 ECCV]` **Microsoft COCO: Common Objects in Context**
*Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick*
[[paper]](https://projet.liris.cnrs.fr/imagine/pub/proceedings/ECCV-2014/papers/8693/86930740.pdf)

## Performance

### Performance on Flickr30K
I2T = Image-to-Text retrieval, T2I = Text-to-Image retrieval; RSUM is the sum of the six recall scores.

| Model | Reference | Image Encoder | Text Encoder | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | RSUM |
| :---- | :-------- | :------------ | :----------- | ------: | ------: | -------: | ------: | ------: | -------: | ---: |
| VSE++ | 2018 BMVC | ResNet-152 | GRU | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | 409.8 |
| SCAN | 2018 ECCV | BUTD | Bi-GRU | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 465.0 |
| VSRN | 2019 ICCV | BUTD | GRU | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 482.6 |
| GSMN | 2020 CVPR | BUTD | Bi-GRU | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | 496.8 |
| SGRAF | 2021 AAAI | BUTD | Bi-GRU | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 499.6 |
| NAAF | 2022 CVPR | BUTD | Bi-GRU | 81.9 | 96.1 | 98.3 | 61.0 | 85.3 | 90.6 | 513.2 |
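The metrics above are standard retrieval recalls: R@K is the percentage of queries whose ground-truth match appears among the top-K ranked results, and RSUM is the sum of the six R@K values (e.g., for NAAF: 81.9 + 96.1 + 98.3 + 61.0 + 85.3 + 90.6 = 513.2). Below is a minimal sketch of how these numbers are typically computed from a precomputed image-caption similarity matrix, assuming the usual Flickr30K/MS-COCO evaluation protocol of 5 ground-truth captions per image; the function and variable names are illustrative, not taken from any particular codebase:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute I2T/T2I Recall@K (in %) and RSUM.

    sim: (n_images, 5 * n_images) similarity matrix, where captions
    5*i .. 5*i+4 are the ground-truth matches for image i.
    """
    n = sim.shape[0]

    # Image-to-text: rank of the best-ranked ground-truth caption per image.
    i2t_ranks = np.empty(n)
    for i in range(n):
        order = np.argsort(-sim[i])              # caption indices, best first
        gt = np.arange(5 * i, 5 * i + 5)
        i2t_ranks[i] = np.where(np.isin(order, gt))[0].min()

    # Text-to-image: rank of the ground-truth image per caption.
    t2i_ranks = np.empty(5 * n)
    for j in range(5 * n):
        order = np.argsort(-sim[:, j])           # image indices, best first
        t2i_ranks[j] = np.where(order == j // 5)[0][0]

    i2t = [100.0 * np.mean(i2t_ranks < k) for k in ks]
    t2i = [100.0 * np.mean(t2i_ranks < k) for k in ks]
    return i2t, t2i, sum(i2t) + sum(t2i)
```

For the Flickr30K numbers in the table, `sim` would be a 1000 × 5000 matrix over the standard 1K-image test split. Note that I2T counts a query as correct if *any* of its 5 ground-truth captions is retrieved, while each caption in T2I has exactly one correct image.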