├── overview.png ├── Large_Multimodal_Models_Evaluation__A_Survey_Preprint.pdf └── README.md /README.md: 1 | # Large Multimodal Models Evaluation: A Survey 2 | 3 | This repository complements the paper *Large Multimodal Models Evaluation: A Survey* and organizes benchmarks and resources across understanding (general and specialized), generation, and community platforms. It serves as a hub for researchers to find key datasets, papers, and code. 4 | 5 | **We will continuously maintain and update this repo to ensure long-term value for the community.** 6 | 7 | ![Overview](overview.png) 8 | 9 | 10 | **Paper:** [SCIS](https://www.sciengine.com/SCIS/doi/10.1007/s11432-025-4676-4) 11 | **Project Page:** [AIBench / LMM Evaluation Survey](https://github.com/aiben-ch/LMM-Evaluation-Survey) 12 | 13 | --- 14 | ## Contributions 15 | 16 | We welcome pull requests (PRs)! If you contribute five or more valid benchmarks with relevant details, your contribution will be acknowledged in the next update of the paper's Acknowledgments section. 17 | 18 | Come and join us! 19 | 20 | If you find our work useful, please give us a star. Thank you!
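Many of the benchmarks listed below are hosted on Hugging Face. As a quick-start illustration (not part of the survey itself), the sketch below shows how such a dataset, e.g., II-Bench at `m-a-p/II-Bench`, can typically be pulled with the `datasets` library; the exact configuration and split names vary per dataset card, so check the linked project pages first.

```python
# Minimal sketch: loading one of the Hugging Face-hosted benchmarks listed below.
# Assumes the `datasets` library is installed; split/config names differ per benchmark,
# so consult the dataset card linked in the tables before running.
from datasets import load_dataset

bench = load_dataset("m-a-p/II-Bench", split="test")  # split name may differ per card
sample = bench[0]
print(sample.keys())  # typically image, question, and answer/option fields
```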
21 | 22 | --- 23 | 24 | ## 📖 Citation 25 | 26 | If you find our work useful, please cite our paper as: 27 | 28 | ```bibtex 29 | @article{zhang2025large, 30 | author = {Zhang, Zicheng and Wang, Junying and Wen, Farong and Guo, Yijin and Zhao, Xiangyu and Fang, Xinyu and Ding, Shengyuan and Jia, Ziheng and Xiao, Jiahao and Shen, Ye and Zheng, Yushuo and Zhu, Xiaorong and Wu, Yalun and Jiao, Ziheng and Sun, Wei and Chen, Zijian and Zhang, Kaiwei and Fu, Kang and Cao, Yuqin and Hu, Ming and Zhou, Yue and Zhou, Xuemei and Cao, Juntai and Zhou, Wei and Cao, Jinyu and Li, Ronghui and Zhou, Donghao and Tian, Yuan and Zhu, Xiangyang and Li, Chunyi and Wu, Haoning and Liu, Xiaohong and He, Junjun and Zhou, Yu and Liu, Hui and Zhang, Lin and Wang, Zesheng and Duan, Huiyu and Zhou, Yingjie and Min, Xiongkuo and Jia, Qi and Zhou, Dongzhan and Zhang, Wenlong and Cao, Jiezhang and Yang, Xue and Yu, Junzhi and Zhang, Songyang and Duan, Haodong and Zhai, Guangtao}, 31 | title = {Large Multimodal Models Evaluation: A Survey}, 32 | journal = {SCIENCE CHINA Information Sciences}, 33 | year = {2025}, 34 | volume = {}, 35 | pages = {}, 36 | url = {https://www.sciengine.com/SCIS/doi/10.1007/s11432-025-4676-4}, 37 | doi = {https://doi.org/10.1007/s11432-025-4676-4} 38 | } 39 | ``` 40 | 41 | 42 | ## Table of Contents 43 | 44 | - [Large Multimodal Models Evaluation: A Survey](#large-multimodal-models-evaluation-a-survey) 45 | - [Contributions](#contributions) 46 | - [📖 Citation](#-citation) 47 | - [Table of Contents](#table-of-contents) 48 | - [Understanding Evaluation](#understanding-evaluation) 49 | - [General](#general) 50 | - [Adaptability](#adaptability) 51 | - [Basic Ability](#basic-ability) 52 | - [Comprehensive Perception](#comprehensive-perception) 53 | - [General Knowledge](#general-knowledge) 54 | - [Safety](#safety) 55 | - [Specialized](#specialized) 56 | - [Math](#math) 57 | - [Physics](#physics) 58 | - [Chemistry](#chemistry) 59 | - [Finance](#finance) 60 | - [Healthcare \& Medical Science](#healthcare--medical-science) 61 | - [Code](#code) 62 | - [Autonomous Driving](#autonomous-driving) 63 | - [Earth Science / Remote Sensing](#earth-science--remote-sensing) 64 | - [Embodied Tasks](#embodied-tasks) 65 | - [AI Agent](#ai-agent) 66 | - [Generation Evaluation](#generation-evaluation) 67 | - [Image](#image) 68 | - [Video](#video) 69 | - [Audio](#audio) 70 | - [3D](#3d) 71 | - [Leaderboards and Tools](#leaderboards-and-tools) 72 | 73 | --- 74 | 75 | ## Understanding Evaluation 76 | 77 | ### General 78 | 79 | #### Adaptability 80 | 81 | | Benchmark | Paper | Project Page | 82 | | :-------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 83 | | LLaVA-Bench | [Visual instruction tuning](https://arxiv.org/abs/2304.08485) | [Github](https://github.com/haotian-liu/LLaVA) | 84 | | MIA-Bench | [Mia-bench: Towards better instruction following evaluation of multimodal llms](https://arxiv.org/abs/2407.01509) | [Github](https://github.com/apple/ml-mia-bench) | 85 | | MM-IFEval | [MM-IFEngine: Towards Multimodal Instruction Following](https://arxiv.org/abs/2504.07957) | [Github](https://github.com/SYuan03/MM-IFEngine) | 86 | | VisIT-Bench | [VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use](https://arxiv.org/pdf/2308.06595) | [Github](https://github.com/mlfoundations/VisIT-Bench/) | 87 | | MMDU | [MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and 
Instruction-Tuning Dataset for LVLMs](https://arxiv.org/abs/2406.11833) | [Github](https://github.com/Liuziyu77/MMDU) | 88 | | ConvBench | [ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models](https://arxiv.org/pdf/2403.20194) | [Github](https://github.com/shirlyliu64/ConvBench) | 89 | | SIMMC 2.0 | [SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations](https://arxiv.org/pdf/2104.08667) | [Github](https://github.com/facebookresearch/simmc2) | 90 | | Mementos | [Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences](https://arxiv.org/pdf/2401.10529) | [Github](https://github.com/umd-huang-lab/Mementos) | 91 | | MUIRBENCH | [MUIRBENCH: A Comprehensive Benchmark for Robust Multi-image Understanding](https://arxiv.org/pdf/2406.09411) | [Github](https://github.com/muirbench/MuirBench) | 92 | | MMIU | [MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models](https://arxiv.org/pdf/2408.02718) | [Github](https://github.com/OpenGVLab/MMIU) | 93 | | MIRB | [Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning ](http://arxiv.org/pdf/2406.12742) | [Github](https://github.com/ys-zong/MIRB) | 94 | | MIBench | [MIBench: Evaluating Multimodal Large Language Models over Multiple Images](https://arxiv.org/abs/2407.15272) | [Hugging Face](https://huggingface.co/datasets/StarBottle/MIBench) | 95 | | II-Bench | [II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models](https://arxiv.org/abs/2406.05862) | [Github](https://github.com/II-Bench/II-Bench) | 96 | | Mantis | [Mantis: Interleaved Multi-Image Instruction Tuning](https://arxiv.org/pdf/2405.01483) | [Github](https://github.com/TIGER-AI-Lab/Mantis) | 97 | | MileBench | [MILEBENCH: Benchmarking MLLMs in Long Context](https://arxiv.org/pdf/2404.18532) | [Github](https://github.com/MileBench/MileBench) | 98 | | ReMI | [ReMI: A Dataset for Reasoning with Multiple Images](https://arxiv.org/pdf/2406.09175) | [Hugging Face](https://huggingface.co/datasets/mehrankazemi/ReMI) | 99 | | CODIS | [CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models](https://arxiv.org/pdf/2402.13607) | [Github](https://github.com/THUNLP-MT/CODIS) | 100 | | SPARKLES | [SPARKLES: UNLOCKING CHATS ACROSS MULTIPLE IMAGES FOR MULTIMODAL INSTRUCTION-FOLLOWING MODELS ](https://arxiv.org/pdf/2308.16463) | [Github](https://github.com/HYPJUDY/Sparkles) | 101 | | MMIE | [MMIE: MASSIVE MULTIMODAL INTERLEAVED COMPREHENSION BENCHMARK FOR LARGE VISIONLANGUAGE MODELS](https://arxiv.org/pdf/2410.10139) | [Github](https://github.com/Lillianwei-h/MMIE) | 102 | | InterleavedBench | [Holistic Evaluation for Interleaved Text-and-Image Generation](https://arxiv.org/pdf/2406.14643) | [Hugging Face](https://huggingface.co/mqliu/InterleavedBench) | 103 | | OpenING | [OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation](https://arxiv.org/pdf/2411.18499) | [Github](https://github.com/LanceZPF/OpenING) | 104 | | HumaniBench | [HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation](https://arxiv.org/pdf/2505.11454) | [Github](https://github.com/VectorInstitute/HumaniBench) | 105 | | Herm-Bench | [HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding](https://arxiv.org/pdf/2410.06777) | 
[Github](https://github.com/ZJHTerry18/Human-Centric-MLLM) | 106 | | UNIAA | [UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark](https://arxiv.org/pdf/2404.09619) | [Github](https://github.com/KwaiVGI/Uniaa) | 107 | | HumanBeauty | [HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment](https://arxiv.org/pdf/2503.23907) | [Github](https://github.com/KwaiVGI/HumanAesExpert) | 108 | | SocialIQA | [SOCIAL IQA: Commonsense Reasoning about Social Interactions](https://arxiv.org/pdf/1904.09728) | [Hugging Face](https://huggingface.co/datasets/allenai/social_i_qa) | 109 | | EmpathicStories++ | [EmpathicStories++: A Multimodal Dataset for Empathy towards Personal Experiences](https://arxiv.org/pdf/2405.15708) | [Dataset Download](https://mitmedialab.github.io/empathic-stories-multimodal/) | 110 | | Chatbot Arena | [Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference](https://arxiv.org/pdf/2403.04132) | *Not available* | 111 | | OpenAssistant Conversations | [OpenAssistant Conversations - Democratizing Large Language Model Alignment](https://arxiv.org/pdf/2304.07327) | [Github](https://github.com/LAION-AI/Open-Assistant) | 112 | | HCE | [Human-Centric Evaluation for Foundation Models](https://arxiv.org/pdf/2506.01793) | [Github](https://github.com/yijinguo/Human-Centric-Evaluation) | 113 | 114 | #### Basic Ability 115 | 116 | | Benchmark | Paper | Project Page | 117 | | :---------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 118 | | NWPU-MOC | [NWPU-MOC: A Benchmark for Fine-grained Multi-category Object Counting in Aerial Images](https://arxiv.org/pdf/2401.10530) | [Github](https://github.com/lyongo/NWPU-MOC) | 119 | | T2V-CompBench | [T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation](https://arxiv.org/pdf/2407.14505) | [Github](https://github.com/KaiyueSun98/T2V-CompBench/tree/V2) | 120 | | ConceptMix | [ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty](https://arxiv.org/pdf/2408.14339) | [Github](https://github.com/princetonvisualai/ConceptMix) | 121 | | PICD | [Can Machines Understand Composition?
Dataset and Benchmark for Photographic Image Composition Embedding and Understanding](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhao_Can_Machines_Understand_Composition_Dataset_and_Benchmark_for_Photographic_Image_CVPR_2025_paper.pdf) | [Github](https://github.com/CV-xueba/PICD_ImageComposition) | 122 | | TextVQA | [Towards VQA Models That Can Read](https://arxiv.org/pdf/1904.08920) | [Github](https://github.com/facebookresearch/mmf) | 123 | | OCR-VQA | [OCR-VQA: Visual Question Answering by Reading Text in Images](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8978122) | [Dataset download](https://ocr-vqa.github.io/) | 124 | | OCRBench | [OCRBENCH: ON THE HIDDEN MYSTERY OF OCR IN LARGE MULTIMODAL MODELS](https://arxiv.org/pdf/2305.07895) | [Github](https://github.com/Yuliang-Liu/MultimodalOCR) | 125 | | OCRBench v2 | [OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning](https://arxiv.org/pdf/2501.00321) | [Github](https://github.com/Yuliang-Liu/MultimodalOCR) | 126 | | ASCIIEval | [VISUAL PERCEPTION IN TEXT STRINGS](https://arxiv.org/pdf/2410.01733) | [Github](https://github.com/JiaQiSJTU/VisionInText) | 127 | | OCRReasoning | [OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning ](https://arxiv.org/pdf/2505.17163) | [Github](https://github.com/SCUT-DLVCLab/OCR-Reasoning) | 128 | | M4-ViteVQA | [Towards Video Text Visual Question Answering: Benchmark and Baseline](https://proceedings.neurips.cc/paper_files/paper/2022/file/e726197ffd401df4013cd9f81007b5cf-Paper-Datasets_and_Benchmarks.pdf) | [Github](https://github.com/bytedance/VTVQA) | 129 | | SEED-Bench-2-Plus | [SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension](https://arxiv.org/pdf/2404.16790) | [Github](https://github.com/AILab-CVC/SEED-Bench) | 130 | | MMDocBench | [MMDOCBENCH: BENCHMARKING LARGE VISIONLANGUAGE MODELS FOR FINE-GRAINED VISUAL DOCUMENT UNDERSTANDING](https://arxiv.org/pdf/2410.21311) | [Github](https://github.com/fengbinzhu/MMDocBench) | 131 | | MMLongBench-Doc | [MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations ](https://arxiv.org/pdf/2407.01523) | [Github](https://github.com/mayubo2333/MMLongBench-Doc) | 132 | | UDA | [UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis](https://arxiv.org/pdf/2406.15187) | [Github](https://github.com/qinchuanhui/UDA-Benchmark) | 133 | | VisualMRC | [VisualMRC: Machine Reading Comprehension on Document Images ](https://cdn.aaai.org/ojs/17635/17635-13-21129-1-2-20210518.pdf) | [Github](https://github.com/nttmdlab-nlp/VisualMRC) | 134 | | DocVQA | [DocVQA: A Dataset for VQA on Document Images](https://arxiv.org/pdf/2007.00398) | [Hugging Face](https://huggingface.co/datasets/eliolio/docvqa) | 135 | | DocGenome | [DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models](https://arxiv.org/pdf/2406.11633) | [Github](https://github.com/Alpha-Innovator/DocGenome) | 136 | | GDI-Bench | [GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling](https://arxiv.org/pdf/2505.00063) | [Dataset download](https://knowledgexlab.github.io/gdibench.github.io/) | 137 | | AitW | [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/pdf/2307.10088) | 
[Github](https://github.com/google-research/google-research/tree/master/android_in_the_wild) | 138 | | ScreenSpot | [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/pdf/2401.10935) | [Github](https://github.com/njucckevin/SeeClick) | 139 | | VisualWebBench | [VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?](https://arxiv.org/pdf/2404.05955) | [Github](https://github.com/VisualWebBench/VisualWebBench) | 140 | | GUI-WORLD | [GUI-WORLD: A VIDEO BENCHMARK AND DATASET FOR MULTIMODAL GUI-ORIENTED UNDERSTANDING](https://arxiv.org/pdf/2406.10819) | [Github](https://github.com/Dongping-Chen/GUI-World) | 141 | | WebUIBench | [WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code ](https://www.arxiv.org/pdf/2506.07818) | [Github](https://github.com/MAIL-Tele-AI/WebUIBench) | 142 | | ScreenQA | [ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots ](https://arxiv.org/pdf/2209.08199) | [Github](https://github.com/google-research-datasets/screen_qa) | 143 | | ChartQA | [ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning](https://arxiv.org/pdf/2203.10244) | [Github](https://github.com/vis-nlp/ChartQA) | 144 | | ChartQApro | [CHARTQAPRO : A More Diverse and Challenging Benchmark for Chart Question Answering](https://arxiv.org/pdf/2504.05506) | [Github](https://github.com/vis-nlp/ChartQAPro) | 145 | | ComTQA | [TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy](https://arxiv.org/pdf/2406.01326) | [Github](https://github.com/sakura2233565548/TabPedia) | 146 | | TableVQA-Bench | [TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains ](https://arxiv.org/pdf/2404.19205) | [Github](https://github.com/naver-ai/tablevqabench) | 147 | | CharXiv | [CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs](https://arxiv.org/pdf/2406.18521) | [Github](https://github.com/princeton-nlp/CharXiv) | 148 | | SciFIBench | [SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation](https://arxiv.org/pdf/2405.08807) | [Github](https://github.com/jonathan-roberts1/SciFIBench) | 149 | | AI2D-RST | [AI2D-RST: A multimodal corpus of 1000 primary school science diagrams](https://arxiv.org/pdf/1912.03879) | [Github](https://github.com/thiippal/AI2D-RST) | 150 | | InfoChartQA | [InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts](https://arxiv.org/pdf/2505.19028) | [Github](https://github.com/CoolDawnAnt/InfoChartQA) | 151 | | EvoChart-QA | [EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding](https://arxiv.org/pdf/2409.01577) | [Github](https://github.com/MuyeHuang/EvoChart) | 152 | | WikiMixQA | [WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts](https://arxiv.org/pdf/2506.15594) | [Github](https://github.com/negar-foroutan/WikiMixQA) | 153 | | ChartX | [ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning](https://arxiv.org/pdf/2402.12185) | [Github](https://github.com/Alpha-Innovator/ChartVLM) | 154 | | Q-Bench | [Q-BENCH: A BENCHMARK FOR GENERAL-PURPOSE FOUNDATION MODELS ON LOW-LEVEL VISION ](https://arxiv.org/pdf/2309.14181) | [Github](https://github.com/Q-Future/Q-Bench) | 155 | | A-Bench | [A-BENCH: ARE LMMS MASTERS AT EVALUATING AI-GENERATED IMAGES?](https://arxiv.org/pdf/2406.03070) | 
[Github](https://github.com/Q-Future/A-Bench) | 156 | | MVP-Bench | [MVP-Bench: Can Large Vision–Language Models Conduct Multi-level Visual Perception Like Humans?](https://arxiv.org/pdf/2410.04345) | [Github](https://github.com/GuanzhenLi/MVP-Bench) | 157 | | XLRS-Bench | [XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?](https://arxiv.org/pdf/2503.23771) | [Github](https://github.com/AI9Stars/XLRS-Bench) | 158 | | HR-Bench | [Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models](https://arxiv.org/pdf/2408.15556) | [Github](https://github.com/DreamMr/HR-Bench) | 159 | | MME-RealWorld | [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://arxiv.org/pdf/2408.13257) | [Github](https://github.com/MME-Benchmarks/MME-RealWorld) | 160 | | V*Bench | [V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs](https://arxiv.org/pdf/2312.14135) | [Github](https://github.com/penghao-wu/vstar) | 161 | | FaceBench | [FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs](https://arxiv.org/pdf/2503.21457) | [Github](https://github.com/CVI-SZU/FaceBench) | 162 | | MMAFFBen | [MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs](https://arxiv.org/pdf/2505.24423) | [Github](https://github.com/lzw108/MMAFFBen) | 163 | | FABA-Bench | [Facial Affective Behavior Analysis with Instruction Tuning](https://arxiv.org/pdf/2404.05052) | [Github](https://github.com/JackYFL/EmoLA) | 164 | | MEMO-Bench | [MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis](https://arxiv.org/pdf/2411.11235) | [Github](https://github.com/zyj-2000/MEMO-Bench) | 165 | | EmoBench | [EmoBench: Evaluating the Emotional Intelligence of Large Language Models](https://arxiv.org/pdf/2402.12071) | [Github](https://github.com/Sahandfer/EmoBench) | 166 | | EEmo-Bench | [EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment](https://arxiv.org/pdf/2504.16405) | [Github](https://github.com/workerred/EEmo-Bench) | 167 | | AesBench | [AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception](https://arxiv.org/pdf/2401.08276) | [Github](https://github.com/yipoh/AesBench) | 168 | | UNIAA-Bench | [UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark](https://arxiv.org/pdf/2404.09619) | [Github](https://github.com/KwaiVGI/Uniaa) | 169 | | ImplicitAVE | [ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction](https://arxiv.org/pdf/2404.15592) | [Github](https://github.com/HenryPengZou/ImplicitAVE) | 170 | | II-Bench | [II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models](https://arxiv.org/pdf/2406.05862) | [Hugging Face](https://huggingface.co/datasets/m-a-p/II-Bench) | 171 | | CogBench | [A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models](https://arxiv.org/pdf/2402.18409) | [Github](https://github.com/X-LANCE/CogBench) | 172 | | A4Bench | [Affordance Benchmark for MLLMs](https://arxiv.org/pdf/2506.00893) | [Github](https://github.com/JunyingWang959/A4Bench) | 173 | | MM-SAP | [MM-SAP: A Comprehensive Benchmark for Assessing
Self-Awareness of Multimodal Large Language Models in Perception](https://arxiv.org/pdf/2401.07529) | [Github](https://github.com/YHWmz/MM-SAP) | 174 | | Cambrian-1 | [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/pdf/2406.16860) | [Github](https://github.com/cambrian-mllm/cambrian) | 175 | | MMUBench | [Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models](https://arxiv.org/pdf/2405.12523) | *Not available* | 176 | | MMVP | [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/pdf/2401.06209) | [Github](https://github.com/tsb0601/MMVP) | 177 | | MICBench | [Towards Open-ended Visual Quality Comparison](https://arxiv.org/pdf/2402.16641) | [Github](https://github.com/Q-Future/Co-Instruct) | 178 | | CulturalVQA | [Benchmarking Vision Language Models for Cultural Understanding](https://arxiv.org/pdf/2407.10920) | [Hugging Face](https://huggingface.co/datasets/mair-lab/CulturalVQA) | 179 | | RefCOCO Family | [Modeling Context in Referring Expressions](https://arxiv.org/pdf/1608.00272) [Generation and Comprehension of Unambiguous Object Descriptions](https://arxiv.org/pdf/1511.02283) | [Github](https://github.com/lichengunc/refer) [Github](https://github.com/mjhucla/Google_Refexp_toolbox) | 180 | | Ref-L4 | [Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models](https://arxiv.org/pdf/2406.16866) | [Github](https://github.com/JierunChen/Ref-L4) | 181 | | MRES-32M | [Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation](https://arxiv.org/pdf/2312.08007) | [Github](https://github.com/Rubics-Xuan/MRES) | 182 | | UrBench | [UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios](https://arxiv.org/pdf/2408.17267) | [Github](https://github.com/opendatalab/UrBench) | 183 | | COUNTS | [COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts](https://arxiv.org/pdf/2504.10158) | [Github](https://github.com/jiansheng-li/COUNTS_benchmark) | 184 | | MTVQA | [MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering](https://arxiv.org/pdf/2405.11985) | [Github](https://github.com/bytedance/MTVQA) | 185 | | GePBench | [GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models](https://arxiv.org/pdf/2412.21036) | *Not available* | 186 | | SpatialMQA | [Can Multimodal Large Language Models Understand Spatial Relations?](https://arxiv.org/pdf/2505.19015) | [Github](https://github.com/ziyan-xiaoyu/SpatialMQA) | 187 | | SpatialRGPT-Bench | [SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models](https://arxiv.org/pdf/2406.01584) | [Github](https://github.com/AnjieCheng/SpatialRGPT) | 188 | | CoSpace | [CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models](https://arxiv.org/pdf/2503.14161) | [Github](https://github.com/THUNLP-MT/CoSpace/) | 189 | | MLLM-CompBench | [MLLM-COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs](https://arxiv.org/pdf/2407.16837) | [Github](https://github.com/RaptorMai/MLLM-CompBench) | 190 | | SOK-Bench | [SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge](https://arxiv.org/pdf/2405.09713) | [Github](https://github.com/csbobby/SOK-Bench) | 191 | | GSR-Bench | [GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal
LLMs](https://arxiv.org/pdf/2406.13246) | *Not available* | 192 | | What's "up" | [What’s “up” with vision-language models? Investigating their struggle with spatial reasoning](https://arxiv.org/pdf/2310.19785) | [Github](https://github.com/amitakamath/whatsup_vlms) | 193 | | Q-Spatial Bench | [Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models](https://arxiv.org/pdf/2409.09788) | [Github](https://github.com/andrewliao11/Q-Spatial-Bench-code) | 194 | | AS-V2 | [The All-Seeing Project V2: Towards General Relation Comprehension of the Open World](https://arxiv.org/pdf/2402.19474) | [Github](https://github.com/OpenGVLab/all-seeing) | 195 | | Visual CoT | [Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning](https://arxiv.org/pdf/2403.16999) | [Github](https://github.com/deepcs233/Visual-CoT) | 196 | | LogicVista | [LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts](https://arxiv.org/pdf/2407.04973) | [Github](https://github.com/Yijia-Xiao/LogicVista) | 197 | | VisuLogic | [VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models](https://arxiv.org/pdf/2504.15279) | [Github](https://github.com/VisuLogic-Benchmark) | 198 | | CoMT | [CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models](https://arxiv.org/pdf/2412.12932) | [Github](https://github.com/czhhzc/CoMT) | 199 | | PUZZLES | [PUZZLES: A Benchmark for Neural Algorithmic Reasoning](https://arxiv.org/pdf/2407.00401) | [Github](https://github.com/ETH-DISCO/rlp) | 200 | | LOVA3 | [LOVA3: Learning to Visual Question Answering, Asking and Assessment](https://arxiv.org/pdf/2405.14974) | [Github](https://github.com/showlab/LOVA3) | 201 | | VLKEB | [VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark](https://arxiv.org/pdf/2403.07350) | [Github](https://github.com/VLKEB/VLKEB) | 202 | | MMKE-Bench | [MMKE-BENCH: A MULTIMODAL EDITING BENCHMARK FOR DIVERSE VISUAL KNOWLEDGE](https://arxiv.org/pdf/2502.19870) | [Github](https://github.com/MMKE-Bench-ICLR/MMKE-Bench) | 203 | | MC-MKE | [MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency](https://arxiv.org/pdf/2406.13219) [MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing](https://arxiv.org/pdf/2402.14835) | *Not available* | 204 | | NegVQA | [NegVQA: Can Vision Language Models Understand Negation?
](https://arxiv.org/pdf/2505.22946) | [Github](https://github.com/yuhui-zh15/NegVQA) | 205 | | LongBench | [LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding](https://arxiv.org/pdf/2308.14508) | [Github](https://github.com/THUDM/LongBench) | 206 | | OPOR-BENCH | [OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation](https://arxiv.org/abs/2512.01896) | *Not available* | 207 | | VRT-Bench | [Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark](https://arxiv.org/abs/2512.05091) | [Github](https://github.com/bytedance/Sa2VA/tree/main/projects/vrt_sa2va) | 208 | 209 | #### Comprehensive Perception 210 | 211 | | Benchmark | Paper | Project Page | 212 | | :---------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 213 | | LVLM-eHub | [Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.](https://arxiv.org/abs/2306.09265) | [GitHub](https://github.com/OpenGVLab/Multi-Modality-Arena) | 214 | | TinyLVLM-eHub | [Tinylvlm-ehub: Towards comprehensive and efficient evaluation for large vision-language models.](https://arxiv.org/abs/2308.03729) | [GitHub](https://github.com/OpenGVLab/Multi-Modality-Arena) | 215 | | LAMM | [Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark.](https://arxiv.org/abs/2306.06687) | [GitHub](https://github.com/OpenLAMM/LAMM) | 216 | | MME | [Mme: A comprehensive evaluation benchmark for multimodal large language models.](https://arxiv.org/abs/2306.13394) | [Project Page](https://mme-benchmark.github.io/home_page.html) | 217 | | MMBench | [Mmbench: Is your multi-modal model an all-around player?](https://arxiv.org/abs/2307.06281) | [GitHub](https://github.com/open-compass/mmbench) | 218 | | SEED-Bench series | [SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension.](https://arxiv.org/abs/2307.16125) | [GitHub](https://github.com/AILab-CVC/SEED-Bench) | 219 | | MMT-Bench | [Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi.](https://arxiv.org/abs/2404.16006) | [GitHub](https://github.com/OpenGVLab/MMT-Bench) | 220 | | LMMs-Eval | [Lmms-eval: Reality check on the evaluation of large multimodal models.](https://arxiv.org/abs/2407.12772) | [GitHub](https://github.com/EvolvingLMMs-Lab/lmms-eval) | 221 | | MMStar | [Are we on the right way for evaluating large vision-language models?](https://arxiv.org/abs/2403.20330) | [GitHub](https://github.com/MMStar-Benchmark/MMStar) | 222 | | NaturalBench | [Naturalbench: Evaluating vision-language models on natural adversarial samples.](https://arxiv.org/abs/2410.14669) | [Project Page](https://github.com/Baiqi-Li/NaturalBench) | 223 | | MM-Vet | [Mm-vet: Evaluating large multimodal models for integrated capabilities.](https://arxiv.org/abs/2308.02490) | [GitHub](https://github.com/yuweihao/MM-Vet) | 224 | | ChEF | [Chef: A comprehensive evaluation framework for standardized assessment of multimodal large language models.](https://arxiv.org/abs/2311.02692) | [GitHub](https://openlamm.github.io/ChEF/) | 225 | | Video-MME | [Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.](https://arxiv.org/abs/2405.21075) | [GitHub](https://github.com/MME-Benchmarks/Video-MME) | 226 | | MMBench-Video | [Mmbench-video: A long-form multi-shot benchmark for holistic video
understanding.](https://arxiv.org/abs/2406.14515) | [GitHub](https://mmbench-video.github.io/) | 227 | | MVBench | [Mvbench: A comprehensive multi-modal video understanding benchmark.](https://arxiv.org/abs/2311.17005) | [Hugging Face](https://huggingface.co/datasets/OpenGVLab/MVBench) | 228 | | LongVideoBench | [Longvideobench: A benchmark for long-context interleaved video-language understanding.](https://arxiv.org/abs/2407.15754) | [GitHub](https://github.com/longvideobench/LongVideoBench) | 229 | | LVBench | [Lvbench: An extreme long video understanding benchmark.](https://arxiv.org/abs/2406.08035) | [GitHub](https://github.com/zai-org/LVBench) | 230 | | MotionBench | [Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models.](https://arxiv.org/html/2501.02955v1) | [GitHub](https://github.com/zai-org/MotionBench) | 231 | | AudioBench | [Audiobench: A universal benchmark for audio large language models.](https://arxiv.org/abs/2406.16020) | [GitHub](https://github.com/AudioLLMs/AudioBench) | 232 | | AIR-Bench | [Air-bench: Benchmarking large audio-language models via generative comprehension.](https://arxiv.org/abs/2402.07729) | [GitHub](https://github.com/OFA-Sys/AIR-Bench) | 233 | | Dynamic-SUPERB | [Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech.](https://arxiv.org/abs/2309.09510) | [GitHub](https://github.com/dynamic-superb/dynamic-superb) | 234 | | M3DBench | [M3dbench: Let's instruct large models with multi-modal 3d prompts.](https://arxiv.org/abs/2312.10763) | [GitHub](https://github.com/OpenM3D/M3DBench) | 235 | | M3D | [M3d: Advancing 3d medical image analysis with multi-modal large language models.](https://arxiv.org/abs/2404.00578) | [GitHub](https://github.com/BAAI-DCAI/M3D) | 236 | | Space3D-Bench | [Space3d-bench: Spatial 3d question answering benchmark.](https://arxiv.org/abs/2408.16662) | [Project Page](https://space3d-bench.github.io/) | 237 | 238 | #### General Knowledge 239 | 240 | | Benchmark | Paper | Project Page | 241 | | :---------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 242 | | ScienceQA | [Learn to explain: Multimodal reasoning via thought chains for science question answering.](https://arxiv.org/abs/2209.09513) | [GitHub](https://github.com/lupantech/ScienceQA) | 243 | | CMMU | [Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning.](https://arxiv.org/abs/2401.14011) | [GitHub](https://github.com/FlagOpen/CMMU) | 244 | | Scibench | [Scibench: Evaluating college-level scientific problem-solving abilities of large language models.](https://arxiv.org/abs/2307.10635) | [GitHub](https://github.com/mandyyyyii/scibench) | 245 | | EXAMS-V | [Exams-v: A multi-discipline multilingual multi-modal exam benchmark for evaluating vision language models.](https://arxiv.org/abs/2403.10378) | [Hugging Face](https://huggingface.co/datasets/Rocktim/EXAMS-V) | 246 | | MMMU | [Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.](https://arxiv.org/abs/2311.16502) | [GitHub](https://github.com/MMMU-Benchmark/MMMU) | 247 | | MMMU-Pro | [Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark.](https://arxiv.org/abs/2409.02813) | [Hugging Face](https://huggingface.co/datasets/MMMU/MMMU_Pro/blob/main/README.md) | 248 | | HLE | [Humanity's last exam.](https://arxiv.org/html/2501.14249v1) | 
[Project Page](https://otaheri.github.io/publication/2025_lastexam/) | 249 | | CURIE | [Curie: Evaluating llms on multitask scientific long context understanding and reasoning.](https://arxiv.org/abs/2503.13517) | [GitHub](https://github.com/google/curie) | 250 | | SFE | [Scientists' first exam: Probing cognitive abilities of mllm via perception, understanding, and reasoning.](https://arxiv.org/abs/2506.10521) | [Hugging Face](https://huggingface.co/papers/2506.10521) | 251 | | MMIE | [Mmie: Massive multimodal interleaved comprehension benchmark for large vision-language models.](https://arxiv.org/abs/2410.10139) | [GitHub](https://github.com/Lillianwei-h/MMIE) | 252 | | MDK12-Bench | [Mdk12-bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models.](https://arxiv.org/abs/2504.05782) | [GitHub](https://github.com/LanceZPF/MDK12) | 253 | | EESE | [The ever-evolving science exam.](https://arxiv.org/abs/2507.16514) | [Project Page](https://github.com/aiben-ch/EESE) | 254 | | Q-Mirror | [Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs](https://arxiv.org/abs/2509.24297) | [GitHub](https://github.com/aiben-ch/Q-Mirror) | 255 | 256 | #### Safety 257 | 258 | | **Benchmark** | **Paper** | **Project Page** | 259 | | :-------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 260 | | Unicorn | [How many unicorns are in this image? a safety evaluation benchmark for vision llms.](https://arxiv.org/abs/2311.16101) | [GitHub](https://github.com/UCSC-VLAA/vllm-safety-benchmark) | 261 | | JailbreakV-28K | [Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.](https://arxiv.org/abs/2404.03027) | [GitHub](https://github.com/SaFoLab-WISC/JailBreakV_28K) | 262 | | MM-SafetyBench | [Mm-safetybench: A benchmark for safety evaluation of multimodal large language models.](https://arxiv.org/abs/2311.17600) | [GitHub](https://github.com/isXinLiu/MM-SafetyBench) | 263 | | AVIBench | [Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions.](https://arxiv.org/abs/2403.09346) | [GitHub](https://github.com/zhanghao5201/B-AVIBench) | 264 | | MMJ-Bench | [MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models.](https://arxiv.org/abs/2408.08464) | [GitHub](https://github.com/thunxxx/MLLM-Jailbreak-evaluation-MMJ-bench) | 265 | | USB | [Usb: A comprehensive and unified safety evaluation benchmark for multimodal large language models.](https://arxiv.org/abs/2505.23793) | [GitHub](https://github.com/Hongqiong12/USB-SafeBench) | 266 | | MLLMGuard | [Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models.](https://arxiv.org/abs/2406.07594) | [GitHub](https://github.com/Carol-gutianle/MLLMGuard) | 267 | | SafeBench | [Safebench: A safety evaluation framework for multimodal large language models.](https://arxiv.org/abs/2410.18927) | [GitHub](https://safebench-mm.github.io/) | 268 | | MemeSafetyBench | [Are vision-language models safe in the wild? 
a meme-based benchmark study.](https://arxiv.org/abs/2505.15389) | [Hugging Face](https://huggingface.co/papers/2505.15389) | 269 | | UnsafeBench | [Unsafebench: Benchmarking image safety classifiers on real-world and ai-generated images.](https://arxiv.org/abs/2405.03486) | *Not available* | 270 | | POPE | [Evaluating object hallucination in large vision-language models.](https://arxiv.org/abs/2305.10355) | [GitHub](https://github.com/RUCAIBox/POPE) | 271 | | M-HalDetect | [Detecting and preventing hallucinations in large vision language models.](https://arxiv.org/abs/2308.06394) | *Not available* | 272 | | Hal-Eval | [Hal-eval: A universal and fine-grained hallucination evaluation framework for large vision language models.](https://arxiv.org/abs/2402.15721) | [GitHub](https://github.com/WisdomShell/hal-eval/tree/main) | 273 | | Hallu-pi | [Hallu-pi: Evaluating hallucination in multi-modal large language models within perturbed inputs.](https://arxiv.org/abs/2408.01355) | *Not available* | 274 | | BEAF | [Beaf: Observing before-after changes to evaluate hallucination in vision-language models.](https://dl.acm.org/doi/10.1007/978-3-031-73247-8_14) | [GitHub](https://beafbench.github.io/) | 275 | | HallusionBench | [Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models.](https://arxiv.org/abs/2310.14566) | [Project Page](https://github.com/tianyi-lab/HallusionBench) | 276 | | AutoHallusion | [Autohallusion: Automatic generation of hallucination benchmarks for vision-language models.](https://arxiv.org/abs/2406.10900) | [GitHub](https://github.com/wuxiyang1996/AutoHallusion) | 277 | | MultiTrust | [Benchmarking trustworthiness of multimodal large language models: A comprehensive study.](https://arxiv.org/abs/2406.07057) | [GitHub](https://github.com/thu-ml/MMTrustEval) | 278 | | MMDT | [Mmdt: Decoding the trustworthiness and safety of multimodal foundation models.](https://proceedings.iclr.cc/paper_files/paper/2025/file/0bcfb525c8f8f07ae10a93d0b2a40e00-Paper-Conference.pdf) | [GitHub](https://github.com/AI-secure/MMDT) | 279 | | Text2VLM | [Text2vlm: Adapting text-only datasets to evaluate alignment training in visual language models.](https://arxiv.org/abs/2507.20704) | *Not available* | 280 | | MOSSBench | [Mossbench: Is your multimodal language model oversensitive to safe queries?](https://arxiv.org/abs/2406.17806) | [GitHub](https://github.com/xirui-li/MOSSBench) | 281 | | CulturalVQA | [Benchmarking vision language models for cultural understanding.](https://arxiv.org/abs/2407.10920) | [Project Page](https://culturalvqa.org/) | 282 | | ModScan | [Modscan: Measuring stereotypical bias in large vision-language models from vision and language modalities.](https://arxiv.org/html/2410.06967v1) | [GitHub](https://github.com/TrustAIRLab/ModSCAN/tree/main) | 283 | | FMBench | [Fmbench: Benchmarking fairness in multimodal large language models on medical tasks.](https://arxiv.org/abs/2410.01089) | Not available | 284 | | FairMedFM | [Fairmedfm: Fairness benchmarking for medical imaging foundation models.](https://arxiv.org/abs/2407.00983) | [GitHub](https://github.com/FairMedFM/FairMedFM) | 285 | | FairCLIP | [Fair-clip: Harnessing fairness in vision-language learning.](https://arxiv.org/abs/2403.19949) | [GitHub](https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP) | 286 | | DoxingBench | [Doxing via the lens: Revealing privacy leakage in image geolocation for agentic multi-modal large reasoning 
model.](https://arxiv.org/abs/2504.19373) | [Hugging Face](https://huggingface.co/papers/2504.19373) | 287 | | PrivQA | [Can language models be instructed to protect personal information?](https://arxiv.org/abs/2310.02224) | *Not available* | 288 | | SHIELD | [Shield: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models.](https://arxiv.org/abs/2402.04178) | [GitHub](https://github.com/laiyingxin2/SHIELD) | 289 | | ExtremeAIGC | [Extremeaigc: Benchmarking lmm vulnerability to ai-generated extremist content.](https://arxiv.org/abs/2503.09964) | *Not available* | 290 | 291 | ### Specialized 292 | 293 | #### Math 294 | 295 | | **Benchmark** | **Paper** | **Project Page** | 296 | | :------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 297 | | MathVista | [Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.](https://arxiv.org/abs/2310.02255) | [GitHub](https://github.com/lupantech/MathVista) | 298 | | PolyMATH | [Polymath: A challenging multi-modal mathematical reasoning benchmark.](https://arxiv.org/abs/2410.14702) | [Project Page](https://polymathbenchmark.github.io/) | 299 | | MATH-Vision | [Measuring multimodal mathematical reasoning with math-vision dataset.](https://arxiv.org/abs/2402.14804) | [Project Page](https://mathvision-cuhk.github.io/) | 300 | | OlympiadBench | [Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.](https://arxiv.org/abs/2402.14008) | [GitHub](https://github.com/OpenBMB/OlympiadBench) | 301 | | PolyMath | [Polymath: Evaluating mathematical reasoning in multilingual contexts.](https://arxiv.org/abs/2504.18428) | [GitHub](https://github.com/qwenlm/polymath) | 302 | | MathVerse | [Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?](https://arxiv.org/abs/2403.14624) | [GitHub](https://github.com/ZrrSkywalker/MathVerse) | 303 | | WE-MATH | [We-math: Does your large multimodal model achieve human-like mathematical reasoning?](https://arxiv.org/abs/2407.01284) | [GitHub](https://github.com/We-Math/We-Math) | 304 | | MathScape | [Mathscape: Evaluating mllms in multimodal math scenarios through a hierarchical benchmark.](https://arxiv.org/abs/2408.07543) | [Github](https://github.com/Ahalfmoon/MathScape) | 305 | | CMM-Math | [Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models.](https://arxiv.org/abs/2409.02834) | [Hugging Face](https://huggingface.co/datasets/ecnu-icalk/cmm-math) | 306 | | MV-MATH | [Mv-math: Evaluating multimodal math reasoning in multi-visual contexts.](https://arxiv.org/abs/2502.20808) | [GitHub](https://github.com/eternal8080/MV-MATH) | 307 | 308 | #### Physics 309 | 310 | | **Benchmark** | **Paper** | **Project Page** | 311 | | :---------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 312 | | ScienceQA | [Learn to explain: Multimodal reasoning via thought chains for science question answering.](https://arxiv.org/abs/2209.09513) | [GitHub](https://github.com/lupantech/ScienceQA) | 313 | | TQA | [Are you smarter than a sixth grader?
textbook question answering for multimodal machine comprehension.](https://openaccess.thecvf.com/content_cvpr_2017/papers/Kembhavi_Are_You_Smarter_CVPR_2017_paper.pdf) | *Not available* | 314 | | AI2D | [A diagram is worth a dozen images.](https://arxiv.org/abs/1603.07396) | [Project Page](https://allenai.org/) | 315 | | MM-PhyQA | [Mm-phyqa: Multimodal physics question answering with multi-image cot prompting.](https://arxiv.org/abs/2404.08704) | *Not available* | 316 | | PhysUniBench | [Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models.](https://arxiv.org/abs/2506.17667) | [Project Page](https://prismax-team.github.io/PhysUniBenchmark/) | 317 | | PhysicsArena | [Physicsarena: The first multimodal physics reasoning benchmark exploring variable, process, and solution dimensions.](https://arxiv.org/abs/2505.15472) | [Hugging Face](https://huggingface.co/papers/2505.15472) | 318 | | SeePhys | [Seephys: Does seeing help thinking? benchmarking vision-based physics reasoning.](https://arxiv.org/pdf/2505.19099) | [GitHub](https://github.com/AI4Phys/SeePhys) | 319 | | PhysReason | [Physreason: A comprehensive benchmark towards physics-based reasoning.](https://arxiv.org/abs/2502.12054) | [Hugging Face](https://huggingface.co/datasets/zhibei1204/PhysReason) | 320 | | OlympiadBench | [Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.](https://arxiv.org/abs/2402.14008) | [GitHub](https://github.com/OpenBMB/OlympiadBench) | 321 | | SceMQA | [Scemqa: A scientific college entrance level multimodal question answering benchmark.](https://arxiv.org/abs/2402.05138) | [GitHub](https://github.com/SceMQA/SceMQA) | 322 | | PACS | [Pacs: A dataset for physical audiovisual commonsense reasoning.](https://arxiv.org/abs/2203.11130) | [GitHub](https://github.com/samuelyu2002/PACS) | 323 | | GRASP | [GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models.](https://arxiv.org/abs/2311.09048) | [GitHub](https://github.com/i-machine-think/grasp) | 324 | | CausalVQA | [Causalvqa: A physically grounded causal reasoning benchmark for video models.](https://arxiv.org/abs/2506.09943) | [GitHub](https://github.com/facebookresearch/CausalVQA) | 325 | | LiveXiv | [Livexiv a multi-modal live benchmark based on arxiv papers content.](https://arxiv.org/abs/2410.10783) | [GitHub](https://github.com/NimrodShabtay/LiveXiv) | 326 | | VideoScience-Bench| [Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench.](https://arxiv.org/abs/2512.02942) | [GitHub](https://github.com/hao-ai-lab/VideoScience) | 327 | 328 | #### Chemistry 329 | 330 | | **Benchmark** | **Paper** | **Project Page** | 331 | | :-----------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 332 | | SMILES | [Smiles, a chemical language and information system. 1. 
introduction to methodology and encoding rules.](https://pubs.acs.org/doi/abs/10.1021/ci00057a005) | [Project Page](http://opensmiles.org/) | 333 | | ChEBI-20 | [Text2mol: Cross-modal molecule retrieval with natural language queries.](https://github.com/cnedwards/text2mol) | [GitHub](https://github.com/cnedwards/text2mol) | 334 | | ChemBench | [Chemllm: A chemical large language model.](https://arxiv.org/abs/2402.06852) | [GitHub](https://github.com/ChemFoundationModels/ChemLLMBench) | 335 | | SELFIES | [Self-referencing embedded strings (selfies): A 100% robust molecular string representation.](https://github.com/aspuru-guzik-group/selfies) | [GitHub](https://github.com/aspuru-guzik-group/selfies) | 336 | | InChI | [Inchi, the iupac international chemical identifier.](https://iupac.org/who-we-are/divisions/division-details/inchi/) | [Project Page](https://www.inchi-trust.org/) | 337 | | MolX | [Molx: Enhancing large language models for molecular learning with a multi-modal extension.](https://arxiv.org/abs/2406.06777) | *Not available* | 338 | | GiT-Mol | [Git-mol: A multi-modal large language model for molecular science with graph, image, and text.](https://arxiv.org/abs/2308.06911) | [GitHub](https://github.com/AI-HPC-Research-Team/GIT-Mol) | 339 | | Instruct-Mol | [Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery.](https://arxiv.org/abs/2311.16208) | [GitHub](https://idea-xl.github.io/InstructMol/) | 340 | | ChEBI-20-MM | [A quantitative analysis of knowledge-learning preferences in large language models in molecular science.](https://arxiv.org/abs/2402.04119v2) | [GitHub](https://github.com/AI-HPC-Research-Team/SLM4Mol) | 341 | | MMCR-Bench | [Chemvlm: Exploring the power of multimodal large language models in chemistry area.](https://arxiv.org/abs/2408.07246) | [GitHub](https://github.com/jiaqingxie/ChemVLM-Mechanism) | 342 | | MACBench | [Probing the limitations of multimodal language models for chemistry and materials research.](https://www.researchgate.net/publication/386144119_Probing_the_limitations_of_multimodal_language_models_for_chemistry_and_materials_research) | [GitHub](https://github.com/lamalab-org/macbench) | 343 | | 3D-MoLM | [Towards 3d molecule-text interpretation in language models.](https://arxiv.org/abs/2401.13923) | [GitHub](https://github.com/lsh0520/3D-MoLM) | 344 | | M3-20M | [M3-20m: A large-scale multi-modal molecule dataset for ai-driven drug design and discovery.](https://arxiv.org/abs/2412.06847v2) | [GitHub](https://github.com/bz99bz/M-3) | 345 | | MassSpecGym | [Massspecgym: A benchmark for the discovery and identification of molecules.](https://arxiv.org/abs/2410.23326v1) | [GitHub](https://polarishub.io/datasets/roman-bushuiev/massspecgym) | 346 | | MolPuzzle | [Can llms solve molecule puzzles? 
a multimodal benchmark for molecular structure elucidation.](https://kehanguo2.github.io/Molpuzzle.io/paper/SpectrumLLM__Arxiv_.pdf) | [GitHub](https://kehanguo2.github.io/Molpuzzle.io/) | 347 | 348 | #### Finance 349 | 350 | | Benchmark | Paper | Project Page | 351 | | :------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 352 | | FinMME | [FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation](https://arxiv.org/abs/2505.24714) | [Github](https://github.com/luo-junyu/FinMME) | 353 | | FAMMA | [FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering](https://arxiv.org/abs/2410.04526) | [Project Page](https://famma-bench.github.io/famma/) | 354 | | MME-Finance | [MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning](https://arxiv.org/abs/2411.03314) | [Project Page](https://hithink-research.github.io/MME-Finance/) | 355 | | MultiFinBen | [MultiFinBen: A Comprehensive Multimodal Financial Benchmark](https://arxiv.org/abs/2506.14028) | [Github](https://github.com/xueqingpeng/MultiFinBen) | 356 | | CFBenchmark-MM | [CFBenchmark-MM: A Comprehensive Multimodal Financial Benchmark](https://www.arxiv.org/abs/2506.13055) | [Github](https://github.com/TongjiFinLab/CFBenchmark) | 357 | | FinMMR | [FinMMR: Multimodal Financial Reasoning Benchmark](https://arxiv.org/abs/2508.04625v1) | [Github](https://github.com/BUPT-Reasoning-Lab/FinMMR) | 358 | | Fin-Fact | [Fin-Fact: Financial Fact Checking Dataset](https://arxiv.org/abs/2309.08793) | [Github](https://github.com/IIT-DM/Fin-Fact/) | 359 | | FCMR | [FCMR: Financial Multimodal Reasoning](https://arxiv.org/abs/2412.12567) | | 360 | | FinTral | [FinTral: Financial Translation and Analysis](https://arxiv.org/abs/2402.10986) | [Github](https://github.com/UBC-NLP/fintral) | 361 | | Open-FinLLMs | [Open-FinLLMs: Open Financial Large Language Models](https://arxiv.org/abs/2408.11878) | [Hugging Face](https://huggingface.co/collections/TheFinAI/open-finllms-66b671f2b4958a65e20decbe) | 362 | | FinGAIA | [FinGAIA: Financial AI Assistant](https://arxiv.org/abs/2507.17186) | [Github](https://github.com/SUFE-AIFLM-Lab/FinGAIA) | 363 | 364 | #### Healthcare & Medical Science 365 | 366 | | Benchmark | Paper | Project Page | 367 | | :-----------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 368 | | VQA-RAD | [VQA-RAD: Visual Question Answering Radiology Dataset](https://www.nature.com/articles/sdata2018251) | [Project Page](https://www.nature.com/articles/sdata2018251) | 369 | | PathVQA | [PathVQA: Pathology Visual Question Answering](https://arxiv.org/abs/2003.10286) | [Github](https://github.com/KaveeshaSilva/PathVQA) | 370 | | RP3D-DiagDS | [RP3D-DiagDS: 3D Medical Diagnosis Dataset](https://arxiv.org/abs/2312.16151) | [Project Page](https://qiaoyu-zheng.github.io/RP3D-Diag/) | 371 | | PubMedQA | [PubMedQA: Medical Question Answering Dataset](https://arxiv.org/abs/1909.06146) | [Project Page](https://pubmedqa.github.io/) | 372 | | HealthBench | [HealthBench: Medical AI Benchmark](https://arxiv.org/abs/2505.08775) | [Project Page](https://www.healthbench.co/) | 373 | | GMAI-MMBench | [GMAI-MMBench: General Medical AI Multimodal Benchmark](https://arxiv.org/abs/2408.03361) | [Project Page](https://uni-medical.github.io/GMAI-MMBench.github.io/#2023xtuner) | 374 | | OpenMM-Medical | [OpenMM-Medical: Open Medical 
Multimodal Model](https://arxiv.org/abs/2501.15368) | [Github](https://github.com/baichuan-inc/Baichuan-Omni-1.5) | 375 | | Genomics-Long-Range | [Genomics-Long-Range: Long-Range Genomic Benchmark](https://openreview.net/forum?id=8O9HLDrmtq) | [Hugging Face](https://huggingface.co/datasets/InstaDeepAI/genomics-long-range-benchmark) | 376 | | Genome-Bench | [Genome-Bench: Comprehensive Genomics Benchmark](https://arxiv.org/abs/2505.19501v1) | [Hugging Face](https://huggingface.co/datasets/Mingyin0312/Genome-Bench) | 377 | | MedAgentsBench | [MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning](https://arxiv.org/pdf/2503.07459) | [Github](https://github.com/gersteinlab/medagents-benchmark) | 378 | | MedQ-Bench | [MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs](https://arxiv.org/abs/2510.01691) | [Github](https://github.com/liujiyaoFDU/MedQ-Bench) | 379 | 380 | #### Code 381 | 382 | | Benchmark | Paper | Project Page | 383 | | :-------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 384 | | Design2Code | [Design2Code: From Design Mockups to Code](https://arxiv.org/abs/2403.03163) | [Project Page](https://salt-nlp.github.io/Design2Code/) | 385 | | Web2Code | [Web2Code: Web-to-Code Generation](https://arxiv.org/abs/2406.20098) | [Project Page](https://mbzuai-llm.github.io/webpage2code/) | 386 | | Plot2Code | [Plot2Code: From Charts to Code](https://arxiv.org/abs/2405.07990) | [Hugging Face](https://huggingface.co/datasets/TencentARC/Plot2Code) | 387 | | ChartMimic | [ChartMimic: Chart Understanding and Generation](https://arxiv.org/abs/2406.09961) | [Project Page](https://chartmimic.github.io/) | 388 | | HumanEval-V | [HumanEval-V: Visual Code Generation Benchmark](https://arxiv.org/abs/2410.12381) | [Project Page](https://humaneval-v.github.io/) | 389 | | Code-Vision | [Code-Vision: Visual Code Understanding](https://arxiv.org/abs/2502.11829) | [Github](https://github.com/wanghanbinpanda/CodeVision) | 390 | | SWE-bench Multi-modal | [SWE-bench Multi-modal: Software Engineering Benchmark](https://arxiv.org/abs/2410.03859) | [Project Page](https://www.swebench.com/multimodal.html) | 391 | | MMCode | [MMCode: Multimodal Code Generation](https://arxiv.org/abs/2404.09486) | [Github](https://github.com/likaixin2000/MMCode) | 392 | | M²Eval | [M²Eval: Multimodal Code Evaluation](https://arxiv.org/abs/2507.08719) | [Github](https://github.com/MCEVAL/MMCoder) | 393 | | BigDocs-Bench | [BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks](https://arxiv.org/abs/2412.04626) | [Project Page](https://bigdocs.github.io/) [GitHub](https://github.com/Big-Docs/BigDocs) | 394 | 395 | 396 | #### Autonomous Driving 397 | 398 | | Benchmark | Paper | Project Page | 399 | | :----------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 400 | | Rank2Tell | [Rank2Tell: Ranking-based Visual Storytelling](https://arxiv.org/abs/2309.06597) | [Project Page](https://usa.honda-ri.com/rank2tell) | 401 | | DRAMA | [DRAMA: Dynamic Risk Assessment for Autonomous Vehicles](https://arxiv.org/abs/2209.10767) | [Project Page](https://usa.honda-ri.com/drama) | 402 | | NuScenes-QA | [NuScenes-QA: Autonomous Driving Question
Answering](https://arxiv.org/abs/2305.14836) | [Github](https://github.com/qiantianwen/NuScenes-QA) | 403 | | LingoQA | [LingoQA: Driving Language Understanding](https://arxiv.org/abs/2312.14115) | [Github](https://github.com/wayveai/LingoQA) | 404 | | V2V-LLM | [V2V-LLM: Vehicle-to-Vehicle Communication](https://arxiv.org/abs/2502.09980) | [Project Page](https://eddyhkchiu.github.io/v2vllm.github.io/) | 405 | | MAPLM-QA | [MAPLM: Real-World Large-Scale Vision-Language Benchmark for Map and Traffic](https://openaccess.thecvf.com/content/CVPR2024/papers/Cao_MAPLM_A_Real-World_Large-Scale_Vision-Language_Benchmark_for_Map_and_Traffic_CVPR_2024_paper.pdf) | [Github](https://github.com/LLVM-AD/MAPLM) | 406 | | SURDS | [SURDS: Autonomous Driving Dataset](https://github.com/XiandaGuo/Drive-MLLM) | [Github](https://github.com/XiandaGuo/Drive-MLLM) | 407 | | AD2-Bench | [AD2-Bench: Autonomous Driving Benchmark](https://arxiv.org/abs/2506.09557) | | 408 | | DriveAction | [DriveAction: Driving Action Recognition](https://arxiv.org/abs/2506.05667v1) | [Hugging Face](https://huggingface.co/datasets/LiAuto-DriveAction/drive-action) | 409 | | DriveLMM-o1 | [DriveLMM-o1: Driving Language Model](https://arxiv.org/abs/2503.10621) | [Github](https://github.com/mbzuai-oryx/DriveLMM-o1) | 410 | | DriveVLM | [DriveVLM: Vision-Language Model for Driving](https://arxiv.org/abs/2402.12289) | [Project Page](https://tsinghua-mars-lab.github.io/DriveVLM/) | 411 | | RoboTron-Sim | [RoboTron-Sim: Robot Simulation Platform](https://www.arxiv.org/abs/2508.04642) | [Project Page](https://stars79689.github.io/RoboTron-Sim/) | 412 | | IDKB | [IDKB: Intelligent Driving Knowledge Base](https://arxiv.org/abs/2409.02914) | [Project Page](https://4dvlab.github.io/project_page/idkb.html) | 413 | | VLADBench | [VLADBench: Vision-Language-Action Driving Benchmark](https://arxiv.org/abs/2503.21505) | [Github](https://github.com/Depth2World/VLADBench) | 414 | | DriVQA | [DriVQA: Driving Visual Question Answering](https://www.sciencedirect.com/science/article/pii/S235234092500099X) | | 415 | | ADGV-Bench | [Are AI-Generated Driving Videos Ready for Autonomous Driving? 
A Diagnostic Evaluation Framework](https://arxiv.org/abs/2512.06376) | *Not available* | 416 | 417 | #### Earth Science / Remote Sensing 418 | 419 | | Benchmark | Paper | Project Page | 420 | | :---------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 421 | | GEOBench-VLM | [GEOBench-VLM: Geospatial Vision-Language Model Benchmark](https://arxiv.org/abs/2411.19325) | [Project Page](http://the-ai-alliance.github.io/GEO-Bench-VLM) | 422 | | ClimaQA | [ClimaQA: Climate Question Answering](https://arxiv.org/abs/2410.16701) | [Github](https://github.com/Rose-STL-Lab/genie-climaqa) | 423 | | ClimateBERT | [ClimateBERT: Climate Language Model](https://arxiv.org/abs/2110.12010) | [Project Page](https://www.chatclimate.ai/climatebert) | 424 | | WeatherQA | [WeatherQA: Weather Question Answering](https://arxiv.org/abs/2406.11217) | [Github](https://github.com/chengqianma/WeatherQA) | 425 | | OceanBench | [OceanBench: Ocean Data Analysis Benchmark](https://arxiv.org/abs/2309.15599) | [Project Page](https://jejjohnson.github.io/oceanbench/content/overview.html) | 426 | | OmniEarth-Bench | [OmniEarth-Bench: Comprehensive Earth Observation](https://arxiv.org/abs/2505.23522) | [Hugging Face](https://huggingface.co/datasets/initiacms/OmniEarth-Bench) | 427 | | MSEarth | [MSEarth: Multi-Scale Earth Observation](https://arxiv.org/abs/2505.20740) | [Github](https://github.com/xiangyu-mm/MSEarth) | 428 | | EarthSE | [EarthSE: Earth System Evaluation](https://www.arxiv.org/abs/2505.17139) | [Hugging Face](https://huggingface.co/ai-earth) | 429 | | RSICD | [RSICD: Remote Sensing Image Captioning Dataset](https://arxiv.org/abs/1712.07835) | [Github](https://github.com/201528014227051/RSICD_optimal) | 430 | | NWPU-Captions | [NWPU-Captions: Remote Sensing Image Descriptions](https://arxiv.org/abs/2402.06475v1) | [Github](https://github.com/HaiyanHuang98/NWPU-Captions) | 431 | | RSVQA-HRBEN/LRBEN | [RSVQA: Remote Sensing Visual Question Answering](https://arxiv.org/abs/2003.07333) | [Project Page](https://rsvqa.sylvainlobry.com/) | 432 | | DIOR-RSVG | [DIOR-RSVG: Remote Sensing Visual Grounding](https://arxiv.org/pdf/2210.12634) | [Github](https://github.com/ZhanYang-nwpu/RSVG-pytorch) | 433 | | VRSBench | [VRSBench: Visual Remote Sensing Benchmark](https://arxiv.org/abs/2406.12384) | [Project Page](https://vrsbench.github.io/) | 434 | | LRS-VQA | [LRS-VQA: Large-scale Remote Sensing VQA](https://arxiv.org/abs/2503.07588) | [Github](https://github.com/VisionXLab/LRS-VQA) | 435 | | GeoChat-Bench | [GeoChat-Bench: Geospatial Conversation Benchmark](https://arxiv.org/pdf/2311.15826) | [Github](https://github.com/mbzuai-oryx/geochat) | 436 | | XLRS-Bench | [XLRS-Bench: Extra-Large Ultra-High-Resolution Remote Sensing Benchmark](https://arxiv.org/abs/2503.23771) | [Project Page](https://xlrs-bench.github.io/) | 437 | | RSIEval | [RSIEval: Remote Sensing Image Evaluation](https://arxiv.org/abs/2307.15266) | [Github](https://github.com/Lavender105/RSGPT) | 438 | | UrBench | [UrBench: Urban Remote Sensing Benchmark](https://arxiv.org/abs/2408.17267) | [Project Page](https://opendatalab.github.io/UrBench/) | 439 | | CHOICE | [CHOICE: Comprehensive Remote Sensing Benchmark](https://arxiv.org/abs/2411.18145) | [Github](https://github.com/ShawnAn-WHU/CHOICE) | 440 | | SARChat-Bench-2M | [SARChat-Bench-2M: SAR Image Understanding](https://arxiv.org/abs/2502.08168) | [Github](https://github.com/JimmyMa99/SARChat) | 441 | | LHRS-Bench | [LHRS-Bench: Large-scale 
High-Resolution Remote Sensing](https://arxiv.org/abs/2402.02544) | [Github](https://github.com/NJU-LHRS/LHRS-Bot) | 442 | | FIT-RSFG | [FIT-RSFG: Remote Sensing Fine-Grained Recognition](https://arxiv.org/abs/2406.10100) | [Github](https://github.com/Luo-Z13/SkySenseGPT) | 443 | | VLEO-Bench | [VLEO-Bench: Benchmarking Vision-Language Models on Earth Observation Data](https://arxiv.org/abs/2401.17600) | [Project Page](https://vleo.danielz.ch/) | 444 | | NAIP-OSM | [NAIP-OSM: Aerial Imagery and Map Alignment](https://arxiv.org/pdf/2110.04690) | [Project Page](https://favyen.com/muno21/) | 445 | 446 | #### Embodied Tasks 447 | 448 | | Benchmark | Paper | Project Page | 449 | | :---------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 450 | | Embodied Question Answering (EQA) | [Embodied Question Answering](https://arxiv.org/abs/1711.11543) | [Project Page](https://embodiedqa.org/) | 451 | | R2R (Room-to-Room) | [R2R: Room-to-Room Navigation](https://arxiv.org/abs/1711.07280) | [Github](https://github.com/xjli/r2r_vln) | 452 | | Reverie | [Reverie: Remote Embodied Visual Referring Expression](https://arxiv.org/abs/1904.10151) | [Project Page](https://yuankaiqi.github.io/REVERIE_Challenge/dataset.html) | 453 | | Alfred | [Alfred: A Benchmark for Interpreting Grounded Instructions](https://arxiv.org/abs/1912.01734) | [Github](https://github.com/askforalfred/alfred) | 454 | | Calvin | [Calvin: Long-Horizon Language-Conditioned Robot Learning](https://arxiv.org/abs/2112.03227) | [Github](https://github.com/mees/calvin) | 455 | | EPIC-KITCHENS | [EPIC-KITCHENS: Large Scale Dataset in First Person Vision](https://arxiv.org/abs/2005.00343) | [Project Page](https://epic-kitchens.github.io/) | 456 | | Ego4D | [Ego4D: Around the World in 3,000 Hours](https://arxiv.org/abs/2110.07058) | [Project Page](https://ego4d-data.org/) | 457 | | EMQA | [EMQA: Ego-centric Multimodal Question Answering](https://arxiv.org/abs/2205.01652) | [Github](https://github.com/lbaermann/qaego4d) | 458 | | SQA3D | [SQA3D: Situated Question Answering in 3D Scenes](https://arxiv.org/abs/2210.07474) | [Project Page](https://sqa3d.github.io/) | 459 | | Open-EQA | [Open-EQA: Open-Vocabulary Embodied Question Answering](https://open-eqa.github.io/assets/pdfs/paper.pdf) | [Project Page](https://open-eqa.github.io/) | 460 | | HM-EQA | [HM-EQA: Hierarchical Multi-modal Embodied QA](https://arxiv.org/abs/2403.15941v1) | [Project Page](https://explore-eqa.github.io/) | 461 | | MOTIF | [MOTIF: Multimodal Object-Text Interaction Framework](https://arxiv.org/abs/2502.12479) | [Github](https://github.com/blt2114/MotifBench) | 462 | | EgoTaskQA | [EgoTaskQA: Understanding Tasks in Egocentric Videos](https://arxiv.org/abs/2210.03929) | [Project Page](https://sites.google.com/view/egotaskqa) | 463 | | EmbodiedScan | [EmbodiedScan: Holistic Multi-Modal 3D Perception](https://arxiv.org/abs/2312.16170) | [Project Page](https://tai-wang.github.io/embodiedscan/) | 464 | | RH20T-P | [RH20T-P: Robotic Manipulation Dataset](https://arxiv.org/abs/2403.19622) | [Project Page](https://rh20t.github.io/) | 465 | | EXPRESS-Bench | [EXPRESS-Bench: Embodied Question Answering](https://arxiv.org/abs/2503.11117) | [Github](https://github.com/HCPLab-SYSU/EXPRESS-Bench) | 466 | | EmbodiedEval | [EmbodiedEval: Embodied AI Evaluation](https://arxiv.org/abs/2501.11858) | [Project Page](https://embodiedeval.github.io/) | 467 | | Embodied Bench | [Embodied Bench: 
Comprehensive Embodied AI Evaluation](https://arxiv.org/abs/2502.09560v1) | [Project Page](https://embodiedbench.github.io/) | 468 | | VLABench | [VLABench: Vision-Language-Action Benchmark](https://arxiv.org/abs/2412.18194) | [Project Page](https://vlabench.github.io/) | 469 | | EWMBench | [EWMBench: Embodied World Model Benchmark](https://arxiv.org/abs/2505.09694) | [Github](https://github.com/AgibotTech/EWMBench) | 470 | | NeurIPS 2025 Embodied Agent Interface Challenge | NeurIPS 2025 Embodied Agent Interface Challenge | [Project Page](https://neurips25-eai.github.io/) | 471 | | SEER-Bench | [Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration](https://arxiv.org/abs/2512.02458) | *Not available* | 472 | | ReMindView-Bench | [Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective](https://arxiv.org/abs/2512.02340) | [Github](https://github.com/pittisl/ReMindView-Bench) | 473 | 474 | #### AI Agent 475 | | Benchmark | Paper | Project Page | 476 | | :--------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 477 | | Self-rag | [Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection](https://arxiv.org/abs/2310.11511) | [GitHub](https://github.com/AkariAsai/self-rag) | 478 | | AMEM | [A-MEM: Agentic Memory for LLM Agents](https://arxiv.org/abs/2502.12110) | [GitHub](https://github.com/agiresearch/A-mem) | 479 | 480 | ## Generation Evaluation 481 | 482 | ### Image 483 | 484 | | Benchmark | Paper | Project Page | 485 | | :--------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 486 | | DiffusionDB | [DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models](https://arxiv.org/pdf/2210.14896) | [GitHub](https://poloclub.github.io/diffusiondb/) | 487 | | HPD、HPD v2、HPS | [Human Preference Score: Better Aligning Text-to-Image Models with Human Preference](https://arxiv.org/pdf/2303.14420) | [GitHub](https://tgxs002.github.io/align_sd_web/) | 488 | | ImageReward、ImageReward/ReFL | [ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation](https://arxiv.org/pdf/2304.05977) | [GitHub](https://github.com/zai-org/ImageReward) | 489 | | Pick-A-Pic、PickScore | [Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation](https://arxiv.org/pdf/2305.01569) | [GitHub](https://github.com/yuvalkirstain/PickScore) | 490 | | AGIQA-1K | [A Perceptual Quality Assessment Exploration for AIGC Images](https://arxiv.org/pdf/2303.12618) | [GitHub](https://github.com/lcysyzxdxc/AGIQA-1k-Database/tree/main) | 491 | | AGIQA-3K | [AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2306.04717) | [GitHub](https://github.com/lcysyzxdxc/AGIQA-3k-Database) | 492 | | AIGCIQA2023 | [AIGCIQA2023: A Large-scale Image Quality Assessment Database for AI Generated Images: from the Perspectives of Quality, Authenticity and Correspondence](https://arxiv.org/pdf/2307.00211) | [Hugging Face](https://huggingface.co/datasets/strawhat/aigciqa2023) | 493 | | AGIN、JOINT | [Exploring the Naturalness of AI-Generated Images](https://arxiv.org/pdf/2312.05476) | [GitHub](https://github.com/zijianchen98/AGIN?utm_source=catalyzex.com) | 494 | | AIGIQA-20K | [AIGIQA-20K: A Large 
Database for AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2404.03407) | [Hugging Face](https://huggingface.co/datasets/strawhat/aigciqa-20k) | 495 | | AIGCOIQA2024 | [AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images](https://arxiv.org/pdf/2404.01024) | [GitHub](https://github.com/IntMeGroup/AIGCOIQA) | 496 | | CMC-Bench | [CMC-Bench: Towards a New Paradigm of Visual Signal Compression](https://arxiv.org/pdf/2406.09356) | [GitHub](https://github.com/Q-Future/CMC-Bench) | 497 | | PKU-I2IQA、NR/FR-AIGCIQA | [PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images](https://arxiv.org/pdf/2311.15556) | [GitHub](https://github.com/jiquan123/I2IQA) | 498 | | SeeTRUE | [What You See is What You Read? Improving Text-Image Alignment Evaluation](https://arxiv.org/pdf/2305.10400) | [GitHub](https://wysiwyr-itm.github.io/) | 499 | | AIGCIQA2023+、MINT-IQA | [Quality Assessment for AI Generated Images with Instruction Tuning](https://arxiv.org/pdf/2405.07346) | [GitHub](https://github.com/IntMeGroup/MINT-IQA) | 500 | | Q-Eval-100K、Q-Eval Score | [Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content](https://arxiv.org/pdf/2503.02357) | [GitHub](https://github.com/zzc-1998/Q-Eval) | 501 | | Measuring the Quality of Text-to-Video Model Outputs | [Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset](https://arxiv.org/pdf/2309.08009) | [Dataset Download](https://figshare.com/articles/dataset/Text_prompts_and_videos_generated_using_5_popular_Text-to-Video_models_plus_quality_metrics_including_user_quality_assessments/24078045) | 502 | | EvalCrafter | [EvalCrafter: Benchmarking and Evaluating Large Video Generation Models](https://arxiv.org/pdf/2310.11440) | [GitHub](https://evalcrafter.github.io/) | 503 | | FETV | [FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation](https://arxiv.org/pdf/2311.01813) | [GitHub](https://github.com/llyx97/FETV) | 504 | | VBench | [VBench: Comprehensive Benchmark Suite for Video Generative Models](https://arxiv.org/pdf/2311.17982) | [GitHub](https://vchitect.github.io/VBench-project/) | 505 | | T2VQA-DB | [Subjective-Aligned Dataset and Metric for Text-to-Video Quality](https://arxiv.org/pdf/2403.11956) | [GitHub](https://github.com/QMME/T2VQA) | 506 | | GAIA | [GAIA: Rethinking Action Quality Assessment for AI-Generated Videos](https://arxiv.org/pdf/2406.06087) | [GitHub](https://github.com/zijianchen98/GAIA?utm_source=catalyzex.com) | 507 | | AIGVQA-DB、AIGV-Assessor | [AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM](https://arxiv.org/pdf/2411.17221) | [GitHub](https://github.com/wangjiarui153/AIGV-Assessor) | 508 | | AIGVE-60K、LOVE | [LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation](https://arxiv.org/pdf/2505.12098) | [GitHub](https://github.com/IntMeGroup/LOVE) | 509 | | Human-AGVQA-DB、GHVQ | [Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric](https://arxiv.org/pdf/2411.16619) | [GitHub](https://github.com/zczhang-sjtu/GHVQ) | 510 | | TDVE-DB、TDVE-Assessor | [TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs](https://arxiv.org/pdf/2505.19535) | [GitHub](https://github.com/JuntongWang/TDVE-Assessor) | 511 | | AGAVQA-3K、AGAV-Rater | [AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality 
Assessment](https://arxiv.org/pdf/2501.18314) | [GitHub](https://github.com/charlotte9524/AGAV-Rater) | 512 | | Qwen-ALLD | [Audio Large Language Models Can Be Descriptive Speech Quality Evaluators](https://arxiv.org/pdf/2501.17202) | [Hugging Face](https://huggingface.co/datasets/PeacefulData/speech-quality-descriptive-caption) | 513 | | BASE-TTS | [BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data](https://arxiv.org/pdf/2402.08093) | [Audio samples of BASE-TTS](https://www.amazon.science/base-tts-samples) | 514 | | ATT | [Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese](https://arxiv.org/pdf/2505.11200) | [Hugging Face](https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4) | 515 | | TTSDS2 | [TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems](https://arxiv.org/pdf/2506.19441) | [Website](https://ttsdsbenchmark.com/) | 516 | | MATE-3D、LGVQ | [Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation](https://arxiv.org/pdf/2412.11170) | [GitHub](https://mate-3d.github.io/) | 517 | | 3DGCQA | [3DGCQA: A Quality Assessment Database for 3D AI-Generated Contents](https://arxiv.org/pdf/2409.07236) | [GitHub](https://github.com/zyj-2000/3DGCQA) | 518 | | AIGC-T23DAQA | [Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model](https://arxiv.org/pdf/2502.16915) | [GitHub](https://github.com/ZedFu/T23DAQA?utm_source=catalyzex.com) | 519 | | SI23DCQA | [SI23DCQA: Perceptual Quality Assessment of Single Image-to-3D Content](https://jhc.sjtu.edu.cn/~xiaohongliu/papers/2025SI23DCQA.pdf) | [GitHub](https://github.com/ZedFu/SI23DCQA) | 520 | | 3DGS-IEval-15K | [3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting](https://arxiv.org/pdf/2506.14642) | [GitHub ](https://github.com/YukeXing/3DGS-IEval-15K) | 521 | | Inception Score (IS) | [Improved Techniques for Training GANs](https://arxiv.org/pdf/1606.03498) | [GitHub](https://github.com/openai/improved-gan) | 522 | | FVD | [Towards Accurate Generative Models of Video: A New Metric & Challenges](https://arxiv.org/pdf/1812.01717) | [GitHub](https://github.com/google-research/google-research/tree/master/frechet_video_distance) | 523 | | VQAScore | [Evaluating Text-to-Visual Generation with Image-to-Text Generation](https://arxiv.org/pdf/2404.01291) | [GitHub](https://linzhiqiu.github.io/papers/vqascore/) | 524 | | NTIRE 2024 AIGC QA | [NTIRE 2024 Quality Assessment of AI-Generated Content Challenge](https://arxiv.org/pdf/2404.16687) | [Website](https://cvlai.net/ntire/2024/) | 525 | | Q-Bench | [Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision](https://arxiv.org/pdf/2309.14181) | [GitHub](https://q-future.github.io/Q-Bench/) | 526 | | Q-instruct | [Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models](https://arxiv.org/pdf/2311.06783) | [GitHub](https://q-future.github.io/Q-Instruct/) | 527 | | Q-align | [Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels](https://arxiv.org/pdf/2312.17090) | [GitHub](https://github.com/Q-Future/Q-Align) | 528 | | Q-boost | [Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models](https://arxiv.org/pdf/2312.15300) | [Project Page](https://q-future.github.io/Q-Instruct/boost_qa/) | 529 | | Co-Instruct | [Towards Open-ended Visual Quality 
Comparison](https://arxiv.org/pdf/2402.16641) | [Hugging Face](https://huggingface.co/q-future/co-instruct) | 530 | | DepictQA | [Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models](https://arxiv.org/pdf/2312.08962) | [GitHub](https://depictqa.github.io/) | 531 | | M3-AGIQA | [M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2502.15167) | [GitHub](https://github.com/strawhatboy/M3-AGIQA) | 532 | | Q-Refine | [Q-Refine: A Perceptual Quality Refiner for AI-Generated Image](https://arxiv.org/pdf/2401.01117) | [GitHub](https://github.com/Q-Future/Q-Refine?utm_source=catalyzex.com) | 533 | | AGIQA | [Large Multi-modality Model Assisted AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2404.17762) | [GitHub](https://github.com/wangpuyi/MA-AGIQA) | 534 | | SF-IQA | [SF-IQA: Quality and Similarity Integration for AI Generated Image Quality Assessment](https://openaccess.thecvf.com/content/CVPR2024W/NTIRE/papers/Yu_SF-IQA_Quality_and_Similarity_Integration_for_AI_Generated_Image_Quality_CVPRW_2024_paper.pdf) | [GitHub](https://github.com/Yvzhh/SF-IQA) | 535 | | SC-AGIQA | [Text-Visual Semantic Constrained AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2507.10432) | [GitHub](https://github.com/mozhu1/SC-AGIQA) | 536 | | TSP-MGS | [AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity](https://arxiv.org/pdf/2411.16087) | *Not available* | 537 | | MoE-AGIQA | [MoE-AGIQA: Mixture-of-Experts Boosted Visual Perception-Driven and Semantic-Aware Quality Assessment for AI-Generated Images](https://openaccess.thecvf.com/content/CVPR2024W/NTIRE/papers/Yang_MoE-AGIQA_Mixture-of-Experts_Boosted_Visual_Perception-Driven_and_Semantic-Aware_Quality_Assessment_for_CVPRW_2024_paper.pdf) | [GitHub](https://github.com/37s/MoE-AGIQA) | 538 | | AMFF-Net | [Adaptive Mixed-Scale Feature Fusion Network for Blind AI-Generated Image Quality Assessment](https://arxiv.org/pdf/2404.15163) | [GitHub](https://github.com/TanSongBai/AMFF-Net) | 539 | | PSCR | [PSCR: Patches Sampling-based Contrastive Regression for AIGC Image Quality Assessment](https://arxiv.org/pdf/2312.05897) | [GitHub ](https://github.com/jiquan123/PSCR) | 540 | | TIER | [TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment](https://arxiv.org/pdf/2401.03854) | [GitHub](https://github.com/jiquan123/TIER?utm_source=catalyzex.com) | 541 | | IPCE | [AIGC Image Quality Assessment via Image-Prompt Correspondence](https://openaccess.thecvf.com/content/CVPR2024W/NTIRE/papers/Peng_AIGC_Image_Quality_Assessment_via_Image-Prompt_Correspondence_CVPRW_2024_paper.pdf) | [GitHub](https://github.com/pf0607/IPCE) | 542 | | RISEBench | [Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing](https://arxiv.org/pdf/2504.02826) | [GitHub](https://github.com/PhoenixZ810/RISEBench) | 543 | | GoT | [GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing](https://arxiv.org/pdf/2503.10639) | [GitHub ](https://github.com/rongyaofang/GoT) | 544 | | SmartEdit | [SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models](https://arxiv.org/pdf/2312.06739) | [GitHub ](https://yuzhou914.github.io/SmartEdit/) | 545 | | WISE | [WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation](https://arxiv.org/pdf/2503.07265) | 
[GitHub](https://github.com/PKU-YuanGroup/WISE) | 546 | | KRIS-Bench | [KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models](https://arxiv.org/pdf/2505.16707) | [GitHub](https://yongliang-wu.github.io/kris_bench_project_page/?utm_source=catalyzex.com) | 547 | | CoT-editing | [Enhancing Image Editing with Chain-of-Thought Reasoning and Multimodal Large Language Models](https://ieeexplore.ieee.org/document/10890562) | *Not available* | 548 | | GUIZoom-Bench | [Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding](https://arxiv.org/abs/2512.05941) | [Github](https://github.com/Princeton-AI2-Lab/ZoomClick) | 549 | | MICo-Bench | [MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition](https://arxiv.org/abs/2512.07348) | [Github](https://mico-150k.github.io) | 550 | | CS-Bench | [START: Spatial and Textual Learning for Chart Understanding](https://arxiv.org/abs/2512.07186) | *Not available* | 551 | | IF-Bench | [IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting](https://arxiv.org/abs/2512.09663) | [Github](https://github.com/casiatao/IF-Bench) | 552 | 553 | ### Video 554 | 555 | | Benchmark | Paper | Project Page | 556 | | :-------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 557 | | LMM-VQA | [LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models](https://arxiv.org/pdf/2408.14008) | [GitHub](https://github.com/Sueqk/LMM-VQA) | 558 | | FineVQ | [FineVQ: Fine-Grained User Generated Content Video Quality Assessment](https://arxiv.org/pdf/2412.19238) | [GitHub](https://github.com/IntMeGroup/FineVQ) | 559 | | VQA^2 | [VQA2 : Visual Question Answering for Video Quality Assessment](https://arxiv.org/pdf/2411.03795) | [GitHub](https://github.com/Q-Future/Visual-Question-Answering-for-Video-Quality-Assessment) | 560 | | Omni-VQA | [Scaling-up Perceptual Video Quality Assessment](https://arxiv.org/pdf/2505.22543) | *Not available* | 561 | | LMM-PVQA | [Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision](https://arxiv.org/pdf/2505.03631) | [GitHub](https://github.com/clh124/LMM-PVQA) | 562 | | Compare2Score paradigm | [Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare](https://arxiv.org/pdf/2405.19298) | [GitHub](https://compare2score.github.io/) | 563 | | VQ-Insight | [VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning](https://arxiv.org/pdf/2506.18564) | [GitHub](https://github.com/bytedance/Q-Insight) | 564 | | Who is a Better Talker | [Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads](https://arxiv.org/pdf/2507.23343) | [GitHub](https://github.com/zyj-2000/Talker) | 565 | | THQA | [THQA: A Perceptual Quality Assessment Database for Talking Heads](https://arxiv.org/pdf/2404.09003) | [GitHub](https://github.com/zyj-2000/THQA) | 566 | | Who is a Better Imitator | [Who is a Better Imitator: Subjective and Objective Quality Assessment of Animated Humans](https://jhc.sjtu.edu.cn/~xiaohongliu/papers/2025imitator.pdf) | [GitHub](https://github.com/zyj-2000/Imitator) | 567 | | MI3S | [MI3S: A multimodal large language model assisted quality assessment framework for AI-generated talking 
heads](https://www.sciencedirect.com/science/article/pii/S0306457325002626) | *Not available* | 568 | | An Implementation of Multimodal Fusion System | [An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation](https://arxiv.org/pdf/2310.20251) | [GitHub](https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker) | 569 | | RULER-Bench | [RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence](https://arxiv.org/abs/2512.02622) | [GitHub](https://hexmseeu.github.io/RULER-Bench-proj/) | 570 | | PAI-Bench | [PAI-Bench: A Comprehensive Benchmark For Physical AI](https://arxiv.org/abs/2512.01989) | [Github](https://github.com/SHI-Labs/physical-ai-bench) | 571 | | Tri-Bench | [Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference](https://arxiv.org/abs/2512.08860) | [Github](https://github.com/Amiton7/Tri-Bench) | 572 | | OpenVE-Bench | [OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing](https://arxiv.org/abs/2512.07826) | 
[Github](https://lewandofskee.github.io/projects/OpenVE/) | 573 | | RVE-Bench | [ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning](https://arxiv.org/abs/2512.09924) | [Github](https://github.com/Liuxinyv/ReViSE) | 574 | 575 | ### Audio 576 | 577 | | Benchmark | Paper | Project Page | 578 | | :-----------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 579 | | MOSNet | [MOSNet: Deep Learning-based Objective Assessment for Voice Conversion](https://arxiv.org/pdf/1904.08352) | [GitHub](https://github.com/lochenchou/MOSNet?utm_source=catalyzex.com) | 580 | | MOSA-Net+ | [A Study on Incorporating Whisper for Robust Speech Assessment](https://arxiv.org/pdf/2309.12766) | [Hugging Face](https://huggingface.co/papers/2309.12766) | 581 | | MOSLight | [MOSLight: A Lightweight Data-Efficient System for Non-Intrusive Speech Quality Assessment](https://www.isca-archive.org/interspeech_2023/li23c_interspeech.pdf) | *Not available* | 582 | | MBNet | [MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network](https://arxiv.org/pdf/2103.00110) | [GitHub](https://github.com/CalayZhou/MBNet?tab=readme-ov-file) | 583 | | DeePMOS | [DeePMOS: Deep Posterior Mean-Opinion-Score of Speech](https://www.researchgate.net/profile/Fredrik-Cumlin/publication/373249119_DeePMOS_Deep_Posterior_Mean-Opinion-Score_of_Speech/links/676e7db6c1b0135465f772f5/DeePMOS-Deep-Posterior-Mean-Opinion-Score-of-Speech.pdf) | [GitHub](https://github.com/Hope-Liang/DeePMOS) | 584 | | LDNet | [LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech](https://arxiv.org/pdf/2110.09103) | [GitHub](https://github.com/unilight/LDNet) | 585 | | ADTMOS | [ADTMOS – Synthesized Speech Quality Assessment Based on Audio Distortion Tokens](https://ieeexplore.ieee.org/document/10938851) | [GitHub](https://github.com/redifinition/ADTMOS) | 586 | | UAMOS | [Uncertainty-Aware Mean Opinion Score Prediction](https://arxiv.org/pdf/2408.12829) | *Not available* | 587 | | Audiobox Aesthetics | [Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound](https://arxiv.org/pdf/2502.05139) | [GitHub](https://github.com/facebookresearch/audiobox-aesthetics) | 588 | | HighRateMOS | [HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment](https://arxiv.org/pdf/2506.21951) | *Not available* | 589 | | ALLMs-as-Judges | [Audio-Aware Large Language Models as Judges for Speaking Styles](https://arxiv.org/pdf/2506.05984) | [Hugging Face](https://huggingface.co/papers/2506.05984) | 590 | | SALMONN | [SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/pdf/2310.13289) | [GitHub](https://github.com/bytedance/SALMONN) | 591 | | Qwen-Audio | [Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/pdf/2311.07919) | [GitHub](https://github.com/QwenLM/Qwen-Audio) | 592 | | Qwen2- Audio | [Qwen2-Audio Technical Report](https://arxiv.org/pdf/2407.10759) | [GitHub](https://github.com/QwenLM/Qwen2-Audio) | 593 | | natural language quality descriptions | [Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation](https://arxiv.org/pdf/2409.16644) | [GitHub](https://github.com/bytedance/SALMONN) | 594 | | QualiSpeech | [QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and 
Descriptions](https://arxiv.org/pdf/2503.20290) | [Hugging Face](https://huggingface.co/datasets/tsinghua-ee/QualiSpeech) | 595 | | DiscreteEval | [Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model](https://arxiv.org/pdf/2405.09768) | [GitHub](https://swatsw.github.io/lrec24_eval_slm/) | 596 | | EmergentTTS-Eval | [EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge](https://arxiv.org/pdf/2505.23009) | [GitHub](https://github.com/boson-ai/EmergentTTS-Eval-public) | 597 | | InstructTTSEval | [INSTRUCTTTSEVAL: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems](https://arxiv.org/pdf/2506.16381) | [GitHub](https://github.com/KexinHUANG19/InstructTTSEval) | 598 | | Mos-Bench | [MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models](https://arxiv.org/pdf/2411.03715) | [Hugging Face](https://huggingface.co/papers/2411.03715) | 599 | | SH-Bench | [Protecting Bystander Privacy via Selective Hearing in LALMs](https://arxiv.org/abs/2512.06380) | [Huggingface](https://huggingface.co/datasets/BrianatCambridge/SelectiveHearingBench) | 600 | | LISN-Bench | [LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating](https://arxiv.org/abs/2512.09920) | [Github](https://social-nav.github.io/LISN-project/) | 601 | 602 | ### 3D 603 | 604 | | Benchmark | Paper | Project Page | 605 | | :--------------: | :----------------------------------------------------------: | :---------------------------------------------------------: | 606 | | NR-3DQA | [No-Reference Quality Assessment for 3D Colored Point Cloud and Mesh Models](https://arxiv.org/pdf/2107.02041) | [GitHub](https://github.com/zzc-1998/NR-3DQA) | 607 | | MM-PCQA | [MM-PCQA: Multi-Modal Learning for No-reference Point Cloud Quality Assessment](https://arxiv.org/pdf/2209.00244) | [GitHub](https://github.com/zzc-1998/MM-PCQA) | 608 | | GT23D-Bench | [GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark](https://arxiv.org/pdf/2412.09997) | [GitHub](https://gt23d-bench.github.io/) | 609 | | NeRF-NQA | [NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF and Neural View Synthesis Methods](https://arxiv.org/pdf/2412.08029) | [GitHub](https://github.com/VincentQQu/NeRF-NQA) | 610 | | Explicit-NeRF-QA | [Explicit-NeRF-QA: A Quality Assessment Database for Explicit NeRF Model Compression](https://arxiv.org/pdf/2407.08165) | [GitHub](https://github.com/YukeXing/Explicit-NeRF-QA) | 611 | | NeRF-QA | [NeRF-QA: Neural Radiance Fields Quality Assessment Database](https://arxiv.org/pdf/2305.03176) | [GitHub](https://github.com/pedrogcmartin/NeRF-QA-Database) | 612 | | NVS-QA | [NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics Evaluation](https://arxiv.org/pdf/2405.20078) | [GitHub](https://github.com/pedrogcmartin/NVS-QA) | 613 | | GSQA | [GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis](https://arxiv.org/pdf/2502.13196) | [GitHub](https://github.com/pedrogcmartin/GS-QA) | 614 | | GPT-4V Evaluator | [GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation](https://arxiv.org/pdf/2401.04092) | [GitHub](https://gpteval3d.github.io/) | 615 | | Eval3D | [Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation](https://arxiv.org/pdf/2504.18509) | [GitHub](https://eval3d.github.io/) | 616 | | 3DGen-Bench | [3DGen-Bench: Comprehensive Benchmark 
Suite for 3D Generative Models](https://arxiv.org/pdf/2503.21745) | [GitHub](https://zyh482.github.io/3DGen-Bench/) | 617 | | LMM-PCQA | [LMM-PCQA: Assisting Point Cloud Quality Assessment with LMM](https://arxiv.org/pdf/2404.18203?) | [GitHub](https://github.com/Q-Future/LMM-PCQA) | 618 | 619 | ## Leaderboards and Tools 620 | 621 | | Platform / Benchmark | Paper | Project Page | 622 | | :-----------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | 623 | | LMMs-Eval | [LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models](https://arxiv.org/pdf/2407.12772) | [GitHub](https://github.com/EvolvingLMMs-Lab/lmms-eval) | 624 | | GenAI-Arena | [Genai arena: An open evaluation platform for generative models](https://arxiv.org/pdf/2406.04485) | [Hugging Face](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena) | 625 | | OpenCompass | | [GitHub](https://github.com/open-compass/opencompass) | 626 | | Epoch AI’s Benchmarking Hub | | [Website](https://epoch.ai/benchmarks) | 627 | | Artificial Analysis | | [Website](https://artificialanalysis.ai/) | 628 | | Scale’s SEAL Leaderboards | | [Website](https://scale.com/leaderboard) | 629 | | FlagEval | [FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation](https://arxiv.org/pdf/2506.09081) | [Website](https://flageval.baai.ac.cn/#/home) | 630 | | AGI-Eval | [AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](https://arxiv.org/pdf/2304.06364) | [Website](https://agi-eval.cn/mvp/listSummaryIndex) | 631 | | ReLE | | [Website](https://nonelinear.com/static/benchmarking.html) | 632 | | VLMEvalKit、OpenVLM Leaderboard | [VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models](https://arxiv.org/pdf/2407.11691) | [GitHub](https://github.com/open-compass/VLMEvalKit) | 633 | | HELM | [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110) | [Project Page](https://crfm.stanford.edu/helm/classic/v0.3.0/) | 634 | | LiveBench | [LiveBench: A Challenging, Contamination-Limited LLM Benchmark](https://arxiv.org/pdf/2406.19314) | [Project Page](https://livebench.ai/#/) | 635 | | SuperCLUE | [SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark](https://arxiv.org/pdf/2307.15020) | [Website](https://www.cluebenchmarks.com/) | 636 | | AIBench | [AIbench: Towards Trustworthy Evaluation Under the 45 Law](https://www.researchgate.net/profile/Zicheng-Zhang-9/publication/393362210_AIBENCH_TOWARDS_TRUSTWORTHY_EVALUA-_TION_UNDER_THE_45LAW/links/6867747be4632b045dc9b47c/AIBENCH-TOWARDS-TRUSTWORTHY-EVALUA-TION-UNDER-THE-45LAW.pdf) | [Website](https://aiben.ch) | 637 | | FutureX | [FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction](https://arxiv.org/pdf/2508.11987) | [Github](https://futurex-ai.github.io/) | 638 | 639 | --------------------------------------------------------------------------------