1 | # Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models 2 | An up-to-date collection and survey of vision-language model papers, models, and GitHub repositories 3 | 4 | Below we compile *awesome* papers, models, and GitHub repositories: 5 | - **State-of-the-Art VLMs** A collection of VLMs from newest to oldest (we'll keep adding new models and benchmarks). 6 | - **Evaluate** VLM benchmarks, with links to the corresponding works. 7 | - **Post-training/Alignment** The newest work on VLM alignment, including RL and SFT. 8 | - **Applications** Applications of VLMs in embodied AI, robotics, etc. 9 | - Contribute **surveys**, **perspectives**, and **datasets** on the above topics. 10 | 11 | 12 | Contributions and discussions are welcome! 13 | 14 | --- 15 | 16 | 🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper. 17 | 18 | --- 19 | 20 | ## Table of Contents 21 | * [📄 Paper Link](https://arxiv.org/abs/2501.02189)/[⛑️ Citation](#Citations) 22 | * 1. [📚 SoTA VLMs](#vlms) 23 | * 2. [🗂️ Dataset and Evaluation](#Dataset) 24 | * 2.1. [Large Scale Pre-Training & Post-Training Dataset](#TrainingDatasetforVLM) 25 | * 2.2. [Datasets and Evaluation for VLM](#DatasetforVLM) 26 | * 2.3. [Benchmark Datasets, Simulators and Generative Models for Embodied VLM](#DatasetforEmbodiedVLM) 27 | 28 | * 3. 🔥 [Post-Training/Alignment/Prompt Engineering](#posttraining) 🔥 29 | * 3.1. [RL Alignment for VLM](#alignment) 30 | * 3.2. [Regular finetuning (SFT)](#sft) 31 | * 3.3. [VLM Alignment Github](#vlm_github) 32 | * 3.4. [Prompt Engineering](#vlm_prompt_engineering) 33 | 34 | * 4. [⚒️ Applications](#Toolenhancement) 35 | * 4.1. [Embodied VLM agents](#EmbodiedVLMagents) 36 | * 4.2. [Generative Visual Media Applications](#GenerativeVisualMediaApplications) 37 | * 4.3. [Robotics and Embodied AI](#RoboticsandEmbodiedAI) 38 | * 4.3.1. [Manipulation](#Manipulation) 39 | * 4.3.2. [Navigation](#Navigation) 40 | * 4.3.3. [Human-robot Interaction](#HumanRobotInteraction) 41 | * 4.3.4. [Autonomous Driving](#AutonomousDriving) 42 | * 4.4. [Human-Centered AI](#Human-CenteredAI) 43 | * 4.4.1. [Web Agent](#WebAgent) 44 | * 4.4.2. [Accessibility](#Accessibility) 45 | * 4.4.3. [Healthcare](#Healthcare) 46 | * 4.4.4. [Social Goodness](#SocialGoodness) 47 | * 5. [⛑️ Challenges](#Challenges) 48 | * 5.1. [Hallucination](#Hallucination) 49 | * 5.2. [Safety](#Safety) 50 | * 5.3. [Fairness](#Fairness) 51 | * 5.4. [Alignment](#Alignment) 52 | * 5.4.1. [Multi-modality Alignment](#MultimodalityAlignment) 53 | * 5.4.2. [Commonsense and Physics Alignment](#CommonsenseAlignment) 54 | * 5.5. [Efficient Training and Fine-Tuning](#EfficientTrainingandFineTuning) 55 | * 5.6. [Scarcity of High-quality Datasets](#ScarceofHighqualityDataset) 56 | 57 | 58 | ## 0. Citation 59 | 60 | ``` 61 | @InProceedings{Li_2025_CVPR, 62 | author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao}, 63 | title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges}, 64 | booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, 65 | month = {June}, 66 | year = {2025}, 67 | pages = {1587-1606} 68 | } 69 | ``` 70 | 71 | --- 72 | 73 | ## 1. 
📚 SoTA VLMs 74 | | Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model | 75 | |--------------------------------------------------------------|------|----------------|-----------------------------|----------------|-----------------------------------------------|---------------------------------------------------| 76 | | [Gemini 3](https://aistudio.google.com/models/gemini-3) | 11/18/2025 | Unified Model |Undisclosed| - | - | - 77 | | [Emu3.5](https://arxiv.org/pdf/2510.26583) | 10/30/2025 | Decoder-only |Unified Modality Dataset | - | SigLIP | [Qwen3](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) 78 | | [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf) | 10/20/2025 | Encoder-Decoder |70% OCR, 20% general vision, 10% text-only | [3B](https://huggingface.co/deepseek-ai/DeepSeek-OCR) | DeepEncoder | DeepSeek-3B 79 | | [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) | 10/11/2025 | Decoder-Only |- | [8B/4B](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe) | ViT | [Qwen3](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) 80 | | [Qwen3-VL-MoE](https://github.com/QwenLM/Qwen3-VL) | 09/25/2025 | Decoder-Only |- | [235B-A22B](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe) | ViT | [Qwen3](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) 81 | | [Qwen3-Omni](https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf) (Visual/Audio/Text)| 09/21/2025 | - |Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker 82 | | [LLaVA-Onevision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)| 09/15/2025 | - |[Mid-Training-85M](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) & [SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | 8B | Qwen2VLImageProcessor | [Qwen3](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) 83 | | [InternVL3.5](https://arxiv.org/abs/2508.18265)| 08/25/2025 | Decoder-Only |multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | [Qwen3](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) / [GPT-OSS](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) 84 | | [SkyWork-Unipic-1.5B](https://huggingface.co/Skywork/Skywork-UniPic-1.5B)| 07/29/2025 | - |image/video.. | - | - | - 85 | | [Grok 4](https://x.ai/news/grok-4) | 07/09/2025 | - |image/video.. | 1-2 Trillion | - | - 86 | | [Kwai Keye-VL (Kuaishou)](https://arxiv.org/abs/2507.01949) | 07/02/2025 | Decoder-only |image/video.. | 8B | ViT | [QWen-3-8B](https://huggingface.co/Qwen/Qwen3-8B) 87 | | [OmniGen2](https://arxiv.org/abs/2506.18871) | 06/23/2025 | Decoder-only & VAE |LLaVA-OneVision/ SAM-LLaVA.. 
| - | ViT | [QWen-2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) 88 | | [Gemini-2.5-Pro](https://deepmind.google/models/gemini/pro/) | 06/17/2025 | - |-| - | - | - 89 | | [GPT-o3/o4-mini](https://openai.com/index/introducing-o3-and-o4-mini/) | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed | 90 | | [Mimo-VL (Xiaomi)](https://arxiv.org/abs/2506.03569) | 06/04/2025 | Decoder-only |24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | [Mimo-7B-base](https://huggingface.co/XiaomiMiMo/MiMo-7B-Base) 91 | | [BAGEL (Bytedance)](https://arxiv.org/abs/2505.14683) | 05/20/2025 | Unified Model | Video/Image/Text | 7B | [SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786) | [Qwen2.5](https://arxiv.org/abs/2412.15115) 92 | | [BLIP3-o](https://www.arxiv.org/abs/2505.09568) | 05/14/2025 | Decoder-only |(BLIP3-o 60K) GPT-4o Generated Image Generation Data | 4/8B | ViT | [QWen-2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) 93 | | [InternVL-3](https://arxiv.org/abs/2504.10479) | 04/14/2025 | Decoder-only |200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | [InternLM2.5/QWen2.5](https://huggingface.co/OpenGVLab/InternVL3-78B) 94 | | [LLaMA4-Scout/Maverick](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) | 04/04/2025 | Decoder-only |40/20 Trillion Tokens | 17B | [MetaClip](https://github.com/facebookresearch/MetaCLIP) | [LLaMA4](https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164) 95 | | [Qwen2.5-Omni](https://arxiv.org/abs/2503.20215) | 03/26/2025 | Decoder-only |Video/Audio/Image/Text | 7B |Qwen2-Audio/Qwen2.5-VL ViT | [End-to-End Mini-Omni](https://arxiv.org/abs/2408.16725) 96 | | [QWen2.5-VL](https://arxiv.org/abs/2502.13923) | 01/28/2025 | Decoder-only |Image caption, VQA, grounding agent, long video | 3B/7B/72B |Redesigned ViT | [Qwen2.5](https://huggingface.co/Qwen) 97 | | [Ola](https://arxiv.org/pdf/2502.04328) | 2025 | Decoder-only |Image/Video/Audio/Text | 7B |[OryxViT](https://huggingface.co/THUdyh/Oryx-ViT)| [Qwen-2.5-7B](https://qwenlm.github.io/blog/qwen2.5/), [SigLIP-400M](https://arxiv.org/pdf/2303.15343), [Whisper-V3-Large](https://arxiv.org/pdf/2212.04356), [BEATs-AS2M(cpt2)](https://arxiv.org/pdf/2212.09058) 98 | | [Ocean-OCR](https://arxiv.org/abs/2501.15558) | 2025 | Decoder-only | Pure Text, Caption, [Interleaved](https://github.com/OpenGVLab/MM-Interleaved), [OCR](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5) | 3B | [NaViT](https://arxiv.org/pdf/2307.06304) | Pretrained from scratch 99 | | [SmolVLM](https://huggingface.co/blog/smolervlm) | 2025 | Decoder-only | [SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/blob/main/smolvlm-data.pdf) | 250M & 500M | SigLIP | [SmolLM](https://huggingface.co/blog/smollm) 100 | | [DeepSeek-Janus-Pro](https://janusai.pro/wp-content/uploads/2025/01/janus_pro_tech_report.pdf) | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | [DeepSeek-Janus-Pro](https://huggingface.co/deepseek-ai/Janus-Pro-7B) | 101 | | [Inst-IT](https://arxiv.org/abs/2412.03565) | 2024 | Decoder-only | [Inst-IT Dataset](https://huggingface.co/datasets/Inst-IT/Inst-It-Dataset), [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) | 7B | CLIP/Vicuna, SigLIP/Qwen2 | [LLaVA-NeXT](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) | 102 | | [DeepSeek-VL2](https://arxiv.org/pdf/2412.10302) | 2024 | Decoder-only | [WiT](https://huggingface.co/datasets/google/wit), 
[WikiHow](https://huggingface.co/datasets/ajibawa-2023/WikiHow) | 4.5B x 74 | SigLIP/SAMB | [DeepSeekMoE](https://arxiv.org/pdf/2412.10302) | 103 | | [xGen-MM (BLIP-3)](https://arxiv.org/pdf/2408.08872) | 2024 | Decoder-only | [MINT-1T](https://arxiv.org/pdf/2406.11271), [OBELICS](https://arxiv.org/pdf/2306.16527), [Caption](https://github.com/salesforce/LAVIS/tree/xgen-mm?tab=readme-ov-file#data-preparation) | 4B | ViT + [Perceiver Resampler](https://arxiv.org/pdf/2204.14198) | [Phi-3-mini](https://arxiv.org/pdf/2404.14219) | 104 | | [TransFusion](https://arxiv.org/pdf/2408.11039) | 2024 | Encoder-decoder| Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture | 105 | | [Baichuan Ocean Mini](https://arxiv.org/pdf/2410.08565) | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | [Baichuan](https://arxiv.org/pdf/2309.10305) | 106 | | [LLaMA 3.2-vision](https://arxiv.org/pdf/2407.21783) | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | [LLaMA-3.1](https://arxiv.org/pdf/2407.21783) | 107 | | [Pixtral](https://arxiv.org/pdf/2410.07073) | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | [Mistral Large 2](https://mistral.ai/) | 108 | | [Qwen2-VL](https://arxiv.org/pdf/2409.12191) | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | [Qwen-2](https://arxiv.org/pdf/2407.10671) | 109 | | [NVLM](https://arxiv.org/pdf/2409.11402) | 2024 | Encoder-decoder| [LAION-115M ](https://laion.ai/blog/laion-5b/) | 8B-24B | Custom ViT | [Qwen-2-Instruct](https://arxiv.org/pdf/2407.10671) | 110 | | [Emu3](https://arxiv.org/pdf/2409.18869) | 2024 | Decoder-only | [Aquila](https://arxiv.org/pdf/2408.07410) | 7B | MoVQGAN | [LLaMA-2](https://arxiv.org/pdf/2307.09288) | 111 | | [Claude 3](https://claude.ai/new) | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed | 112 | | [InternVL](https://arxiv.org/pdf/2312.14238) | 2023 | Encoder-decoder| [LAION-en, LAION- multi](https://laion.ai/blog/laion-5b/) | 7B/20B | Eva CLIP ViT-g | [QLLaMA](https://arxiv.org/pdf/2304.08177) | 113 | | [InstructBLIP](https://arxiv.org/pdf/2305.06500) | 2023 | Encoder-decoder| [CoCo](https://cocodataset.org/#home), [VQAv2](https://huggingface.co/datasets/lmms-lab/VQAv2) | 13B | ViT | [Flan-T5](https://arxiv.org/pdf/2210.11416), [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | 114 | | [CogVLM](https://arxiv.org/pdf/2311.03079) | 2023 | Encoder-decoder| [LAION-2B](https://sisap-challenges.github.io/2024/datasets/) ,[COYO-700M](https://github.com/kakaobrain/coyo-dataset) | 18B | CLIP ViT-L/14 | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | 115 | | [PaLM-E](https://arxiv.org/pdf/2303.03378) | 2023 | Decoder-only | All robots, [WebLI](https://arxiv.org/pdf/2209.06794) | 562B | ViT | [PaLM](https://arxiv.org/pdf/2204.02311) | 116 | | [LLaVA-1.5](https://arxiv.org/pdf/2310.03744) | 2023 | Decoder-only | [COCO](https://cocodataset.org/#home) | 13B | CLIP ViT-L/14 | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | 117 | | [Gemini](https://arxiv.org/pdf/2312.11805) | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed | 118 | | [GPT-4V](https://arxiv.org/pdf/2309.17421) | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed | 119 | | [BLIP-2](https://arxiv.org/pdf/2301.12597) | 2023 | Encoder-decoder| [COCO](https://cocodataset.org/#home), [Visual Genome](https://huggingface.co/datasets/ranjaykrishna/visual_genome) | 7B-13B | ViT-g | [Open Pretrained Transformer 
(OPT)](https://arxiv.org/pdf/2205.01068) | 120 | | [Flamingo](https://arxiv.org/pdf/2204.14198) | 2022 | Decoder-only | [M3W](https://arxiv.org/pdf/2204.14198), [ALIGN](https://huggingface.co/docs/transformers/en/model_doc/align) | 80B | Custom | [Chinchilla](https://arxiv.org/pdf/2203.15556) | 121 | | [BLIP](https://arxiv.org/pdf/2201.12086) | 2022 | Encoder-decoder| [COCO](https://cocodataset.org/#home), [Visual Genome](https://huggingface.co/datasets/ranjaykrishna/visual_genome/) | 223M-400M | ViT-B/L/g | Pretrained from scratch | 122 | | [CLIP](https://arxiv.org/pdf/2103.00020) | 2021 | Encoder-decoder| 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch | 123 | 124 | 125 | 126 | 127 | ## 2. 🗂️ Benchmarks and Evaluation 128 | ### 2.1. Datasets for Training VLMs 129 | | Dataset | Task | Size | 130 | |---------|------|---------------| 131 | | [FineVision](https://huggingface.co/datasets/HuggingFaceM4/FineVision) | Mixed Domain | 24.3 M/4.48TB | 132 | 133 | 134 | 135 | ### 2.2. Datasets and Evaluation for VLM 136 | ### 🧮 Visual Math (+ Visual Math Reasoning) 137 | 138 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 139 | |---------|------|---------------|------------|----------|-------------| 140 | | [MathVision](https://arxiv.org/abs/2402.14804) | Visual Math | MC / Answer Match | Human | 3.04 | [Repo](https://mathllm.github.io/mathvision/) | 141 | | [MathVista](https://arxiv.org/abs/2310.02255) | Visual Math | MC / Answer Match | Human | 6 | [Repo](https://mathvista.github.io) | 142 | | [MathVerse](https://arxiv.org/abs/2403.14624) | Visual Math | MC | Human | 4.6 | [Repo](https://mathverse-cuhk.github.io) | 143 | | [VisNumBench](https://arxiv.org/abs/2503.14939) | Visual Number Reasoning | MC | Python Program generated/Web Collection/Real life photos | 1.91 | [Repo](https://wwwtttjjj.github.io/VisNumBench/) | 144 | 145 | 146 | 147 | ### 🎞️ Video Understanding 148 | 149 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 150 | |---------|------|---------------|------------|----------|-------------| 151 | | [VideoHallu](https://arxiv.org/abs/2505.01481) | Video Understanding | LLM Eval | Human | 3.2 | [Repo](https://github.com/zli12321/VideoHallu) | 152 | | [Video SimpleQA](https://arxiv.org/abs/2503.18923) | Video Understanding | LLM Eval | Human | 2.03 | [Repo](https://videosimpleqa.github.io) | 153 | | [MovieChat](https://arxiv.org/abs/2307.16449) | Video Understanding | LLM Eval | Human | 1 | [Repo](https://rese1f.github.io/MovieChat/) | 154 | | [Perception‑Test](https://arxiv.org/pdf/2305.13786) | Video Understanding | MC | Crowd | 11.6 | [Repo](https://github.com/google-deepmind/perception_test) | 155 | | [VideoMME](https://arxiv.org/pdf/2405.21075) | Video Understanding | MC | Experts | 2.7 | [Site](https://video-mme.github.io/) | 156 | | [EgoSchem](https://arxiv.org/pdf/2308.09126) | Video Understanding | MC | Synth / Human | 5 | [Site](https://egoschema.github.io/) | 157 | | [Inst‑IT‑Bench](https://arxiv.org/abs/2412.03565) | Fine‑grained Image & Video | MC & LLM | Human / Synth | 2 | [Repo](https://github.com/inst-it/inst-it) | 158 | 159 | 160 | ### 💬 Multimodal Conversation 161 | 162 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 163 | |---------|------|---------------|------------|----------|-------------| 164 | | [VisionArena](https://arxiv.org/abs/2412.08687) | Multimodal Conversation | Pairwise Pref | Human | 23 | [Repo](https://huggingface.co/lmarena-ai) | 165 | 166 | 167 | 
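The **Eval Protocol** column in the tables above indicates how each benchmark scores model outputs: `MC` checks the selected option, `Answer Match` compares normalized short answers, `LLM Eval` asks a judge model to grade free-form responses, and `Pairwise Pref` relies on human preference votes. As a rough illustration only (a minimal sketch, not any benchmark's official scorer; the function names and normalization rules below are our own assumptions), the first two protocols reduce to a few lines of Python:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def mc_score(prediction: str, gold_option: str) -> float:
    """MC protocol: credit only the correct option letter, e.g. 'B) ...' -> 'B'."""
    return float(prediction.strip().upper()[:1] == gold_option.strip().upper())


def answer_match_score(prediction: str, references: list[str]) -> float:
    """Answer Match protocol: exact match against any reference after normalization."""
    return float(any(normalize(prediction) == normalize(ref) for ref in references))


if __name__ == "__main__":
    print(mc_score("B) 3.5", "B"))                           # 1.0
    print(answer_match_score("The answer is 42.", ["42"]))   # 0.0 -- free-form outputs usually need answer extraction first
    print(answer_match_score("42", ["42", "forty-two"]))     # 1.0
```

LLM-Eval and pairwise-preference protocols are not sketched here because they depend on an external judge model or human raters rather than string matching.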
168 | ### 🧠 Multimodal General Intelligence 169 | 170 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 171 | |---------|------|---------------|------------|----------|-------------| 172 | | [MMLU](https://arxiv.org/pdf/2009.03300) | General MM | MC | Human | 15.9 | [Repo](https://github.com/hendrycks/test) | 173 | | [MMStar](https://arxiv.org/pdf/2403.20330) | General MM | MC | Human | 1.5 | [Site](https://mmstar-benchmark.github.io/) | 174 | | [NaturalBench](https://arxiv.org/pdf/2410.14669) | General MM | Yes/No, MC | Human | 10 | [HF](https://huggingface.co/datasets/BaiqiL/NaturalBench) | 175 | | [PHYSBENCH](https://arxiv.org/pdf/2501.16411) | Visual Math Reasoning | MC | Grad STEM | 0.10 | [Repo](https://github.com/USC-GVL/PhysBench) | 176 | 177 | 178 | ### 🔎 Visual Reasoning / VQA (+ Multilingual & OCR) 179 | 180 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 181 | |---------|------|---------------|------------|----------|-------------| 182 | | [EMMA](https://arxiv.org/abs/2501.05444) | Visual Reasoning | MC | Human + Synth | 2.8 | [Repo](emma-benchmark.github.io) | 183 | | [MMTBENCH](https://arxiv.org/pdf/2404.16006) | Visual Reasoning & QA | MC | AI Experts | 30.1 | [Repo](https://github.com/tylin/coco-caption) | 184 | | [MM‑Vet](https://arxiv.org/pdf/2308.02490) | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | [Repo](https://github.com/yuweihao/MM-Vet) | 185 | | [MM‑En/CN](https://arxiv.org/pdf/2307.06281) | Multilingual MM Understanding | MC | Human | 3.2 | [Repo](https://github.com/open-compass/VLMEvalKit) | 186 | | [GQA](https://arxiv.org/abs/2305.13245) | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | [Site](https://cs.stanford.edu/people/dorarad/gqa) | 187 | | [VCR](https://arxiv.org/abs/1811.10830) | Visual Reasoning & QA | MC | MTurks | 290 | [Site](https://visualcommonsense.com/) | 188 | | [VQAv2](https://arxiv.org/pdf/1505.00468) | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | [Repo](https://github.com/salesforce/LAVIS/blob/main/dataset_card/vqav2.md) | 189 | | [MMMU](https://arxiv.org/pdf/2311.16502) | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | [Site](https://mmmu-benchmark.github.io/) | 190 | | [MMMU-Pro](https://arxiv.org/abs/2409.02813) | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | [Site](https://mmmu-benchmark.github.io/) | 191 | | [R1‑Onevision](https://arxiv.org/pdf/2503.10615) | Visual Reasoning & QA | MC | Human | 155 | [Repo](https://github.com/Fancy-MLLM/R1-Onevision) | 192 | | [VLM²‑Bench](https://arxiv.org/pdf/2502.12084) | Visual Reasoning & QA | Ans Match, MC | Human | 3 | [Site](https://vlm2-bench.github.io/) | 193 | | [VisualWebInstruct](https://arxiv.org/pdf/2503.10582) | Visual Reasoning & QA | LLM Eval | Web | 0.9 | [Site](https://tiger-ai-lab.github.io/VisualWebInstruct/) | 194 | 195 | 196 | ### 📝 Visual Text / Document Understanding (+ Charts) 197 | 198 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 199 | |---------|------|---------------|------------|----------|-------------| 200 | | [TextVQA](https://arxiv.org/pdf/1904.08920) | Visual Text Understanding | Ans Match | Expert | 28.6 | [Repo](https://github.com/facebookresearch/mmf) | 201 | | [DocVQA](https://arxiv.org/pdf/2007.00398) | Document VQA | Ans Match | Crowd | 50 | [Site](https://www.docvqa.org/) | 202 | | [ChartQA](https://arxiv.org/abs/2203.10244) | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | 
[Repo](https://github.com/vis-nlp/ChartQA) | 203 | 204 | 205 | ### 🌄 Text‑to‑Image Generation 206 | 207 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 208 | |---------|------|---------------|------------|----------|-------------| 209 | | [MSCOCO‑30K](https://arxiv.org/pdf/1405.0312) | Text‑to‑Image | BLEU, ROUGE, Sim | MTurks | 30 | [Site](https://cocodataset.org/#home) | 210 | | [GenAI‑Bench](https://arxiv.org/pdf/2406.13743) | Text‑to‑Image | Human Rating | Human | 80 | [HF](https://huggingface.co/datasets/BaiqiL/GenAI-Bench) | 211 | 212 | 213 | ### 🚨 Hallucination Detection / Control 214 | 215 | | Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site | 216 | |---------|------|---------------|------------|----------|-------------| 217 | | [HallusionBench](https://arxiv.org/pdf/2310.14566) | Hallucination | Yes/No | Human | 1.13 | [Repo](https://github.com/tianyi-lab/HallusionBench) | 218 | | [POPE](https://arxiv.org/pdf/2305.10355) | Hallucination | Yes/No | Human | 9 | [Repo](https://github.com/RUCAIBox/POPE) | 219 | | [CHAIR](https://arxiv.org/pdf/1809.02156) | Hallucination | Yes/No | Human | 124 | [Repo](https://github.com/LisaAnne/Hallucination) | 220 | | [MHalDetect](https://arxiv.org/abs/2308.06394) | Hallucination | Ans Match | Human | 4 | [Repo](https://github.com/LisaAnne/Hallucination) | 221 | | [Hallu‑Pi](https://arxiv.org/abs/2408.01355) | Hallucination | Ans Match | Human | 1.26 | [Repo](https://github.com/NJUNLP/Hallu-PI) | 222 | | [HallE‑Control](https://arxiv.org/abs/2310.01779) | Hallucination | Yes/No | Human | 108 | [Repo](https://github.com/bronyayang/HallE_Control) | 223 | | [AutoHallusion](https://arxiv.org/pdf/2406.10900) | Hallucination | Ans Match | Synth | 3.129 | [Repo](https://github.com/wuxiyang1996/AutoHallusion) | 224 | | [BEAF](https://arxiv.org/abs/2407.13442) | Hallucination | Yes/No | Human | 26 | [Site](https://beafbench.github.io/) | 225 | | [GAIVE](https://arxiv.org/abs/2306.14565) | Hallucination | Ans Match | Synth | 320 | [Repo](https://github.com/FuxiaoLiu/LRV-Instruction) | 226 | | [HalEval](https://arxiv.org/abs/2402.15721) | Hallucination | Yes/No | Crowd / Synth | 2 | [Repo](https://github.com/WisdomShell/hal-eval) | 227 | | [AMBER](https://arxiv.org/abs/2311.07397) | Hallucination | Ans Match | Human | 15.22 | [Repo](https://github.com/junyangwang0410/AMBER) | 228 | 229 | 230 | ### 2.3. 
Benchmark Datasets, Simulators, and Generative Models for Embodied VLM 231 | | Benchmark | Domain | Type | Project | 232 | |-----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------:|:----------------------------------:|:----------------------------------------------------------------------------------------------------------------------:| 233 | | [Drive-Bench](https://arxiv.org/abs/2501.04003) | Embodied AI | Autonomous Driving | [Website](https://drive-bench.github.io) | 234 | | [Habitat](https://arxiv.org/pdf/1904.01201), [Habitat 2.0](https://arxiv.org/pdf/2106.14405), [Habitat 3.0](https://arxiv.org/pdf/2310.13724) | Robotics (Navigation) | Simulator + Dataset | [Website](https://aihabitat.org/) | 235 | | [Gibson](https://arxiv.org/pdf/1808.10654) | Robotics (Navigation) | Simulator + Dataset | [Website](http://gibsonenv.stanford.edu/), [Github Repo](https://github.com/StanfordVL/GibsonEnv) | 236 | | [iGibson1.0](https://arxiv.org/pdf/2012.02924), [iGibson2.0](https://arxiv.org/pdf/2108.03272) | Robotics (Navigation) | Simulator + Dataset | [Website](https://svl.stanford.edu/igibson/), [Document](https://stanfordvl.github.io/iGibson/) | 237 | | [Isaac Gym](https://arxiv.org/pdf/2108.10470) | Robotics (Navigation) | Simulator | [Website](https://developer.nvidia.com/isaac-gym), [Github Repo](https://github.com/isaac-sim/IsaacGymEnvs) | 238 | | [Isaac Lab](https://arxiv.org/pdf/2301.04195) | Robotics (Navigation) | Simulator | [Website](https://isaac-sim.github.io/IsaacLab/main/index.html), [Github Repo](https://github.com/isaac-sim/IsaacLab) | 239 | | [AI2THOR](https://arxiv.org/abs/1712.05474) | Robotics (Navigation) | Simulator | [Website](https://ai2thor.allenai.org/), [Github Repo](https://github.com/allenai/ai2thor) | 240 | | [ProcTHOR](https://arxiv.org/abs/2206.06994) | Robotics (Navigation) | Simulator + Dataset | [Website](https://procthor.allenai.org/), [Github Repo](https://github.com/allenai/procthor) | 241 | | [VirtualHome](https://arxiv.org/abs/1806.07011) | Robotics (Navigation) | Simulator | [Website](http://virtual-home.org/), [Github Repo](https://github.com/xavierpuigf/virtualhome) | 242 | | [ThreeDWorld](https://arxiv.org/abs/2007.04954) | Robotics (Navigation) | Simulator | [Website](https://www.threedworld.org/), [Github Repo](https://github.com/threedworld-mit/tdw) | 243 | | [VIMA-Bench](https://arxiv.org/pdf/2210.03094) | Robotics (Manipulation) | Simulator | [Website](https://vimalabs.github.io/), [Github Repo](https://github.com/vimalabs/VIMA) | 244 | | [VLMbench](https://arxiv.org/pdf/2206.08522) | Robotics (Manipulation) | Simulator | [Github Repo](https://github.com/eric-ai-lab/VLMbench) | 245 | | [CALVIN](https://arxiv.org/pdf/2112.03227) | Robotics (Manipulation) | Simulator | [Website](http://calvin.cs.uni-freiburg.de/), [Github Repo](https://github.com/mees/calvin) | 246 | | [GemBench](https://arxiv.org/pdf/2410.01345) | Robotics (Manipulation) | Simulator | [Website](https://www.di.ens.fr/willow/research/gembench/), [Github Repo](https://github.com/vlc-robot/robot-3dlotus/) | 247 | | [WebArena](https://arxiv.org/pdf/2307.13854) | Web Agent | Simulator | [Website](https://webarena.dev/), [Github Repo](https://github.com/web-arena-x/webarena) | 248 | | [UniSim](https://openreview.net/pdf?id=sFyTZEqmUY) | Robotics (Manipulation) | Generative Model, World Model | [Website](https://universal-simulator.github.io/unisim/) | 249 | | 
[GAIA-1](https://arxiv.org/pdf/2309.17080) | Robotics (Autonomous Driving) | Generative Model, World Model | [Website](https://wayve.ai/thinking/introducing-gaia1/) | 250 | | [LWM](https://arxiv.org/pdf/2402.08268) | Embodied AI | Generative Model, World Model | [Website](https://largeworldmodel.github.io/lwm/), [Github Repo](https://github.com/LargeWorldModel/LWM) | 251 | | [Genesis](https://github.com/Genesis-Embodied-AI/Genesis) | Embodied AI | Generative Model, World Model | [Github Repo](https://github.com/Genesis-Embodied-AI/Genesis) | 252 | | [EMMOE](https://arxiv.org/pdf/2503.08604) | Embodied AI | Generative Model, World Model | [Paper](https://arxiv.org/pdf/2503.08604) | 253 | | [RoboGen](https://arxiv.org/pdf/2311.01455) | Embodied AI | Generative Model, World Model | [Website](https://robogen-ai.github.io/) | 254 | | [UnrealZoo](https://arxiv.org/abs/2412.20977) | Embodied AI (Tracking, Navigation, Multi Agent)| Simulator | [Website](http://unrealzoo.site/) | 255 | 256 | 257 | ## 3. 🔥 Post-Training 258 | ### 3.1. RL Alignment for VLM 259 | | Title | Year | Paper | RL | Code | 260 | |----------------|------|--------|---------|------| 261 | | Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | [Paper](https://arxiv.org/abs/2505.13886) | GRPO | - | 262 | | Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | [Paper](https://www.arxiv.org/abs/2509.25541) | GRPO | - | 263 | | Vision-SR1: Self-Rewarding Vision-Language Model via Reasoning Decomposition | 08/26/2025 | [Paper](https://arxiv.org/abs/2508.19652) | GRPO | - | 264 | | Group Sequence Policy Optimization | 06/24/2025 | [Paper](https://www.arxiv.org/abs/2507.18071) | GSPO | - | 265 | | Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | [Paper](https://arxiv.org/abs/2505.14677) | GRPO | - | 266 | | VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 04/10/2025 | [Paper](https://arxiv.org/abs/2504.06958) | GRPO | [Code](https://github.com/OpenGVLab/VideoChat-R1) | 267 | | OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 03/21/2025 | [Paper](https://arxiv.org/abs/2503.17352) | GRPO | [Code](https://github.com/yihedeng9/OpenVLThinker) | 268 | | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 03/10/2025 | [Paper](https://arxiv.org/abs/2503.07065) | GRPO | [Code](https://github.com/ding523/Curr_REFT) | 269 | | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | [Paper](https://arxiv.org/abs/2502.18411) | DPO | [Code](https://github.com/PhoenixZ810/OmniAlign-V) | 270 | | Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | [Code](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | 271 | | R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | [Paper](https://arxiv.org/abs/2503.12937) | GRPO | [Code](https://github.com/jingyi0000/R1-VL) | 272 | | Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | [Code](https://github.com/0russwest0/Agent-R1) | 273 | | MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | [Paper](https://arxiv.org/abs/2503.07365) | [REINFORCE Leave-One-Out (RLOO)](https://openreview.net/pdf?id=r1lgTGL5DE) | 
[Code](https://github.com/ModalMinds/MM-EUREKA) | 274 | | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | [Paper](https://arxiv.org/abs/2502.10391) | DPO | [Code](https://github.com/Kwai-YuanQi/MM-RLHF) | 275 | | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | [Paper](https://arxiv.org/pdf/2503.07536) | PPO | [Code](https://github.com/TideDra/lmm-r1) | 276 | | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | [Paper](https://arxiv.org/pdf/2503.06749) | GRPO | [Code](https://github.com/Osilly/Vision-R1) | 277 | | Unified Reward Model for Multimodal Understanding and Generation | 2025 | [Paper](https://arxiv.org/abs/2503.05236) | DPO | [Code](https://github.com/CodeGoat24/UnifiedReward) | 278 | | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | [Paper](https://arxiv.org/pdf/2501.13926) | DPO | [Code](https://github.com/ZiyuGuo99/Image-Generation-CoT) | 279 | | All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | [Paper](https://arxiv.org/pdf/2503.01067) | Online RL | - | 280 | | Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | [Paper](https://arxiv.org/abs/2503.21776) | GRPO | [Code](https://github.com/tulerfeng/Video-R1) | 281 | 282 | ### 3.2. Finetuning for VLM 283 | | Title | Year | Paper | Website | Code | 284 | |----------------|------|--------|---------|------| 285 | | Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 2025/04/21 | [Paper](https://arxiv.org/abs/2504.15271) | [Website](https://nvlabs.github.io/EAGLE/) | [Code](https://github.com/NVlabs/EAGLE) | 286 | | OMNICAPTIONER: One Captioner to Rule Them All | 2025/04/09 | [Paper](https://arxiv.org/abs/2504.07089) | [Website](https://alpha-innovator.github.io/OmniCaptioner-project-page/) | [Code](https://github.com/Alpha-Innovator/OmniCaptioner) | 287 | | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | [Paper](https://arxiv.org/abs/2412.03565) | [Website](https://github.com/Alpha-Innovator/OmniCaptioner) | [Code](https://github.com/inst-it/inst-it) | 288 | | LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | [Paper](https://arxiv.org/pdf/2406.20092) | [Website](https://beckschen.github.io/llavolta.html) | [Code](https://github.com/Beckschen/LLaVolta) | 289 | | ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | [Paper](https://arxiv.org/pdf/2404.02132) | [Website](https://beckschen.github.io/vitamin.html) | [Code](https://github.com/Beckschen/ViTamin) | 290 | | Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | [Paper](https://arxiv.org/pdf/2412.04729) | - | - | 291 | | Should VLMs be Pre-trained with Image Data? | 2025 | [Paper](https://arxiv.org/pdf/2503.07603) | - | - | 292 | | VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | [Paper](https://arxiv.org/pdf/2412.08687) | - | [Code](https://huggingface.co/lmarena-ai) | 293 | 294 | ### 3.3. 
VLM Alignment GitHub 295 | | Project | Repository Link | 296 | |----------------|----------------| 297 | |Verl|[🔗 GitHub](https://github.com/volcengine/verl) | 298 | |EasyR1|[🔗 GitHub](https://github.com/hiyouga/EasyR1) | 299 | |OpenR1|[🔗 GitHub](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | 300 | | LLaMAFactory | [🔗 GitHub](https://github.com/hiyouga/LLaMA-Factory) | 301 | | MM-Eureka-Zero | [🔗 GitHub](https://github.com/ModalMinds/MM-EUREKA/tree/main) | 302 | | MM-RLHF | [🔗 GitHub](https://github.com/Kwai-YuanQi/MM-RLHF) | 303 | | LMM-R1 | [🔗 GitHub](https://github.com/TideDra/lmm-r1) | 304 | 305 | ### 3.4. Prompt Optimization 306 | | Title | Year | Paper | Website | Code | 307 | |----------------|------|--------|---------|------| 308 | | In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 2025/04/30 | [Paper](https://arxiv.org/abs/2504.20690) | [Website](https://river-zhang.github.io/ICEdit-gh-pages/) | [Code](https://github.com/River-Zhang/ICEdit) | 309 | 310 | ## 4. ⚒️ Applications 311 | 312 | ### 4.1. Embodied VLM Agents 313 | 314 | | Title | Year | Paper | Website | Code | 315 | |----------------|------|--------|---------|------| 316 | | Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | [Paper](https://arxiv.org/pdf/2407.06886v1) | - | - | 317 | | ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | [Paper](https://arxiv.org/pdf/2402.04615) | - | - | 318 | | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | [Paper](https://arxiv.org/pdf/2311.16483) | - | - | 319 | | SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.19242) | - | - | 320 | | Training a Vision Language Model as Smartphone Assistant | 2024 | [Paper](https://arxiv.org/pdf/2404.08755) | - | - | 321 | | ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | [Paper](https://arxiv.org/pdf/2402.07945) | - | - | 322 | | Embodied Vision-Language Programmer from Environmental Feedback | 2024 | [Paper](https://arxiv.org/pdf/2310.08588) | - | - | 323 | | VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | [📄 Paper](https://arxiv.org/abs/2503.05383) | - | [💾 Code](https://github.com/camel-ai/VLM-Play-StarCraft2) | 324 | | MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.14021) | - | [💾 Code](https://github.com/BigTaige/MP-GUI) | 325 | 326 | 327 | ### 4.2. Generative Visual Media Applications 328 | | Title | Year | Paper | Website | Code | 329 | |----------------|------|--------|---------|------| 330 | | GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | [📄 Paper](https://arxiv.org/pdf/2311.12631) | [🌍 Website](https://gpt4motion.github.io/) | [💾 Code](https://github.com/jiaxilv/GPT4Motion) | 331 | | Spurious Correlation in Multimodal LLMs | 2025 | [📄 Paper](https://arxiv.org/abs/2503.08884) | - | - | 332 | | WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.01115) | - | [💾 Code](https://github.com/hzphzp/WeGen) | 333 | | VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.13444) | [🌍 Website](https://videomind.github.io/) | [💾 Code](https://github.com/yeliudev/VideoMind) | 334 | 335 | ### 4.3. 
Robotics and Embodied AI 336 | | Title | Year | Paper | Website | Code | 337 | |----------------|------|--------|---------|------| 338 | | AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.00371) | [🌍 Website](https://aha-vlm.github.io/) | - | 339 | | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | [📄 Paper](https://arxiv.org/pdf/2401.12168) | [🌍 Website](https://spatial-vlm.github.io/) | - | 340 | | Vision-language model-driven scene understanding and robotic object manipulation | 2024 | [📄 Paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10711845&casa_token=to4vCckCewMAAAAA:2ykeIrubUOxwJ1rhwwakorQFAwUUBQhL_Ct7dnYBceWU5qYXiCoJp_yQkmJbmtiEVuX2jcpvB92n&tag=1) | - | - | 341 | | Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.02193) | [🌍 Website](https://zt-yang.github.io/vlm-tamp-robot/) | - | 342 | | AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | [📄 Paper](https://arxiv.org/pdf/2306.06531) | [🌍 Website](https://yongchao98.github.io/MIT-REALM-AutoTAMP/) | - | 343 | | VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.08792) | - | - | 344 | | Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | [📄 Paper](https://arxiv.org/pdf/2309.15943) | [🌍 Website](https://yongchao98.github.io/MIT-REALM-Multi-Robot/) | - | 345 | | DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2411.09022) | [🌍 Website](https://wyd0817.github.io/project-dart-llm/) | - | 346 | | MotionGPT: Human Motion as a Foreign Language | 2023 | [📄 Paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/3fbf0c1ea0716c03dea93bb6be78dd6f-Paper-Conference.pdf) | - | [💾 Code](https://github.com/OpenMotionLab/MotionGPT) | 347 | | Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | [📄 Paper](https://arxiv.org/pdf/2405.07162) | - | - | 348 | | Language to Rewards for Robotic Skill Synthesis | 2023 | [📄 Paper](https://language-to-reward.github.io/assets/l2r.pdf) | [🌍 Website](https://language-to-reward.github.io/) | - | 349 | | Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.12931) | [🌍 Website](https://eureka-research.github.io/) | - | 350 | | Integrated Task and Motion Planning | 2020 | [📄 Paper](https://arxiv.org/pdf/2010.01083) | - | - | 351 | | Jailbreaking LLM-Controlled Robots | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.13691) | [🌍 Website](https://robopair.org/) | - | 352 | | Robots Enact Malignant Stereotypes | 2022 | [📄 Paper](https://arxiv.org/pdf/2207.11569) | [🌍 Website](https://sites.google.com/view/robots-enact-stereotypes) | - | 353 | | LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.08824) | - | - | 354 | | Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.10340) | [🌍 Website](https://wuxiyang1996.github.io/adversary-vlm-robotics/) | - | 355 | | EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | [📄 
Paper](https://arxiv.org/pdf/2502.09560) | [🌍 Website](https://embodiedbench.github.io/) | [💾 Code & Dataset](https://github.com/EmbodiedBench/EmbodiedBench) | 356 | | Gemini Robotics: Bringing AI into the Physical World | 2025 | [📄 Technical Report](https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf) | [🌍 Website](https://deepmind.google/technologies/gemini-robotics/) | - | 357 | | GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.06158) | [🌍 Website](https://gr2-manipulation.github.io/) | - | 358 | | Magma: A Foundation Model for Multimodal AI Agents | 2025 | [📄 Paper](https://arxiv.org/pdf/2502.13130) | [🌍 Website](https://microsoft.github.io/Magma/) | [💾 Code](https://github.com/microsoft/Magma) | 359 | | DayDreamer: World Models for Physical Robot Learning | 2022 | [📄 Paper](https://arxiv.org/pdf/2206.14176)| [🌍 Website](https://danijar.com/project/daydreamer/) | [💾 Code](https://github.com/danijar/daydreamer) | 360 | | Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | [📄 Paper](https://arxiv.org/pdf/2206.14176)| - | - | 361 | | RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.03681)| [🌍 Website](https://rlvlmf2024.github.io/) | [💾 Code](https://github.com/yufeiwang63/RL-VLM-F) | 362 | | KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.14066)| [🌍 Website](https://kalie-vlm.github.io/) | [💾 Code](https://github.com/gractang/kalie) | 363 | | Unified Video Action Model | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.00200)| [🌍 Website](https://unified-video-action-model.github.io/) | [💾 Code](https://github.com/ShuangLI59/unified_video_action) | 364 | | HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | [📄 Paper](https://arxiv.org/abs/2503.10631)| [🌍 Website](https://hybrid-vla.github.io/) | [💾 Code](https://github.com/PKU-HMI-Lab/Hybrid-VLA) | 365 | 366 | #### 4.3.1. 
Manipulation 367 | | Title | Year | Paper | Website | Code | 368 | |----------------|------|--------|---------|------| 369 | | VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | [📄 Paper](https://arxiv.org/pdf/2210.03094) | [🌍 Website](https://vimalabs.github.io/) | 370 | | Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | [📄 Paper](https://arxiv.org/pdf/2305.11176) | - | - | 371 | | Creative Robot Tool Use with Large Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.13065) | [🌍 Website](https://creative-robotool.github.io/) | - | 372 | | RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | [📄 Paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10610216) | - | - | 373 | | RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | [📄 Paper](https://robotics-transformer1.github.io/assets/rt1.pdf) | [🌍 Website](https://robotics-transformer1.github.io/) | - | 374 | | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | [📄 Paper](https://arxiv.org/pdf/2307.15818) | [🌍 Website](https://robotics-transformer2.github.io/) | - | 375 | | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.08864) | [🌍 Website](https://robotics-transformer-x.github.io/) | - | 376 | | ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.09583) | [🌍 Website](https://explorllm.github.io/) | - | 377 | | AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | [📄 Paper](https://arxiv.org/pdf/2502.12191) | [🌍 Website](https://gewu-lab.github.io/AnyTouch/) | [💾 Code](https://github.com/GeWu-Lab/AnyTouch) | 378 | | Masked World Models for Visual Control | 2022 | [📄 Paper](https://arxiv.org/pdf/2206.14244)| [🌍 Website](https://sites.google.com/view/mwm-rl) | [💾 Code](https://github.com/younggyoseo/MWM) | 379 | | Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | [📄 Paper](https://arxiv.org/pdf/2302.02408)| [🌍 Website](https://sites.google.com/view/mv-mwm) | [💾 Code](https://github.com/younggyoseo/MV-MWM) | 380 | 381 | 382 | #### 4.3.2. 
Navigation 383 | | Title | Year | Paper | Website | Code | 384 | |----------------|------|--------|---------|------| 385 | | ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | [📄 Paper](https://arxiv.org/pdf/2206.12403) | - | - | 386 | | LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | [📄 Paper](https://arxiv.org/pdf/2405.05363) | - | - | 387 | | LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | [📄 Paper](https://arxiv.org/pdf/2207.04429) | [🌍 Website](https://sites.google.com/view/lmnav) | - | 388 | | NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2022 | [📄 Paper](https://arxiv.org/pdf/2412.04453) | [🌍 Website](https://navila-bot.github.io/) | - | 389 | | VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | [📄 Paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10610712&casa_token=qvFCSt20n0MAAAAA:MSC4P7bdlfQuMRFrmIl706B-G8ejcxH9ZKROKETL1IUZIW7m_W4hKW-kWrxw-F8nykoysw3WYHnd) | - | - | 390 | | Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.10103) | [🌍 Website](https://sites.google.com/view/lfg-nav/) | - | 391 | | Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.09820) | - | - | 392 | | Navigation World Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2412.03572) | [🌍 Website](https://www.amirbar.net/nwm/) | - | 393 | 394 | 395 | #### 4.3.3. Human-robot Interaction 396 | | Title | Year | Paper | Website | Code | 397 | |----------------|------|--------|---------|------| 398 | | MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | [📄 Paper](https://arxiv.org/pdf/2309.14320) | [🌍 Website](https://ut-austin-rpl.github.io/MUTEX/) | - | 399 | | LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | [📄 Paper](https://arxiv.org/pdf/2401.15174) | [🌍 Website](https://hri-eu.github.io/Lami/) | - | 400 | | VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2404.00210) | - | - | 401 | 402 | #### 4.3.4. Autonomous Driving 403 | | Title | Year | Paper | Website | Code | 404 | |----------------|------|--------|---------|------| 405 | | Are VLMs Ready for Autonomous Driving? 
An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | [📄 Paper](https://arxiv.org/abs/2501.04003) | [🌍 Website](drive-bench.github.io) | - | 406 | | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | [📄 Paper](https://arxiv.org/abs/2402.12289) | [🌍 Website](https://tsinghua-mars-lab.github.io/DriveVLM/) | - | 407 | | GPT-Driver: Learning to Drive with GPT | 2023 | [📄 Paper](https://arxiv.org/abs/2310.01415) | - | - | 408 | | LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | [📄 Paper](https://arxiv.org/abs/2310.03026) | [🌍 Website](https://sites.google.com/view/llm-mpc) | - | 409 | | Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | [📄 Paper](https://arxiv.org/abs/2310.01957) | - | - | 410 | | Referring Multi-Object Tracking | 2023 | [📄 Paper](https://arxiv.org/pdf/2303.03366) | - | [💾 Code](https://github.com/wudongming97/RMOT) | 411 | | VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | [📄 Paper](https://arxiv.org/pdf/2304.03135) | - | [💾 Code](https://github.com/lmy98129/VLPD) | 412 | | MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | [📄 Paper](https://arxiv.org/pdf/2309.16534) | - | - | 413 | | DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | [📄 Paper](https://arxiv.org/abs/2309.16292) | [🌍 Website](https://pjlab-adg.github.io/DiLu/) | - | 414 | | VLP: Vision Language Planning for Autonomous Driving | 2024 | [📄 Paper](https://arxiv.org/pdf/2401.05577) | - | - | 415 | | DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | [📄 Paper](https://arxiv.org/abs/2310.01412) | - | - | 416 | 417 | 418 | ### 4.4. Human-Centered AI 419 | | Title | Year | Paper | Website | Code | 420 | |----------------|------|--------|---------|------| 421 | | DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | [📄 Paper](https://arxiv.org/pdf/2412.12225) | - | [💾 Code](https://github.com/pwang322/DLF) | 422 | | LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | [📄 Paper](https://arxiv.org/abs/2406.13787) | - | - | 423 | | Pretrained Language Models as Visual Planners for Human Assistance | 2023 | [📄 Paper](https://arxiv.org/pdf/2304.09179) | - | - | 424 | | Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | [📄 Paper](https://arxiv.org/pdf/2405.08668) | - | - | 425 | | Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | [📄 Paper](https://arxiv.org/pdf/2312.05468) | - | - | 426 | 427 | #### 4.4.1. 
Web Agent 428 | | Title | Year | Paper | Website | Code | 429 | |----------------|------|--------|---------|------| 430 | | A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | [📄 Paper](https://arxiv.org/pdf/2307.12856) | - | - | 431 | | CogAgent: A Visual Language Model for GUI Agents | 2023 | [📄 Paper](https://arxiv.org/pdf/2312.08914) | - | [💾 Code](https://github.com/THUDM/CogAgent) | 432 | | WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2401.13919) | - | [💾 Code](https://github.com/MinorJerry/WebVoyager) | 433 | | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | [📄 Paper](https://arxiv.org/pdf/2411.17465) | - | [💾 Code](https://github.com/showlab/ShowUI) | 434 | | ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.07945) | - | [💾 Code](https://github.com/niuzaisheng/ScreenAgent) | 435 | | Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.13232) | - | [💾 Code](https://huggingface.co/papers/2410.13232) | 436 | 437 | 438 | #### 4.4.2. Accessibility 439 | | Title | Year | Paper | Website | Code | 440 | |----------------|------|--------|---------|------| 441 | | X-World: Accessibility, Vision, and Autonomy Meet | 2021 | [📄 Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhang_X-World_Accessibility_Vision_and_Autonomy_Meet_ICCV_2021_paper.pdf) | - | - | 442 | | Context-Aware Image Descriptions for Web Accessibility | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.03054) | - | - | 443 | | Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | [📄 Paper](https://dl.acm.org/doi/10.1145/3691573.3691619) | - | - 444 | 445 | 446 | #### 4.4.3. Healthcare 447 | | Title | Year | Paper | Website | Code | 448 | |----------------|------|--------|---------|------| 449 | | VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | [📄 Paper](https://arxiv.org/pdf/2408.02865) | - | [💾 Code](https://github.com/HUANGLIZI/VisionUnite) | 450 | | Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.14252) | - | - | 451 | | M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | [📄 Paper](https://arxiv.org/pdf/2307.08347) | - | - | 452 | | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | [📄 Paper](https://arxiv.org/pdf/2210.10163) | - | [💾 Code](https://github.com/RyanWangZf/MedCLIP) | 453 | | Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | [📄 Paper](https://arxiv.org/pdf/2307.15189) | - | [💾 Code](https://github.com/snap-stanford/med-flamingo) | 454 | 455 | 456 | #### 4.4.4. 
Social Goodness 457 | | Title | Year | Paper | Website | Code | 458 | |----------------|------|--------|---------|------| 459 | | Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | [📄 Paper](https://www.sciencedirect.com/science/article/pii/S2666920X24000985) | - | - | 460 | | Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.15701) | - | - | 461 | | Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.19679) | - | - | 462 | | A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | [📄 Paper](https://www.frontiersin.org/journals/environmental-science/articles/10.3389/fenvs.2024.1515752/abstract) | - | - | 463 | | Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.09043) | - | [💾 Code](https://github.com/Namkyeong/AMOLE) | 464 | | DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | [📄 Paper](https://openreview.net/pdf?id=0vQYvcinij) | - | - | 465 | | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.00147) | - | [💾 Code](https://github.com/pengshuai-rin/MultiMath) | 466 | | Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.09838) | - | [💾 Code](https://github.com/AlexJJJChen/Climate-Zoo) | 467 | | He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | [📄 Paper](https://aclanthology.org/2021.findings-acl.397.pdf) | - | - | 468 | | UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.168318) | - | - | 469 | 470 | 471 | ## 5. 
Challenges 472 | ### 5.1 Hallucination 473 | | Title | Year | Paper | Website | Code | 474 | |----------------|------|--------|---------|------| 475 | | Object Hallucination in Image Captioning | 2018 | [📄 Paper](https://arxiv.org/pdf/1809.02156) | - | - | 476 | | Evaluating Object Hallucination in Large Vision-Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2305.10355) | - | [💾 Code](https://github.com/RUCAIBox/POPE) | 477 | | Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2308.06394) | - | - | 478 | | HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.01779) | - | [💾 Code](https://github.com/bronyayang/HallE_Control) | 479 | | Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | [📄 Paper](https://arxiv.org/pdf/2408.01355) | - | [💾 Code](https://github.com/NJUNLP/Hallu-PI) | 480 | | BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.13442) | [🌍 Website](https://beafbench.github.io/) | - | 481 | | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.14566) | - | [💾 Code](https://github.com/tianyi-lab/HallusionBench) | 482 | | AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.10900) | [🌍 Website](https://wuxiyang1996.github.io/autohallusion_page/) | - | 483 | | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | [📄 Paper](https://arxiv.org/pdf/2306.14565) | - | [💾 Code](https://github.com/FuxiaoLiu/LRV-Instruction) | 484 | | Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.15721) | - | [💾 Code](https://github.com/WisdomShell/hal-eval) | 485 | | AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | [📄 Paper](https://arxiv.org/pdf/2311.07397) | - | [💾 Code](https://github.com/junyangwang0410/AMBER) | 486 | 487 | 488 | ### 5.2 Safety 489 | | Title | Year | Paper | Website | Code | 490 | |----------------|------|--------|---------|------| 491 | | JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.01599) | [🌍 Website](https://chonghan-chen.com/llm-jailbreak-zoo-survey/) | - | 492 | | Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | [📄 Paper](https://arxiv.org/pdf/2311.02817) | - | - | 493 | | SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.18927) | - | - | 494 | | JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | [📄 Paper](https://arxiv.org/pdf/2404.03027) | - | - | 495 | | SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.04178) | - | [💾 Code](https://github.com/laiyingxin2/SHIELD) | 496 | | Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large 
Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.09792) | - | - | 497 | | Jailbreaking Attack against Multimodal Large Language Model | 2024 | [📄 Paper](https://arxiv.org/pdf/2402.02309) | - | - | 498 | | Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | [📄 Paper](https://arxiv.org/pdf/2411.18676) | [🌍 Website](https://s-karnik.github.io/embodied-red-team-project-page/) | [💾 Code](https://github.com/Improbable-AI/embodied-red-teaming) | 499 | | Safety Guardrails for LLM-Enabled Robots | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.07885) | - | - | 500 | 501 | 502 | ### 5.3 Fairness 503 | | Title | Year | Paper | Website | Code | 504 | |----------------|------|--------|---------|------| 505 | | Hallucination of Multimodal Large Language Models: A Survey | 2024 | [📄 Paper](https://arxiv.org/pdf/2404.18930) | - | - | 506 | | Bias and Fairness in Large Language Models: A Survey | 2023 | [📄 Paper](https://arxiv.org/pdf/2309.00770) | - | - | 507 | | Fairness and Bias in Multimodal AI: A Survey | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.19097) | - | - | 508 | | Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models | 2023 | [📄 Paper](http://gerard.demelo.org/papers/multimodal-bias.pdf) | - | - | 509 | | FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.01089) | - | - | 510 | | FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.19949) | - | - | 511 | | FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.00983) | - | - | 512 | | Benchmarking Vision Language Models for Cultural Understanding | 2024 | [📄 Paper](https://arxiv.org/pdf/2407.10920) | - | - | 513 | 514 | ### 5.4 Alignment 515 | #### 5.4.1 Multi-modality Alignment 516 | | Title | Year | Paper | Website | Code | 517 | |----------------|------|--------|---------|------| 518 | | Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.18715) | - | - | 519 | | Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | [📄 Paper](https://arxiv.org/pdf/2405.15973) | - | - | 520 | | Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2412.04616) | [🌍 Website](https://lezhang7.github.io/sail.github.io/) | - | 521 | | Extending Multi-modal Contrastive Representations | 2023 | [📄 Paper](https://arxiv.org/pdf/2310.08884) | - | [💾 Code](https://github.com/MCR-PEFT/Ex-MCR) | 522 | | OneLLM: One Framework to Align All Modalities with Language | 2023 | [📄 Paper](https://arxiv.org/pdf/2312.03700) | - | [💾 Code](https://github.com/csuhan/OneLLM) | 523 | | What You See is What You Read? 
Improving Text-Image Alignment Evaluation | 2023 | [📄 Paper](https://arxiv.org/pdf/2305.10400) | [🌍 Website](https://wysiwyr-itm.github.io/) | [💾 Code](https://github.com/yonatanbitton/wysiwyr) | 524 | | Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | [📄 Paper](https://arxiv.org/pdf/2411.18203) | [🌍 Website](https://huggingface.co/papers/2411.18203) | [💾 Code](https://github.com/kyrieLei/Critic-V) | 525 | 526 | #### 5.4.2 Commonsense and Physics Alignment 527 | | Title | Year | Paper | Website | Code | 528 | |----------------|------|--------|---------|------| 529 | | VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2311.17982) | [🌍 Website](https://vchitect.github.io/VBench-project/) | [💾 Code](https://github.com/Vchitect/VBench) | 530 | | VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | [📄 Paper](https://arxiv.org/pdf/2411.13503) | [🌍 Website](https://vchitect.github.io/VBench-project/) | [💾 Code](https://github.com/Vchitect/VBench) | 531 | | PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | [📄 Paper](https://arxiv.org/pdf/2501.16411) | [🌍 Website](https://physbench.github.io/) | [💾 Code](https://github.com/USC-GVL/PhysBench) | 532 | | VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.03520) | [🌍 Website](https://videophy.github.io/) | [💾 Code](https://github.com/Hritikbansal/videophy) | 533 | | WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.18072) | [🌍 Website](https://iranqin.github.io/WorldSimBench.github.io/) | - | 534 | | WorldModelBench: Judging Video Generation Models As World Models | 2025 | [📄 Paper](https://arxiv.org/pdf/2502.20694) | [🌍 Website](https://worldmodelbench-team.github.io/) | [💾 Code](https://github.com/WorldModelBench-Team/WorldModelBench/tree/main?tab=readme-ov-file) | 535 | | VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | [📄 Paper](https://arxiv.org/pdf/2406.15252) | [🌍 Website](https://tiger-ai-lab.github.io/VideoScore/) | [💾 Code](https://github.com/TIGER-AI-Lab/VideoScore) | 536 | | WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.07265) | - | [💾 Code](https://github.com/PKU-YuanGroup/WISE) | 537 | | Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | [📄 Paper](https://arxiv.org/pdf/2502.04076) | - | [💾 Code](https://github.com/littlespray/CRAVE) | 538 | | Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.06287) | - | - | 539 | | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | [📄 Paper](https://arxiv.org/pdf/2401.12168) | [🌍 Website](https://spatial-vlm.github.io/) | [💾 Code](https://github.com/remyxai/VQASynth) | 540 | | Do generative video models understand physical principles? 
| 2025 | [📄 Paper](https://arxiv.org/pdf/2501.09038) | [🌍 Website](https://physics-iq.github.io/) | [💾 Code](https://github.com/google-deepmind/physics-IQ-benchmark) | 541 | | PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.18964) | [🌍 Website](https://stevenlsw.github.io/physgen/) | [💾 Code](https://github.com/stevenlsw/physgen) | 542 | | How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | [📄 Paper](https://arxiv.org/pdf/2411.02385) | [🌍 Website](https://phyworld.github.io/) | [💾 Code](https://github.com/phyworld/phyworld) | 543 | | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | 2025 | [📄 Paper](https://arxiv.org/abs/2501.07542) | - | - | 544 | | VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | 2025 | [📄 Paper](https://arxiv.org/pdf/2503.21755) | [🌍 Website](https://vchitect.github.io/VBench-2.0-project/) | [💾 Code](https://github.com/Vchitect/VBench) | 545 | 546 | ### 5.5 Efficient Training and Fine-Tuning 547 | | Title | Year | Paper | Website | Code | 548 | |----------------|------|--------|---------|------| 549 | | VILA: On Pre-training for Visual Language Models | 2023 | [📄 Paper](https://arxiv.org/pdf/2312.07533) | - | - | 550 | | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | [📄 Paper](https://arxiv.org/pdf/2108.10904) | - | - | 551 | | LoRA: Low-Rank Adaptation of Large Language Models | 2021 | [📄 Paper](https://arxiv.org/pdf/2106.09685) | - | [💾 Code](https://github.com/microsoft/LoRA) | 552 | | QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | [📄 Paper](https://arxiv.org/pdf/2305.14314) | - | - | 553 | | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | [📄 Paper](https://arxiv.org/pdf/2204.05862) | - | [💾 Code](https://github.com/anthropics/hh-rlhf) | 554 | | RLAIF vs. 
RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | [📄 Paper](https://arxiv.org/pdf/2309.00267) | - | - | 555 | 556 | 557 | ### 5.6 Scarcity of High-quality Datasets 558 | | Title | Year | Paper | Website | Code | 559 | |----------------|------|--------|---------|------| 560 | | A Survey on Bridging VLMs and Synthetic Data | 2025 | [📄 Paper](https://openreview.net/pdf?id=ThjDCZOljE) | - | [💾 Code](https://github.com/mghiasvand1/Awesome-VLM-Synthetic-Data/) | 561 | | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | [📄 Paper](https://arxiv.org/abs/2412.03565) | [🌍 Website](https://inst-it.github.io/) | [💾 Code](https://github.com/inst-it/inst-it) | 562 | | SLIP: Self-supervision meets Language-Image Pre-training | 2021 | [📄 Paper](https://arxiv.org/pdf/2112.12750) | - | [💾 Code](https://github.com/facebookresearch/SLIP) | 563 | | Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | [📄 Paper](https://arxiv.org/pdf/2412.08619) | - | - | 564 | | Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | [📄 Paper](https://arxiv.org/pdf/2403.07750) | - | - | 565 | | KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | [📄 Paper](https://arxiv.org/pdf/2409.14066) | - | - | 566 | | Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | [📄 Paper](https://arxiv.org/pdf/2410.13232) | - | - | 567 | 568 | 
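As a quick illustration of the synthetic-data direction listed in Section 5.6 (e.g., Synth2-style caption bootstrapping), below is a minimal sketch of generating synthetic image-text pairs with an off-the-shelf captioner when curated data is scarce. The Hugging Face `transformers` image-to-text pipeline, the BLIP checkpoint, and the `images/` folder are our own illustrative assumptions, not the pipeline of any paper above.

```python
# Minimal sketch (assumes `transformers` and `Pillow` are installed and that
# ./images/ contains a few .jpg files): bootstrap synthetic image-caption pairs
# with an off-the-shelf captioner, in the spirit of the synthetic-data works above.
import json
from pathlib import Path

from transformers import pipeline

# Illustrative checkpoint choice; any image-to-text model would work here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

records = []
for image_path in sorted(Path("images").glob("*.jpg")):
    # Each pipeline call returns a list like [{"generated_text": "..."}].
    caption = captioner(str(image_path))[0]["generated_text"]
    records.append({"image": image_path.name, "caption": caption})

# Write a small JSONL file that could seed further SFT or filtering.
with open("synthetic_captions.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The papers above go further (filtering, re-captioning with stronger VLMs, or generating image embeddings directly), but the basic loop is the same: caption, record, and reuse the pairs as training data.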