├── .idea
│   ├── vcs.xml
│   ├── misc.xml
│   ├── .gitignore
│   ├── inspectionProfiles
│   │   ├── profiles_settings.xml
│   │   └── Project_Default.xml
│   ├── Awesome-MLLM-Datasets.iml
│   └── modules.xml
├── LICENSE
└── README.md

/.idea/.gitignore:
--------------------------------------------------------------------------------
# Default ignored files
/shelf/
/workspace.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Jinbo Ma

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome-MLLM-Datasets

🚀🚀🚀This project aims to collect and collate datasets for multimodal large language model (MLLM) training, including but not limited to pre-training data, instruction fine-tuning data, and in-context learning data.

💡💡💡The goal of the project is to provide researchers with a comprehensive repository of resources, so that high-quality datasets are easier to find and use when developing and optimizing multimodal AI systems.

**Table of Contents**
- [Datasets of Pre-Training](#datasets-of-pre-training)
- [Datasets of Multimodal Instruction Tuning](#datasets-of-multimodal-instruction-tuning)
- [Datasets of In-Context Learning](#datasets-of-in-context-learning)
- [Datasets of Multimodal Chain-of-Thought](#datasets-of-multimodal-chain-of-thought)
- [Datasets of Multimodal RLHF](#datasets-of-multimodal-rlhf)
- [Benchmarks for Evaluation](#benchmarks-for-evaluation)

## Datasets of Pre-Training

*Column key: `#.X` — number of non-text items (images/videos/audio clips); `#.T` — number of texts; `#.X-T` — number of X–text pairs.*

| Name | #.X | #.T | #.X-T | Paper | Link | Type |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **WebLI** | 10B(Images) | 12B | 12B | [PaLI: A Jointly-Scaled Multilingual Language-Image Model](https://arxiv.org/pdf/2209.06794) | [Link](https://github.com/kyegomez/PALI) | Captions(109 languages) |
| **LAION-5B** | 5.9B(Images) | 5.9B | 5.9B | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402v1) | [Link](https://laion.ai/blog/laion-5b/) | Captions(Multiple languages) |
| **LAION-en** | 2.3B(Images) | 2.3B | 2.3B | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402v1) | [Link](https://laion.ai/blog/laion-5b/) | Captions(English) |
| **ALIGN** | 1.8B(Images) | 1.8B | 1.8B | [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/pdf/2102.05918v2) | [Link](https://research.google/blog/align-scaling-up-visual-and-vision-language-representation-learning-with-noisy-text-supervision/) | Captions(English) |
| **DataComp** | 1.4B(Images) | 1.4B | 1.4B | [DATACOMP: In search of the next generation of multimodal datasets](https://openreview.net/pdf?id=dVaWCDMBof) | [Link](https://huggingface.co/datasets/mlfoundations/datacomp_pools) | Captions(English) |
| **COYO** | 747M(Images) | 747M | 747M | [COYO-700M: Large-scale Image-Text Pair Dataset](https://github.com/kakaobrain/coyo-dataset) | [Link](https://github.com/kakaobrain/coyo-dataset) | Captions(English) |
| **LAION-COCO** | 600M(Images) | 600M | 600M | [LAION COCO: 600M SYNTHETIC CAPTIONS FROM LAION2B-EN](https://laion.ai/blog/laion-coco/) | [Link](https://huggingface.co/datasets/laion/laion-coco) | Captions(English) |
| **LAION-400M** | 400M(Images) | 400M | 400M | [LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs](https://arxiv.org/pdf/2111.02114v1) | [Link](https://laion.ai/blog/laion-400-open-dataset/) | Captions(English) |
| **Episodic WebLI** | 400M(Images) | 400M | 400M | [PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/pdf/2305.18565) | - | Captions(English) |
| **CLIP** | 400M(Images) | 400M | 400M | [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020) | [Link](https://github.com/openai/CLIP) | Captions(English) |
| **LTIP** | 312M(Images) | 312M | 312M | [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/pdf/2204.14198v2) | - | Captions(English) |
| **FILIP** | 300M(Images) | 300M | 300M | [FILIP: Fine-grained Interactive Language-Image Pre-Training](https://arxiv.org/pdf/2111.07783v1) | - | Captions(English) |
| **LAION-zh** | 142M(Images) | 142M | 142M | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402v1) | [Link](https://huggingface.co/datasets/wanng/laion-high-resolution-chinese) | Captions(Chinese) |
| **Obelics** | 353M(Images) | 115M | 141M | [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://openreview.net/pdf?id=SKN2hflBIZ) | [Link](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | Interleaved image-text web documents |
| **MMC4** | 571M(Images) | 43B | 101.2M | [Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text](https://arxiv.org/pdf/2304.06939v1) | [Link](https://github.com/allenai/mmc4) | Interleaved image-text |
| **Wukong** | 101M(Images) | 101M | 101M | [WuKong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework](https://arxiv.org/pdf/2202.06767) | [Link](https://huggingface.co/datasets/wanng/wukong100m) | Captions(Chinese) |
| **M3W** | 185M(Images) | 182GB | 43.3M | [Flamingo: a Visual Language Model for Few-Shot Learning](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf) | - | Captions(English) |
| **WIT** | 11.5M(Images) | 37.6M | 37.6M | [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/pdf/2103.01913v2) | [Link](https://huggingface.co/datasets/google/wit) | Captions(English) |
| **GQA** | 113K(Images) | 22M | 22M | [GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering](https://openaccess.thecvf.com/content_CVPR_2019/papers/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.pdf) | [Link](https://cs.stanford.edu/people/dorarad/gqa/about.html) | Visual reasoning and compositional question answering(English) |
| **CC12M** | 12.4M(Images) | 12.4M | 12.4M | [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/pdf/2102.08981v2) | [Link](https://github.com/google-research-datasets/conceptual-12m) | Captions(English) |
| **RedCaps** | 12M(Images) | 12M | 12M | [RedCaps: Web-curated image-text data created by the people, for the people](https://arxiv.org/pdf/2111.11431v1) | [Link](https://redcaps.xyz/) | Captions(English) |
| **Visual Genome** | 108K(Images) | 4.5M | 4.5M | [Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations](https://arxiv.org/pdf/1602.07332v1) | [Link](https://huggingface.co/datasets/visual_genome) | Annotations(English) |
| **ArXivCap** | 6.4M(Images) | 3.9M | 3.9M | [Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models](https://arxiv.org/pdf/2403.00231) | [Link](https://huggingface.co/datasets/MMInstruction/ArxivCap) | Captions(English) |
| **DVQA** | 300K(Images) | 3.5M | 3.5M | [DVQA: Understanding Data Visualizations via Question Answering](https://openaccess.thecvf.com/content_cvpr_2018/papers/Kafle_DVQA_Understanding_Data_CVPR_2018_paper.pdf) | [Link](https://github.com/kushalkafle/DVQA_dataset) | Question answering(English) |
| **CC3M** | 3.3M(Images) | 3.3M | 3.3M | [Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning](https://aclanthology.org/P18-1238.pdf) | [Link](https://huggingface.co/datasets/conceptual_captions) | Captions(English) |
| **MS-COCO** | 328K(Images) | 2.5M | 2.5M | [Microsoft COCO: Common Objects in Context](https://arxiv.org/pdf/1405.0312v3) | [Link](https://cocodataset.org/#home) | Object detection, segmentation, captions(English) |
| **AI Challenger Captions** | 300K(Images) | 1.5M | 1.5M | [AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding](https://arxiv.org/pdf/1711.06475) | [Link](https://github.com/AIChallenger/AI_Challenger_2017) | Captions(English) |
| **VQA v2** | 265K(Images) | 1.4M | 1.4M | [Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering](https://arxiv.org/pdf/1612.00837) | [Link](https://huggingface.co/datasets/vqa_v2) | Visual question answering(English) |
| **VisDial** | 120K(Images) | 1.2M | 1.2M | [Visual Dialog](https://arxiv.org/pdf/1611.08669v5) | [Link](https://visualdialog.org/) | Visual question answering(English) |
| **SBU(Image Caption)** | 1M(Images) | 1M | 1M | [Im2Text: Describing Images Using 1 Million Captioned Photographs](https://proceedings.neurips.cc/paper_files/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf) | [Link](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/SBUcaptions.md) | Captions(English) |
| **OCR-VQA** | 207K(Images) | 1M | 1M | [OCR-VQA: Visual Question Answering by Reading Text in Images](https://anandmishra22.github.io/files/mishra-OCR-VQA.pdf) | [Link](https://ocr-vqa.github.io/) | Visual question answering(English) |
| **COCO Caption** | 164K(Images) | 1M | 1M | [Microsoft COCO Captions: Data Collection and Evaluation Server](https://arxiv.org/pdf/1504.00325) | [Link](https://cocodataset.org/#home) | Object detection, segmentation, captions(English) |
| **CC595k** | 595K(Images) | 595K | 595K | [Visual Instruction Tuning](https://openreview.net/pdf?id=w0H2xGHlkw) | [Link](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | Captions(English) |
| **Visual-7W** | 47.3K(Images) | 328K | 328K | [Visual7W: Grounded Question Answering in Images](https://arxiv.org/pdf/1511.03416v4) | [Link](https://ai.stanford.edu/~yukez/visual7w/) | Visual question answering(English) |
| **Flickr30k** | 31K(Images) | 158K | 158K | [From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions](https://aclanthology.org/Q14-1006.pdf) | [Link](https://bryanplummer.com/Flickr30kEntities/) | Visual grounding(English) |
| **TextCaps** | 28K(Images) | 145K | 145K | [TextCaps: a Dataset for Image Captioning with Reading Comprehension](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123470732.pdf) | [Link](https://textvqa.org/textcaps/) | Captions(English) |
| **RefCOCO** | 20K(Images) | 142K | 142K | [ReferItGame: Referring to Objects in Photographs of Natural Scenes](https://aclanthology.org/D14-1086.pdf) | [Link](https://github.com/lichengunc/refer) | Visual grounding(English) |
| **RefCOCO+** | 20K(Images) | 142K | 142K | [Modeling Context in Referring Expressions](https://arxiv.org/pdf/1608.00272v3) | [Link](https://github.com/lichengunc/refer) | Visual grounding(English) |
| **RefCOCOg** | 26.7K(Images) | 85.5K | 85.5K | [Modeling Context in Referring Expressions](https://arxiv.org/pdf/1608.00272v3) | [Link](https://github.com/lichengunc/refer) | Visual grounding(English) |
| **TextVQA** | 28.4K(Images) | 45.3K | 45.3K | [Towards VQA Models That Can Read](https://openaccess.thecvf.com/content_CVPR_2019/papers/Singh_Towards_VQA_Models_That_Can_Read_CVPR_2019_paper.pdf) | [Link](https://textvqa.org/dataset/) | Visual question answering(English) |
| **DocVQA** | 12K(Images) | 50K | 50K | [DocVQA: A Dataset for VQA on Document Images](https://arxiv.org/pdf/2007.00398) | [Link](https://www.docvqa.org/datasets) | Document VQA(English) |
| **ST-VQA** | 23K(Images) | 32K | 32K | [Scene Text Visual Question Answering](https://openaccess.thecvf.com/content_ICCV_2019/papers/Biten_Scene_Text_Visual_Question_Answering_ICCV_2019_paper.pdf) | [Link](https://huggingface.co/datasets/vikhyatk/st-vqa) | Visual question answering(English) |
| **A-OKVQA** | 23.7K(Images) | 24.9K | 24.9K | [A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136680141.pdf) | [Link](https://github.com/allenai/aokvqa) | Visual question answering(English) |
| **ArxivQA** | 32K(Images) | 16.6K | 16.6K | [Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models](https://arxiv.org/pdf/2403.00231) | [Link](https://huggingface.co/datasets/MMInstruction/ArxivQA) | Visual question answering(English) |
| **OK-VQA** | 14K(Images) | 14K | 14K | [OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge](https://openaccess.thecvf.com/content_CVPR_2019/papers/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.pdf) | [Link](https://okvqa.allenai.org/) | Visual question answering(English) |
| **WebVid** | 10M(Videos) | 10M | 10M | [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https://openaccess.thecvf.com/content/ICCV2021/papers/Bain_Frozen_in_Time_A_Joint_Video_and_Image_Encoder_for_ICCV_2021_paper.pdf) | [Link](https://huggingface.co/datasets/TempoFunk/webvid-10M) | Captions(English) |
| **MSRVTT** | 10K(Videos) | 200K | 200K | [MSR-VTT: A Large Video Description Dataset for Bridging Video and Language](https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf) | [Link](https://opendatalab.com/OpenDataLab/MSR-VTT/) | Captions(English) |
| **YFCC100M** | 99.2M(Images), 0.8M(Videos) | - | - | [YFCC100M: The New Data in Multimedia Research](https://arxiv.org/pdf/1503.01817v2) | [Link](https://pypi.org/project/yfcc100m/) | - |
| **VSDial-CN** | 120K(Images), 1.2M(Audio) | 120K | 1.2M | [VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition](https://arxiv.org/pdf/2305.19972v2) | - | Visual spoken dialogue |
| **AISHELL-2** | - | - | 1M | [AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale](https://arxiv.org/pdf/1808.10583v2) | [Link](https://github.com/kaldi-asr/kaldi/tree/master/egs/aishell2) | Audio captions(Chinese) |
| **AISHELL-1** | - | - | 128K | [AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline](https://arxiv.org/pdf/1709.05522v1) | [Link](https://www.openslr.org/33/) | Audio captions(Chinese) |
| **WavCaps** | 403K(Audio) | 403K | 403K | [WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research](https://arxiv.org/pdf/2303.17395v2) | [Link](https://huggingface.co/datasets/cvssp/WavCaps) | Audio captions(English) |
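
Many of the corpora above are hosted on the Hugging Face Hub (see the Link column). The snippet below is only a rough sketch of how such an entry can be previewed: it assumes the Hugging Face `datasets` library is installed and uses the OBELICS ID linked above, while splits, field names, and access terms vary from dataset to dataset, so check each dataset card before relying on it.

```python
from datasets import load_dataset

# Stream the OBELICS interleaved image-text corpus listed above without a full download.
# Field names differ across the datasets in this table, so only inspect the first record
# instead of assuming a particular schema.
ds = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)
first_record = next(iter(ds))
print(sorted(first_record.keys()))
```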

## Datasets of Multimodal Instruction Tuning

*Column key: `I->O` — input→output modalities (I: image, V: video, A: audio, B: bounding box, 3D: point cloud, T: text, Tab: table, Web: web page); `Method` — Auto: automatically constructed, Manu: manually constructed, Inherit: adapted from existing datasets.*

| Name | I->O | Method | #.Instance | Paper | Link |
|:---|:---:|:---:|:---:|:---:|:---:|
| **MiniGPT-4's IT** | I+T->T | Auto | 5K | [MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models](https://arxiv.org/pdf/2304.10592.pdf) | [Link](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) |
| **StableLLaVA** | I+T->T | Auto+Manu | 126K | [Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data](https://arxiv.org/pdf/2308.10253) | [Link](https://github.com/icoz69/StableLLAVA) |
| **LLaVA-Instruct-150K** | I+T->T | Auto | 158K | [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485.pdf) | [Link](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) |
| **SVIT** | I+T->T | Auto | 4.2M | [SVIT: Scaling up Visual Instruction Tuning](https://arxiv.org/pdf/2307.04087) | [Link](https://huggingface.co/datasets/BAAI/SVIT) |
| **LLaVAR's IT** | I+T->T | Auto | 174K | [LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding](https://arxiv.org/pdf/2306.17107) | [Link](https://llavar.github.io/#data) |
| **ShareGPT4V's IT** | I+T->T | Auto+Manu | 102K | [ShareGPT4V: Improving Large Multi-modal Models with Better Captions](https://arxiv.org/pdf/2311.12793) | [Link](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) |
| **ShareGPT4Video's IT** | I+T->T | Auto+Manu | 4.84M | [ShareGPT4Video: Improving Video Understanding and Generation with Better Captions](https://arxiv.org/pdf/2406.04325v1) | [Link](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video) |
| **DRESS's IT** | I+T->T | Auto+Manu | 193K | [DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback](https://arxiv.org/pdf/2311.10081) | [Link](https://huggingface.co/datasets/YangyiYY/LVLM_NLF) |
| **SoM-LLaVA's IT** | I+T->T | Auto+Manu | 695K | [List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs](https://arxiv.org/abs/2404.16375) | [Link](https://huggingface.co/datasets/zzxslp/SoM-LLaVA) |
| **VideoChat's IT** | V+T->T | Auto | 11K | [VideoChat: Chat-Centric Video Understanding](https://arxiv.org/pdf/2305.06355) | [Link](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) |
| **Video-ChatGPT's IT** | V+T->T | Inherit | 100K | [Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models](https://arxiv.org/pdf/2306.05424.pdf) | [Link](https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder) |
| **Video-LLaMA's IT** | I/V+T->T | Auto | 171K | [Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/pdf/2306.02858) | [Link](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) |
| **InstructBLIP's IT** | I/V+T->T | Auto | 1.6M | [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/pdf/2305.06500) | [Link](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) |
| **X-InstructBLIP's IT** | I/V/3D/A+T->T | Auto | 1.8M | [X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning](https://arxiv.org/pdf/2311.18799) | [Link](https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip) |
| **MIMIC-IT** | I/V+T->T | Auto | 2.8M | [MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https://arxiv.org/pdf/2306.05425.pdf) | [Link](https://github.com/Luodian/Otter/blob/main/mimic-it/README.md) |
| **PandaGPT's IT** | I+T->T | Inherit | 160K | [PandaGPT: One Model To Instruction-Follow Them All](https://arxiv.org/pdf/2305.16355) | [Link](https://panda-gpt.github.io/) |
| **MGVLID** | I+B+T->T | Auto+Manu | 108K | [ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning](https://arxiv.org/pdf/2307.09474.pdf) | - |
| **M3IT** | I/V/B+T->T | Auto+Manu | 2.4M | [M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning](https://arxiv.org/pdf/2306.04387.pdf) | [Link](https://huggingface.co/datasets/MMInstruction/M3IT) |
| **LAMM-Dataset** | I+3D+T->T | Auto+Manu | 196K | [LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https://arxiv.org/pdf/2306.06687.pdf) | [Link](https://github.com/OpenLAMM/LAMM#lamm-dataset) |
| **BuboGPT's IT** | (I+A)/A+T->T | Auto | 9K | [BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs](https://arxiv.org/pdf/2307.08581.pdf) | [Link](https://huggingface.co/datasets/magicr/BuboGPT) |
| **mPLUG-DocOwl's IT** | I/Tab/Web+T->T | Inherit | - | [mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding](https://arxiv.org/pdf/2307.02499.pdf) | [Link](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocLLM) |
| **T2M** | T->I/V/A+T | Auto | 14.7K | [NExT-GPT: Any-to-Any Multimodal LLM](https://arxiv.org/pdf/2309.05519) | [Link](https://github.com/NExT-GPT/NExT-GPT/tree/main/data/IT_data/T-T+X_data) |
| **MosIT** | I+V+A+T->I+V+A+T | Auto+Manu | 5K | [NExT-GPT: Any-to-Any Multimodal LLM](https://arxiv.org/pdf/2309.05519) | [Link](https://github.com/NExT-GPT/NExT-GPT/tree/main/data/IT_data/MosIT_data) |
| **Osprey's IT** | I+T->T | Auto+Manu | 724K | [Osprey: Pixel Understanding with Visual Instruction Tuning](https://arxiv.org/pdf/2312.1003) | [Link](https://github.com/CircleRadon/Osprey) |
| **X-LLM** | I+V+A+T->T | Manu | 10K | [X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages](https://arxiv.org/pdf/2305.04160.pdf) | [Link](https://github.com/phellonchen/X-LLM) |
| **MULTIS** | I+T->T | Auto+Manu | 161.5K | [ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst](https://arxiv.org/pdf/2305.16103.pdf) | [Coming soon](https://iva-chatbridge.github.io/) |
| **DetGPT** | I+T->T | Auto | 30K | [DetGPT: Detect What You Need via Reasoning](https://arxiv.org/pdf/2305.14167.pdf) | [Link](https://github.com/OptimalScale/DetGPT/tree/main/dataset) |
| **LVIS-Instruct4V** | I+T->T | Auto | 220K | [To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning](https://arxiv.org/pdf/2311.07574.pdf) | [Link](https://huggingface.co/datasets/X2FD/LVIS-Instruct4V) |
| **GPT4Tools** | I+T->T | Auto | 71K | [GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction](https://arxiv.org/pdf/2305.18752.pdf) | [Link](https://github.com/StevenGrove/GPT4Tools#dataset) |
| **SparklesDialogue** | I+T->T | Auto | 6.4K | [✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models](https://arxiv.org/pdf/2308.16463.pdf) | [Link](https://github.com/HYPJUDY/Sparkles#sparklesdialogue) |
| **ColonINST** | I+T->T | Auto | 450K | [Frontiers in Intelligent Colonoscopy](https://arxiv.org/abs/2410.17241) (medical domain) | [Link](https://github.com/ai4colonoscopy/IntelliScope) |

[//]: # (| **UNK-VQA** | | | | [UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models](https://arxiv.org/pdf/2310.10942) | [Link](https://github.com/guoyang9/UNK-VQA) |)

[//]: # (| **VEGA** | | | | [VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models](https://arxiv.org/pdf/2406.10228) | [Link](https://github.com/zhourax/VEGA) |)

[//]: # (| **ALLaVA-4V** | | | | [ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model](https://arxiv.org/pdf/2402.11684.pdf) | [Link](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V) |)

[//]: # (| **IDK** | | | | [Visually Dehallucinative Instruction Generation: Know What You Don't Know](https://arxiv.org/pdf/2402.09717.pdf) | [Link](https://github.com/ncsoft/idk) |)

[//]: # (| **CAP2QA** | | | | [Visually Dehallucinative Instruction Generation](https://arxiv.org/pdf/2402.08348.pdf) | [Link](https://github.com/ncsoft/cap2qa) |)

[//]: # (| **M3DBench** | | | | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https://arxiv.org/pdf/2312.10763.pdf) | [Link](https://github.com/OpenM3D/M3DBench) |)

[//]: # (| **ViP-LLaVA-Instruct** | | | | [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/pdf/2312.00784.pdf) | [Link](https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct) |)

[//]: # (| **ComVint** | | | | [What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning](https://arxiv.org/pdf/2311.01487.pdf) | [Link](https://github.com/RUCAIBox/ComVint#comvint-data) |)

[//]: # (| **M-HalDetect** | | | | [Detecting and Preventing Hallucinations in Large Vision Language Models](https://arxiv.org/pdf/2308.06394.pdf) | - |)

[//]: # (| **PF-1M** | | | | [Visual Instruction Tuning with Polite Flamingo](https://arxiv.org/pdf/2307.01003.pdf) | [Link](https://huggingface.co/datasets/chendelong/PF-1M/tree/main) |)

[//]: # (| **ChartLlama** | | | | [ChartLlama: A Multimodal LLM for Chart Understanding and Generation](https://arxiv.org/pdf/2311.16483.pdf) | [Link](https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset) |)

[//]: # (| **MotionGPT** | | | | [MotionGPT: Human Motion as a Foreign Language](https://arxiv.org/pdf/2306.14795.pdf) | [Link](https://github.com/OpenMotionLab/MotionGPT) |)

[//]: # (| **LRV-Instruction** | | | | [Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](https://arxiv.org/pdf/2306.14565.pdf) | [Link](https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction) |)

[//]: # (| **Macaw-LLM** | | | | [Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration](https://arxiv.org/pdf/2306.09093.pdf) | [Link](https://github.com/lyuchenyang/Macaw-LLM/tree/main/data) |)

[//]: # (| **LLaVA-Med** | | | | [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https://arxiv.org/pdf/2306.00890.pdf) | [Coming soon](https://github.com/microsoft/LLaVA-Med#llava-med-dataset) |)

[//]: # (| **PMC-VQA** | | | | [PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering](https://arxiv.org/pdf/2305.10415.pdf) | [Coming soon](https://xiaoman-zhang.github.io/PMC-VQA/) |)

[//]: # (| **LMEye** | | | | [LMEye: An Interactive Perception Network for Large Language Models](https://arxiv.org/pdf/2305.03701.pdf) | [Link](https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2) |)

[//]: # (| **MultiInstruct** | | | | [MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning](https://arxiv.org/pdf/2212.10773.pdf) | [Link](https://github.com/VT-NLP/MultiInstruct) |)
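
Most of the image-grounded sets above (I+T->T) store their samples as multi-turn conversations about an image. The record below is only a schematic sketch modeled on the publicly documented LLaVA-Instruct-150K layout; the id, image path, question, and answer are invented placeholders, and other datasets in this table define their own field names, documented on their respective cards.

```python
# Schematic instruction-tuning record, modeled on the LLaVA-Instruct-150K layout.
# All concrete values below are invented placeholders for illustration only.
sample = {
    "id": "000000123456",                        # hypothetical sample id
    "image": "coco/train2017/000000123456.jpg",  # hypothetical image path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the photo holding?"},
        {"from": "gpt", "value": "The person is holding a red umbrella."},
    ],
}
```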

## Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|:---|:---:|:---:|:---:|
| **MIC** | [MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning](https://arxiv.org/pdf/2309.07915.pdf) | [Link](https://huggingface.co/datasets/BleachNick/MIC_full) | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs |
| **MIMIC-IT** | [MIMIC-IT: Multi-Modal In-Context Instruction Tuning](https://arxiv.org/pdf/2306.05425.pdf) | [Link](https://github.com/Luodian/Otter/blob/main/mimic-it/README.md) | Multimodal in-context instruction dataset |

## Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|:---|:---:|:---:|:---:|
| **EMER** | [Explainable Multimodal Emotion Reasoning](https://arxiv.org/pdf/2306.15401.pdf) | [Link](https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning) | A benchmark dataset for the explainable emotion reasoning task |
| **EgoCOT** | [EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought](https://arxiv.org/pdf/2305.15021.pdf) | [Link](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch) | Large-scale embodied planning dataset |
| **VIP** | [Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction](https://arxiv.org/pdf/2305.13903.pdf) | - | An inference-time dataset that can be used to evaluate VideoCOT |
| **ScienceQA** | [Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering](https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) | [Link](https://github.com/lupantech/ScienceQA#ghost-download-the-dataset) | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |

## Datasets of Multimodal RLHF

| Name | I->O | Method | #.Instance | Paper | Link |
|:---|:---:|:---:|:---:|:---:|:---:|
| **VLFeedback** | I+T->T | Auto | 80K | [Silkie: Preference Distillation for Large Visual Language Models](https://arxiv.org/pdf/2312.10665.pdf) | [Link](https://huggingface.co/datasets/MMInstruction/VLFeedback) |
| **LLaVA-RLHF** | I+T->T | Manu | 10K | [Aligning Large Multimodal Models with Factually Augmented RLHF](https://arxiv.org/pdf/2309.14525) | [Link](https://huggingface.co/datasets/zhiqings/LLaVA-Human-Preference-10K) |
| **DRESS's IT** | I+T->T | Auto+Manu | - | [DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback](https://arxiv.org/pdf/2311.10081) | [Link](https://huggingface.co/datasets/YangyiYY/LVLM_NLF) |
| **RLHF-V's IT** | I+T->T | Manu | 1.4K | [RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://arxiv.org/pdf/2312.00849) | [Link](https://huggingface.co/datasets/HaoyeZhang/RLHF-V-Dataset) |
| **RTVLM** | I+T->T | Auto+Manu | 5K | [Red Teaming Visual Language Models](https://arxiv.org/pdf/2401.12915) | [Link](https://huggingface.co/datasets/MMInstruction/RedTeamingVLM) |
| **VLGuard's IT** | I+T->T | Auto | 3K | [Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models](https://arxiv.org/pdf/2402.02207) | [Link](https://huggingface.co/datasets/ys-zong/VLGuard) |
| **MMViG** | I+T->T | Manu | 16K | [ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling](https://arxiv.org/pdf/2402.06118) | - |
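
The preference-style corpora above generally pair a prompt (usually with an image) with a preferred and a rejected response, sometimes with per-aspect scores or natural-language feedback. The record below is a generic, hypothetical illustration rather than the exact schema of any listed dataset; consult each dataset card for the real field names and scoring scales.

```python
# Generic, hypothetical multimodal preference record (not the exact schema of any
# dataset listed above); real field names and scoring conventions vary per release.
preference_sample = {
    "image": "images/0001.jpg",  # hypothetical image path
    "prompt": "Describe what is happening in this image.",
    "chosen": "Two children are playing football on a grass field.",
    "rejected": "A group of adults is swimming in a pool.",  # e.g. a hallucinated answer
}
```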

## Benchmarks for Evaluation

| Name | Paper | Link | Notes |
|:---|:---:|:---:|:---:|
| **MME-RealWorld** | [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://arxiv.org/pdf/2408.13257) | [Link](https://huggingface.co/datasets/yifanzhang114/MME-RealWorld) | A challenging benchmark that involves real-life scenarios |
| **CharXiv** | [CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs](https://arxiv.org/pdf/2406.18521) | [Link](https://huggingface.co/datasets/princeton-nlp/CharXiv) | Chart understanding benchmark curated by human experts |
| **Video-MME** | [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https://arxiv.org/pdf/2405.21075) | [Link](https://github.com/BradyFU/Video-MME) | A comprehensive evaluation benchmark for multimodal LLMs in video analysis |
| **VL-ICL Bench** | [VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning](https://arxiv.org/pdf/2403.13164.pdf) | [Link](https://github.com/ys-zong/VL-ICL) | A benchmark for M-ICL evaluation, covering a wide spectrum of tasks |
| **TempCompass** | [TempCompass: Do Video LLMs Really Understand Videos?](https://arxiv.org/pdf/2403.00476.pdf) | [Link](https://github.com/llyx97/TempCompass) | A benchmark to evaluate the temporal perception ability of video LLMs |
| **CoBSAT** | [Can MLLMs Perform Text-to-Image In-Context Learning?](https://arxiv.org/pdf/2402.01293.pdf) | [Link](https://huggingface.co/datasets/yzeng58/CoBSAT) | A benchmark for text-to-image ICL |
| **VQAv2-IDK** | [Visually Dehallucinative Instruction Generation: Know What You Don't Know](https://arxiv.org/pdf/2402.09717.pdf) | [Link](https://github.com/ncsoft/idk) | A benchmark for assessing "I Know" visual hallucination |
| **Math-Vision** | [Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset](https://arxiv.org/pdf/2402.14804.pdf) | [Link](https://github.com/mathvision-cuhk/MathVision) | A diverse mathematical reasoning benchmark |
| **CMMMU** | [CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark](https://arxiv.org/pdf/2401.11944.pdf) | [Link](https://github.com/CMMMU-Benchmark/CMMMU) | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |
| **MMCBench** | [Benchmarking Large Multimodal Models against Common Corruptions](https://arxiv.org/pdf/2401.11943.pdf) | [Link](https://github.com/sail-sg/MMCBench) | A benchmark for examining self-consistency under common corruptions |
| **MMVP** | [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/pdf/2401.06209.pdf) | [Link](https://github.com/tsb0601/MMVP) | A benchmark for assessing visual capabilities |
| **TimeIT** | [TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding](https://arxiv.org/pdf/2312.02051.pdf) | [Link](https://huggingface.co/datasets/ShuhuaiRen/TimeIT) | A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks |
| **ViP-Bench** | [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/pdf/2312.00784.pdf) | [Link](https://huggingface.co/datasets/mucai/ViP-Bench) | A benchmark for visual prompts |
| **M3DBench** | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https://arxiv.org/pdf/2312.10763.pdf) | [Link](https://github.com/OpenM3D/M3DBench) | A 3D-centric benchmark |
| **Video-Bench** | [Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models](https://arxiv.org/pdf/2311.16103.pdf) | [Link](https://github.com/PKU-YuanGroup/Video-Bench) | A benchmark for video-MLLM evaluation |
| **Charting-New-Territories** | [Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs](https://arxiv.org/pdf/2311.14656.pdf) | [Link](https://github.com/jonathan-roberts1/charting-new-territories) | A benchmark for evaluating geographic and geospatial capabilities |
| **MLLM-Bench** | [MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V](https://arxiv.org/pdf/2311.13951.pdf) | [Link](https://github.com/FreedomIntelligence/MLLM-Bench) | GPT-4V evaluation with per-sample criteria |
| **BenchLMM** | [BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models](https://arxiv.org/pdf/2312.02896.pdf) | [Link](https://huggingface.co/datasets/AIFEG/BenchLMM) | A benchmark for assessing robustness against different image styles |
| **MMC-Benchmark** | [MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning](https://arxiv.org/pdf/2311.10774.pdf) | [Link](https://github.com/FuxiaoLiu/MMC) | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
| **MVBench** | [MVBench: A Comprehensive Multi-modal Video Understanding Benchmark](https://arxiv.org/pdf/2311.17005.pdf) | [Link](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/MVBENCH.md) | A comprehensive multimodal benchmark for video understanding |
| **Bingo** | [Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges](https://arxiv.org/pdf/2311.03287.pdf) | [Link](https://github.com/gzcch/Bingo) | A benchmark for hallucination evaluation that focuses on two common types |
| **MagnifierBench** | [OtterHD: A High-Resolution Multi-modality Model](https://arxiv.org/pdf/2311.04219.pdf) | [Link](https://huggingface.co/datasets/Otter-AI/MagnifierBench) | A benchmark designed to probe models' fine-grained perception ability |
| **HallusionBench** | [HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models](https://arxiv.org/pdf/2310.14566.pdf) | [Link](https://github.com/tianyi-lab/HallusionBench) | An image-context reasoning benchmark for evaluation of hallucination |
| **PCA-EVAL** | [Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond](https://arxiv.org/pdf/2310.02071.pdf) | [Link](https://github.com/pkunlp-icler/PCA-EVAL) | A benchmark for evaluating multi-domain embodied decision-making |
| **MMHal-Bench** | [Aligning Large Multimodal Models with Factually Augmented RLHF](https://arxiv.org/pdf/2309.14525.pdf) | [Link](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) | A benchmark for hallucination evaluation |
| **MathVista** | [MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models](https://arxiv.org/pdf/2310.02255.pdf) | [Link](https://huggingface.co/datasets/AI4Math/MathVista) | A benchmark that challenges both visual and math reasoning capabilities |
| **SparklesEval** | [✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models](https://arxiv.org/pdf/2308.16463.pdf) | [Link](https://github.com/HYPJUDY/Sparkles#sparkleseval) | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns, based on three distinct criteria |
| **ISEKAI** | [Link-Context Learning for Multimodal LLMs](https://arxiv.org/pdf/2308.07891.pdf) | [Link](https://huggingface.co/ISEKAI-Portal) | A benchmark consisting exclusively of unseen generated image-label pairs, designed for link-context learning |
| **M-HalDetect** | [Detecting and Preventing Hallucinations in Large Vision Language Models](https://arxiv.org/pdf/2308.06394.pdf) | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| **I4** | [Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions](https://arxiv.org/pdf/2308.04152.pdf) | [Link](https://github.com/DCDmllm/Cheetah) | A benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions |
| **SciGraphQA** | [SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs](https://arxiv.org/pdf/2308.03349.pdf) | [Link](https://github.com/findalexli/SciGraphQA#data) | A large-scale chart-visual question-answering dataset |
| **MM-Vet** | [MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities](https://arxiv.org/pdf/2308.02490.pdf) | [Link](https://github.com/yuweihao/MM-Vet) | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
| **SEED-Bench** | [SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension](https://arxiv.org/pdf/2307.16125.pdf) | [Link](https://github.com/AILab-CVC/SEED-Bench) | A benchmark for evaluation of generative comprehension in MLLMs |
| **MMBench** | [MMBench: Is Your Multi-modal Model an All-around Player?](https://arxiv.org/pdf/2307.06281.pdf) | [Link](https://github.com/open-compass/MMBench) | A systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models |
| **Lynx** | [What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?](https://arxiv.org/pdf/2307.02469.pdf) | [Link](https://github.com/bytedance/lynx-llm#prepare-data) | A comprehensive evaluation benchmark including both image and video tasks |
| **GAVIE** | [Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](https://arxiv.org/pdf/2306.14565.pdf) | [Link](https://github.com/FuxiaoLiu/LRV-Instruction#evaluationgavie) | A benchmark to evaluate hallucination and instruction-following ability |
| **MME** | [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://arxiv.org/pdf/2306.13394.pdf) | [Link](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) | A comprehensive MLLM evaluation benchmark |
| **LVLM-eHub** | [LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models](https://arxiv.org/pdf/2306.09265.pdf) | [Link](https://github.com/OpenGVLab/Multi-Modality-Arena) | An evaluation platform for MLLMs |
| **LAMM-Benchmark** | [LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark](https://arxiv.org/pdf/2306.06687.pdf) | [Link](https://github.com/OpenLAMM/LAMM#lamm-benchmark) | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
| **M3Exam** | [M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models](https://arxiv.org/pdf/2306.05179.pdf) | [Link](https://github.com/DAMO-NLP-SG/M3Exam) | A multilingual, multimodal, multilevel benchmark for evaluating MLLMs |
| **OwlEval** | [mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/pdf/2304.14178.pdf) | [Link](https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval) | A dataset for evaluating multiple capabilities |
--------------------------------------------------------------------------------