# Awesome MLLM Hallucination [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
This repository collects research on the hallucination problem of Multimodal Large Language Models (MLLMs), including papers and their code/datasets.

✈️
The main aspects covered are **Surveys**, **Benchmarks**, **Hallucination Mitigation methods**, and some interesting papers that are not directly related to the topic. Since some papers are relatively new and we cannot be sure whether they have already been accepted by a specific conference, they are currently marked according to the acceptance status that Google Scholar reports.

Besides, we have extracted each paper's benchmark name or the category of its core solution so that you can read in a targeted manner; once a certain number of papers has accumulated, we plan to re-summarize them into a more reasonable classification. :fireworks:

If you find interesting papers that are not included, please feel free to contact me. We will continue to update this repository! :sunny:

:large_blue_diamond: citation >= 20   |   :star: citation >= 50   |   :fire: citation >= 100

## Contents
- [Surveys](#Surveys)
- [Benchmarks](#Benchmarks)
- [Hallucination Mitigation methods](#Hallucination-Mitigation-methods)
- [Others](#Others)

## Papers
### Surveys
| **Number** | **Title** | **Venue** | **Paper** | **Repo** | **Citation** |
|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|
|1| A Survey of Hallucination in “Large” Foundation Models| arxiv(23.09) | [![arXiv](https://img.shields.io/badge/arXiv-2309.05922-b31b1b.svg)](https://arxiv.org/pdf/2309.05922.pdf) | :heavy_minus_sign: | :star:|
|2| A Survey on Hallucination in Large Vision-Language Models| arxiv(24.02) | [![arXiv](https://img.shields.io/badge/arXiv-2402.00253-b31b1b.svg)](https://arxiv.org/pdf/2402.00253.pdf) | :heavy_minus_sign: | :heavy_minus_sign: |

### Benchmarks
Here are some works that evaluate the hallucination performance of MLLMs, including several popular benchmarks. Most of these works also produce fine-tuning data from their benchmark, which can reduce the likelihood of hallucination without sacrificing performance on other benchmarks, and some papers design clever ways to construct such datasets.
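Many of these benchmarks (e.g., POPE, NOPE, Ciem) are discriminative: the model is asked yes/no questions about object presence and scored with standard classification metrics, where a "yes" on an absent object counts as a hallucination. Below is a minimal sketch of such a scoring loop, for illustration only; the `query_mllm` stub, the JSONL probe format, and the field names are our assumptions, not any benchmark's official evaluation code.

```python
# Minimal sketch of a POPE-style yes/no hallucination probe (illustrative only).
# `query_mllm`, the JSONL format, and the field names are assumptions, not the
# official evaluation code of any benchmark listed below.
import json
from typing import Callable

def evaluate_probes(probe_file: str, query_mllm: Callable[[str, str], str]) -> dict:
    """Score yes/no probes; a 'yes' on an absent object is a hallucination (FP)."""
    tp = fp = tn = fn = 0
    with open(probe_file) as f:
        for line in f:
            ex = json.loads(line)  # e.g. {"image": "x.jpg", "question": "Is there a dog?", "answer": "no"}
            pred = "yes" if "yes" in query_mllm(ex["image"], ex["question"]).lower() else "no"
            gold = ex["answer"]
            if   pred == "yes" and gold == "yes": tp += 1
            elif pred == "yes" and gold == "no":  fp += 1  # hallucinated object
            elif pred == "no"  and gold == "no":  tn += 1
            else:                                 fn += 1
    precision = tp / max(tp + fp, 1)
    recall    = tp / max(tp + fn, 1)
    return {
        "accuracy":  (tp + tn) / max(tp + fp + tn + fn, 1),
        "precision": precision,
        "recall":    recall,
        "f1":        2 * precision * recall / max(precision + recall, 1e-9),
        "yes_ratio": (tp + fp) / max(tp + fp + tn + fn, 1),  # POPE also reports the answer "yes" ratio
    }
```

Generative benchmarks (e.g., the CHAIR-style OpenCHAIR below, or AMBER's generative track) instead parse objects out of free-form captions and compare them against ground-truth annotations.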

| **Number** | **Title** | **Venue** | **Paper** | **Repo** | **Citation** | **Benchmark Name** |
|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
|1|Evaluating Object Hallucination in Large Vision-Language Models|EMNLP(2023)|[![arXiv](https://img.shields.io/badge/arXiv-2305.10355-b31b1b.svg)](https://arxiv.org/pdf/2305.10355.pdf) | :heavy_minus_sign: | :fire: | POPE |
|2|MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models| arxiv(23.06) |[![arXiv](https://img.shields.io/badge/arXiv-2306.13394-b31b1b.svg)](https://arxiv.org/pdf/2306.13394.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)|:fire: | MME (comprehensive) |
|3|MMBench: Is Your Multi-modal Model an All-around Player?| arxiv(23.07) |[![arXiv](https://img.shields.io/badge/arXiv-2307.06281-b31b1b.svg)](https://arxiv.org/pdf/2307.06281.pdf) |:heavy_minus_sign:|:fire: | MMBench (comprehensive) |
|4|Evaluation and Analysis of Hallucination in Large Vision-Language Models| arxiv(23.08) |[![arXiv](https://img.shields.io/badge/arXiv-2308.15126-b31b1b.svg)](https://arxiv.org/pdf/2308.15126.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/junyangwang0410/HaELM)|:large_blue_diamond: | HaELM |
|5|Aligning Large Multimodal Models with Factually Augmented RLHF|arxiv(23.09) |[![arXiv](https://img.shields.io/badge/arXiv-2309.14525-b31b1b.svg)](https://arxiv.org/pdf/2309.14525.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://llava-rlhf.github.io)|:large_blue_diamond: | MMHAL-BENCH |
|6|HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models|arxiv(23.10) |[![arXiv](https://img.shields.io/badge/arXiv-2310.14566-b31b1b.svg)](https://arxiv.org/pdf/2310.14566.pdf) | [![Google Drive](https://img.shields.io/badge/Google-Drive-7395C5.svg)](https://drive.google.com/drive/folders/1C_IA5rx_Hm67TYpdNf3TL5VlM30TLGRQ) | :heavy_minus_sign: | HALLUSIONBENCH |
|7|Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models| arxiv(23.10) |[![arXiv](https://img.shields.io/badge/arXiv-2310.05338-b31b1b.svg)](https://arxiv.org/pdf/2310.05338.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| NOPE |
|8|HALLE-SWITCH: CONTROLLING OBJECT HALLUCINATION IN LARGE VISION LANGUAGE MODELS|arxiv(23.10) |[![arXiv](https://img.shields.io/badge/arXiv-2310.01779-b31b1b.svg)](https://arxiv.org/pdf/2310.01779.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/bronyayang/HallE_Switch)| :heavy_minus_sign: |CCEval|
|9|Ferret: Refer and ground anything anywhere at any granularity|arxiv(23.10)|[![arXiv](https://img.shields.io/badge/arXiv-2310.07704-b31b1b.svg)](https://arxiv.org/pdf/2310.07704.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/apple/ml-ferret)|:large_blue_diamond:|Ferret-Bench (evaluates the refer-and-ground capability)|
|10|Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges| arxiv(23.11) |[![arXiv](https://img.shields.io/badge/arXiv-2311.03287-b31b1b.svg)](https://arxiv.org/pdf/2311.03287.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/gzcch/Bingo)|:large_blue_diamond:| Bingo |
|11|AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation| arxiv(23.11) |[![arXiv](https://img.shields.io/badge/arXiv-2311.07397v2-b31b1b.svg)](https://arxiv.org/pdf/2311.07397v2.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/junyangwang0410/AMBER)|:heavy_minus_sign:| AMBER |
|12|Faithscore: Evaluating hallucinations in large vision-language models| arxiv(23.11) |[![arXiv](https://img.shields.io/badge/arXiv-2311.01477-b31b1b.svg)](https://arxiv.org/pdf/2311.01477.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/bcdnlp/FAITHSCORE)|:heavy_minus_sign:| Faithscore (metric)|
|13|Mitigating Hallucination in Visual Language Models with Visual Supervision|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.16479-b31b1b.svg)](https://arxiv.org/pdf/2311.16479.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|RAHBench|
|14|Mitigating Open-Vocabulary Caption Hallucinations|arxiv(23.12)|[![arXiv](https://img.shields.io/badge/arXiv-2312.03631-b31b1b.svg)](https://arxiv.org/pdf/2312.03631.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/assafbk/mocha_code)|:heavy_minus_sign:|OpenCHAIR|
|15|RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback|arxiv(23.12)|[![arXiv](https://img.shields.io/badge/arXiv-2312.00849v2-b31b1b.svg)](https://arxiv.org/pdf/2312.00849v2.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://rlhf-v.github.io)|:heavy_minus_sign:|MHumanEval|
|16|Ciem: Contrastive instruction evaluation method for better instruction tuning| NeurIPS(2023) Workshop|[![arXiv](https://img.shields.io/badge/arXiv-2309.02301-b31b1b.svg)](https://arxiv.org/pdf/2309.02301.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| Ciem (and CIT for mitigation)|
|17|Mitigating hallucination in large multimodal models via robust instruction tuning|ICLR(2024)|[![arXiv](https://img.shields.io/badge/openreview-net-b31b1b.svg)](https://openreview.net/pdf?id=J44HfH4JCg) |:heavy_minus_sign:|:large_blue_diamond:| GAVIE|
|18|Detecting and Preventing Hallucinations in Large Vision Language Models|AAAI(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2308.06394-b31b1b.svg)](https://arxiv.org/pdf/2308.06394.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/hendryx-scale/mhal-detect)|:large_blue_diamond:| M-HalDetect|
|19|Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites| MMM(2024) |[![arXiv](https://img.shields.io/badge/arXiv-2312.01701v1-b31b1b.svg)](https://arxiv.org/pdf/2312.01701v1.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/bcdnlp/FAITHSCORE)|:heavy_minus_sign:| FGHE/FOHE (an upgraded version of POPE) |
|20|Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models|AAAI-ReLM Workshop(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2309.04041-b31b1b.svg)](https://arxiv.org/pdf/2309.04041.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| MSG-MCQ|
|21|Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs|arxiv(24.01)|[![arXiv](https://img.shields.io/badge/arXiv-2401.06209-b31b1b.svg)](https://arxiv.org/pdf/2401.06209.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| MMVP|
|22|Visual Hallucinations of Multi-modal Large Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.14683-b31b1b.svg)](https://arxiv.org/pdf/2402.14683.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/wenhuang2000/VHTest)|:heavy_minus_sign:| two benchmarks generated by VHTest |
|23|Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.15721-b31b1b.svg)](https://arxiv.org/pdf/2402.15721.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|Hal-Eval (a new category: Event Hallucination)|
|24|GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.14973-b31b1b.svg)](https://arxiv.org/pdf/2402.14973.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/EQTPartners/GenCeption)|:heavy_minus_sign:| GenCeption (no need for high-quality annotations)|
|25|How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.13220-b31b1b.svg)](https://arxiv.org/pdf/2402.13220.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|MAD-Bench (a new category: Visual Confusion)|
|26|Unified Hallucination Detection for Multimodal Large Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.03190-b31b1b.svg)](https://arxiv.org/pdf/2402.03190.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/OpenKG-ORG/EasyDetect)|:heavy_minus_sign:| MHaluBench|
|27|The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.03757-b31b1b.svg)](https://arxiv.org/pdf/2402.03757.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/MasaiahHan/CorrelationQA)|:heavy_minus_sign:|CorrelationQA|
|28|Visual Hallucination: Definition, Quantification, and Prescriptive Remediations|arxiv(24.03)|[![arXiv](https://img.shields.io/badge/arXiv-2403.17306-b31b1b.svg)](https://arxiv.org/pdf/2403.17306.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|VHILT|
|29|EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.15596-b31b1b.svg)](https://arxiv.org/pdf/2311.15596.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/AdaCheng/EgoThink)|:heavy_minus_sign:|EgoThink|

### Hallucination Mitigation methods
Here are some labels that represent the core point of each paper, corresponding to mitigation methods from different angles; you can read the surveys mentioned earlier to better understand these categories:
__`data.`__: data improvement (most benchmarks)   |   __`vis.`__: vision enhancement   |
__`align.`__: multimodal alignment   |
__`dec.`__: decoding optimization   |   __`post.`__: post-process   |   __`other.`__: other kinds
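To make the `dec.` (decoding optimization) category concrete, here is a toy sketch in the spirit of contrastive decoding methods such as VCD: the next-token distribution conditioned on the original image is contrasted against the distribution conditioned on a distorted image, so tokens that remain likely without faithful visual evidence (i.e., language priors) get penalized. The function, the weight `alpha`, and the plausibility cutoff are illustrative assumptions, not any paper's reference implementation.

```python
# Toy sketch of VCD-style contrastive decoding (illustrative, not reference code).
# `logits_img` / `logits_distorted` stand in for two forward passes of the same
# MLLM on the original vs. a distorted image; alpha and the cutoff are made-up
# hyperparameters for this sketch.
import numpy as np

def contrastive_next_token(logits_img: np.ndarray,
                           logits_distorted: np.ndarray,
                           alpha: float = 1.0,
                           plausibility_cutoff: float = 0.1) -> int:
    """Pick the next token by boosting evidence that depends on the real image."""
    # Contrast the two distributions: tokens that stay likely even without
    # faithful visual input (pure language priors) are down-weighted.
    contrast = (1.0 + alpha) * logits_img - alpha * logits_distorted

    # Adaptive plausibility constraint: only keep tokens that are reasonably
    # probable under the original image-conditioned distribution.
    probs_img = np.exp(logits_img - logits_img.max())
    probs_img /= probs_img.sum()
    keep = probs_img >= plausibility_cutoff * probs_img.max()
    contrast = np.where(keep, contrast, -np.inf)

    return int(np.argmax(contrast))

# Usage with random stand-ins for the two forward passes:
rng = np.random.default_rng(0)
vocab_size = 32_000
print(contrastive_next_token(rng.normal(size=vocab_size), rng.normal(size=vocab_size)))
```

Other `dec.` entries below change the decoding search itself (e.g., OPERA's over-trust penalty and retrospection-allocation) rather than contrasting two image conditions.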

| **Number** | **Title** | **Venue** | **Paper** | **Repo** | **Citation** | **Core** |
|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
|1|VCoder: Versatile Vision Encoders for Multimodal Large Language Models|CVPR(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2312.14233-b31b1b.svg)](https://arxiv.org/pdf/2312.14233.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/SHI-Labs/VCoder)|:heavy_minus_sign:| __`vis.`__ |
|2|Ferret: Refer and ground anything anywhere at any granularity|arxiv(23.10)|[![arXiv](https://img.shields.io/badge/arXiv-2310.07704-b31b1b.svg)](https://arxiv.org/pdf/2310.07704.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/apple/ml-ferret)|:large_blue_diamond:| __`vis.`__ |
|3|Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model|arxiv(23.10)|[![arXiv](https://img.shields.io/badge/arXiv-2310.20357-b31b1b.svg)](https://arxiv.org/pdf/2310.20357.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|__`vis.`__|
|4|Video-LLaVA: Learning United Visual Representation by Alignment Before Projection|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.10122-b31b1b.svg)](https://arxiv.org/pdf/2311.10122.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/PKU-YuanGroup/Video-LLaVA)|:large_blue_diamond:| __`vis.`__ |
|5|Mitigating Hallucination in Visual Language Models with Visual Supervision|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.16479-b31b1b.svg)](https://arxiv.org/pdf/2311.16479.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`vis.`__ (with SAM -> in-context)|
|6|LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.11860-b31b1b.svg)](https://arxiv.org/pdf/2311.11860.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/rshaojimmy/JiuTian)|:heavy_minus_sign:| __`vis.`__ |
|7|DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.14767-b31b1b.svg)](https://arxiv.org/pdf/2402.14767.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/DualFocus)|:heavy_minus_sign:| __`vis.`__ |
|8|LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images|arxiv(24.03)|[![arXiv](https://img.shields.io/badge/arXiv-2403.11703-b31b1b.svg)](https://arxiv.org/pdf/2403.11703.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/thunlp/LLaVA-UHD)|:heavy_minus_sign:| __`vis.`__ |
|9|Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models|arxiv(23.08)|[![arXiv](https://img.shields.io/badge/arXiv-2308.13437-b31b1b.svg)](https://arxiv.org/pdf/2308.13437.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/PVIT-official/PVI)|:large_blue_diamond:| __`vis.`__ __`align.`__|
|10|GROUNDHOG: Grounding Large Language Models to Holistic Segmentation|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.16846-b31b1b.svg)](https://arxiv.org/pdf/2402.16846.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://groundhog-mllm.github.io/)|:heavy_minus_sign:| __`vis.`__ __`align.`__|
|11|Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training|arxiv(23.08)|[![arXiv](https://img.shields.io/badge/arXiv-2210.07688-b31b1b.svg)](https://arxiv.org/pdf/2210.07688.pdf) |:heavy_minus_sign:|:large_blue_diamond:| __`align.`__|
|12|Hallucination Augmented Contrastive Learning for Multimodal Large Language Model|arxiv(23.12)|[![arXiv](https://img.shields.io/badge/arXiv-2312.06968v3-b31b1b.svg)](https://arxiv.org/pdf/2312.06968v3.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl)|:heavy_minus_sign:| __`align.`__|
|13|OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation|CVPR(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2311.17911-b31b1b.svg)](https://arxiv.org/pdf/2311.17911.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/shikiw/OPERA)|:heavy_minus_sign:| __`dec.`__ |
|14|Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding **(VCD)**|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.16922-b31b1b.svg)](https://arxiv.org/pdf/2311.16922.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/DAMO-NLP-SG/VCD)|:heavy_minus_sign:| __`dec.`__ |
|15|Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.15300v1-b31b1b.svg)](https://arxiv.org/pdf/2402.15300v1.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|__`dec.`__|
|16|IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.18476-b31b1b.svg)](https://arxiv.org/pdf/2402.18476.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`dec.`__ |
|17|HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding|arxiv(24.03)|[![arXiv](https://img.shields.io/badge/arXiv-2403.00425-b31b1b.svg)](https://arxiv.org/pdf/2403.00425.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/BillChan226/HALC)|:heavy_minus_sign:| __`dec.`__ |
|18|Woodpecker: Hallucination Correction for Multimodal Large Language Models|arxiv(23.10)|[![arXiv](https://img.shields.io/badge/arXiv-2310.16045-b31b1b.svg)](https://arxiv.org/pdf/2310.16045.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/BradyFU/Woodpecker)|:large_blue_diamond:| __`post.`__ |
|19|Analyzing and mitigating object hallucination in large vision-language models **(LURE)**|arxiv(23.10)|[![arXiv](https://img.shields.io/badge/arXiv-2310.00754-b31b1b.svg)](https://arxiv.org/pdf/2310.00754.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/YiyangZhou/LURE)|:large_blue_diamond:| __`post.`__ |
|20|TEMPORAL INSIGHT ENHANCEMENT: MITIGATING TEMPORAL HALLUCINATION IN MULTIMODAL LARGE LANGUAGE MODELS|arxiv(24.01)|[![arXiv](https://img.shields.io/badge/arXiv-2401.09861v1-b31b1b.svg)](https://arxiv.org/pdf/2401.09861v1.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`post.`__ (Correct with Tools)|
|21|VIGC: Visual Instruction Generation and Correction|arxiv(23.08)|[![arXiv](https://img.shields.io/badge/arXiv-2308.12714-b31b1b.svg)](https://arxiv.org/pdf/2308.12714.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-opendatalab-7395C5.svg)](https://opendatalab.github.io/VIGC/)|:heavy_minus_sign:| __`other.`__ (Iterative Generation)|
|22|Can We Edit Multimodal Large Language Models?|EMNLP(2023)|[![arXiv](https://img.shields.io/badge/arXiv-2310.08475-b31b1b.svg)](https://arxiv.org/pdf/2310.08475.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/zjunlp/EasyEdit)|:heavy_minus_sign:| __`other.`__ (Model Editing)|
|23|HALO: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models|arxiv(23.08)|[![arXiv](https://img.shields.io/badge/arXiv-2308.11764-b31b1b.svg)](https://arxiv.org/pdf/2308.11764.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/EngSalem/HaLo)|:heavy_minus_sign:| __`other.`__ (Knowledge Injection and Teacher-Student Approaches)|
|24|VOLCANO: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.07362-b31b1b.svg)](https://arxiv.org/pdf/2311.07362.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/kaistAI/Volcano)|:heavy_minus_sign:| __`other.`__ (Self-Feedback as Visual Cues -> in-context)|
|25|Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization **(HA-DPO)**|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.16839-b31b1b.svg)](https://arxiv.org/pdf/2311.16839.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/opendatalab/HA-DPO)|:heavy_minus_sign:| __`other.`__ (trains the model to favor the non-hallucinating response as a preference-selection task)|
|26|SILKIE: Preference Distillation for Large Visual Language Models|arxiv(23.12)|[![arXiv](https://img.shields.io/badge/arXiv-2312.10665-b31b1b.svg)](https://arxiv.org/pdf/2312.10665.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/vlf-silkie/VLFeedback)|:heavy_minus_sign:| __`other.`__ (Preference Distillation)|
|27|Mitigating Open-Vocabulary Caption Hallucinations **(MOCHa)**|arxiv(23.12)|[![arXiv](https://img.shields.io/badge/arXiv-2312.03631-b31b1b.svg)](https://arxiv.org/pdf/2312.03631.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/assafbk/mocha_code)|:heavy_minus_sign:|__`other.`__ (Multi-Objective RL)|
|28|Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.14545-b31b1b.svg)](https://arxiv.org/pdf/2402.14545.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/yuezih/less-is-more)|:heavy_minus_sign:| __`other.`__ (Selective EOS Supervision; Data Filtering)|
|29|Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.11622-b31b1b.svg)](https://arxiv.org/pdf/2402.11622.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/Hyperwjf/LogicCheckGPT)|:heavy_minus_sign:| __`other.`__ (through Logical Closed Loops [answer verification])|
|30|EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.09801-b31b1b.svg)](https://arxiv.org/pdf/2402.09801.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`other.`__ (Unlearning)|
|31|Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.15721-b31b1b.svg)](https://arxiv.org/pdf/2402.15721.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`other.`__ (CoT)|
|32|All in a Single Image: Large Multimodal Models are In-Image Learners|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.17971v1-b31b1b.svg)](https://arxiv.org/pdf/2402.17971v1.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/AGI-Edgerunners/IIL)|:heavy_minus_sign:| __`other.`__ (In-Image Learning Mechanism)|
|33|Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance **(MARINE)**|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.08680-b31b1b.svg)](https://arxiv.org/pdf/2402.08680.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`other.`__ (classifier-free guidance)|
|34|SKIP \N: A SIMPLE METHOD TO REDUCE HALLUCINATION IN LARGE VISION-LANGUAGE MODELS|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.01345v1-b31b1b.svg)](https://arxiv.org/pdf/2402.01345v1.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/hanmenghan/Skip-n)|:heavy_minus_sign:| __`other.`__ (Suppress Misleading Sign '\N')|
|35|Evaluating and Mitigating Number Hallucinations in Large Vision-Language Models: A Consistency Perspective|arxiv(24.03)|[![arXiv](https://img.shields.io/badge/arXiv-2403.01373-b31b1b.svg)](https://arxiv.org/pdf/2403.01373.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`other.`__ (consistency perspective on number hallucination)|
|36|Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation|arxiv(24.04)|[![arXiv](https://img.shields.io/badge/arXiv-2404.06809-b31b1b.svg)](https://arxiv.org/pdf/2404.06809.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://github.com/panruotong/CAG)|:heavy_minus_sign:| __`other.`__ (CAG)|
|37|Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining|WACV(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2311.03964-b31b1b.svg)](https://arxiv.org/pdf/2311.03964) |[![GitHub Page](https://img.shields.io/badge/GitHub-code-7395C5.svg)](https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html)|:heavy_minus_sign:| __`data.`__ |
|38|Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning|arxiv(24.04)|[![arXiv](https://img.shields.io/badge/arXiv-2404.10332-b31b1b.svg)](https://arxiv.org/pdf/2404.10332.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`data.`__ |
|39|TextSquare: Scaling up Text-Centric Visual Instruction Tuning|arxiv(24.04)|[![arXiv](https://img.shields.io/badge/arXiv-2404.12803-b31b1b.svg)](https://arxiv.org/pdf/2404.12803.pdf) |:heavy_minus_sign:|:heavy_minus_sign:| __`data.`__ |

### Others
Here are some papers that are not directly related to MLLM hallucinations, but may offer unexpected inspiration.

| **Number** | **Title** | **Venue** | **Paper** | **Repo** | **Citation** |
|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|
|1|Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts|ICML(2022)|[![arXiv](https://img.shields.io/badge/arXiv-2111.08276-b31b1b.svg)](https://arxiv.org/pdf/2111.08276.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/zengyan-97/X-VLM)|:fire:|
|2|Locating and Editing Factual Associations in GPT|NeurIPS(2022)|[![arXiv](https://img.shields.io/badge/arXiv-2202.05262-b31b1b.svg)](https://arxiv.org/pdf/2202.05262.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/kmeng01/rome)|:fire:|
|3|Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances|COLING(2022)|[![arXiv](https://img.shields.io/badge/arXiv-2209.08529-b31b1b.svg)](https://arxiv.org/pdf/2209.08529.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/wyk-nku/Distinguishing-VQA)|:heavy_minus_sign:|
|4|Hallucination improves the performance of unsupervised visual representation learning|ICCV(2023)|[![arXiv](https://img.shields.io/badge/arXiv-2307.12168-b31b1b.svg)](https://arxiv.org/pdf/2307.12168.pdf) |:heavy_minus_sign:|:heavy_minus_sign:|
|5|Direct Preference Optimization: Your Language Model is Secretly a Reward Model|NeurIPS(2023)|[![arXiv](https://img.shields.io/badge/arXiv-2305.18290-b31b1b.svg)](https://arxiv.org/pdf/2305.18290.pdf) |:heavy_minus_sign:|:fire:|
|6|A Survey on Multimodal Large Language Models| arxiv(23.06) | [![arXiv](https://img.shields.io/badge/arXiv-2306.13549-b31b1b.svg)](https://arxiv.org/pdf/2306.13549.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)|:fire:|
|7|Recognize Anything: A Strong Image Tagging Model| arxiv(23.06) | [![arXiv](https://img.shields.io/badge/arXiv-2306.03514-b31b1b.svg)](https://arxiv.org/pdf/2306.03514.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://recognize-anything.github.io/)|:star:|
|8|RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback|arxiv(23.09)|[![arXiv](https://img.shields.io/badge/arXiv-2309.00267-b31b1b.svg)](https://arxiv.org/pdf/2309.00267.pdf) |:heavy_minus_sign:|:fire:|
|9|Cognitive Mirage: A Review of Hallucinations in Large Language Models|arxiv(23.09)|[![arXiv](https://img.shields.io/badge/arXiv-2309.06794-b31b1b.svg)](https://arxiv.org/pdf/2309.06794.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/hongbinye/Cognitive-Mirage-Hallucinations-in-LLMs)|:large_blue_diamond:|
|10|The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)|arxiv(23.09)|[![arXiv](https://img.shields.io/badge/arXiv-2309.17421-b31b1b.svg)](https://arxiv.org/pdf/2309.17421.pdf) |:heavy_minus_sign:|:fire:|
|11|Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models|arxiv(23.11)|[![arXiv](https://img.shields.io/badge/arXiv-2311.06607-b31b1b.svg)](https://arxiv.org/pdf/2311.06607.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/Yuliang-Liu/Monkey)|:large_blue_diamond:|
|12|Polos: Multimodal Metric Learning from Human Feedback for Image Captioning|CVPR(2024)|[![arXiv](https://img.shields.io/badge/arXiv-2402.18091-b31b1b.svg)](https://arxiv.org/pdf/2402.18091.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/keio-smilab24/polos)|:heavy_minus_sign:|
|13|Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections|arxiv(24.02)|[![arXiv](https://img.shields.io/badge/arXiv-2402.16973-b31b1b.svg)](https://arxiv.org/pdf/2402.16973.pdf) |[![GitHub Page](https://img.shields.io/badge/GitHub-Code-7395C5.svg)](https://github.com/lingjunzhao/HEAR)|:heavy_minus_sign:|