# Awesome-Multimodal-Reasoning

**Contributions are most welcome.** If you have any suggestions or improvements, feel free to open an issue or raise a pull request.

## Contents
- [Model](#model)
  - [Image MLLM](#image-mllm)
  - [Video MLLM](#video-mllm)
  - [Audio MLLM](#audio-mllm)
  - [Image/Video Generation](#imagevideo-generation)
  - [LLM](#llm)
- [Benchmark](#benchmark)
- [Data](#data)
- [Survey](#survey)

## Model

### Image MLLM

| Date | Project | SFT | RL | Task |
| ---- | ------- | --- | -- | ---- |
| 25.03 | Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [[📑Paper]](https://arxiv.org/abs/2503.20752)[[🖥️Code]](https://tanhuajie.github.io/ReasonRFT) | SFT-based Reasoning Activation | GRPO | Visual Counting, Structure Perception, Spatial Transformation |
| 25.03 | Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [[📑Paper]](https://arxiv.org/abs/2503.21696)[[🖥️Code]](https://github.com/zwq2018/embodied_reasoner) | 9.3k observation-thought-action trajectories | - | Interactive Embodied Tasks |
| 25.03 | OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [[📑Paper]](https://arxiv.org/pdf/2503.17352v1)[[🖥️Code]](https://github.com/yihedeng9/OpenVLThinker) | Iterative SFT | Iterative GRPO | Various VQA |
| 25.03 | Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning [[📑Paper]](https://arxiv.org/pdf/2503.18013v1)[[🖥️Code]](https://github.com/jefferyZhan/Griffon/tree/master/Vision-R1) | - | GRPO | Object localization |
| 25.03 | R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [[📑Paper]](https://arxiv.org/abs/2503.12937)[[🖥️Code]](https://github.com/jingyi0000/R1-VL) | Mulberry-260k | StepGRPO with 10k data from Mulberry-260k | Various VQA |
| 25.03 | MetaSpatial [[🖥️Code]](https://github.com/PzySeere/MetaSpatial) | - | GRPO | 3D spatial reasoning |
| 25.03 | CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [[📑Paper]](https://arxiv.org/pdf/2503.05255) | 260k SFT data | - | Multi-Image Benchmark |
| 25.03 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [[📑Paper]](https://arxiv.org/abs/2503.10291)[[model]](https://huggingface.co/OpenGVLab/VisualPRM-8B)[[data]](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K)[[benchmark]](https://huggingface.co/datasets/OpenGVLab/VisualProcessBench) | - | Process Reward Model | Math & MMMU |
| 25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [[📑Paper]](https://arxiv.org/pdf/2503.10615)[[Project website]](https://yangyi-vai.notion.site/r1-onevision)[[🖥️Code]](https://github.com/Fancy-MLLM/R1-Onevision) | - | GRPO with 155k R1-Onevision data | Math |
| 25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [[🖥️Code]](https://github.com/LengSicong/MMR1) | - | GRPO | Math |
| 25.03 (CVPR 2025) | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [[📑Paper]](https://arxiv.org/pdf/2503.06514) | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
| 25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [[📑Paper]](https://arxiv.org/pdf/2503.07523)[[🖥️Code]](https://github.com/zhangquanchen/VisRL) | Warm-up | DPO | Various VQA |
| 25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [[📑Paper]](https://arxiv.org/abs/2503.01785)[[🖥️Code]](https://github.com/Liuziyu77/Visual-RFT) | - | GRPO | Detection, Grounding, Classification |
| 25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [[📑Paper]](https://arxiv.org/pdf/2503.07536)[[🖥️Code]](https://github.com/TideDra/lmm-r1) | - | PPO | Math, Sokoban-Global, Football-Online |
| 25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [[📑Paper]](https://arxiv.org/pdf/2503.07065) | Self-Improvement Training | GRPO | Detection, Classification, Math |
| 25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [[📑Paper]](https://arxiv.org/abs/2503.06749)[[🖥️Code]](https://github.com/Osilly/Vision-R1) | - | GRPO | Math |
| 25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [[📑Paper]](https://arxiv.org/abs/2503.06520)[[🖥️Code]](https://github.com/dvlab-research/Seg-Zero) | - | GRPO | RefCOCO & ReasonSeg |
| 25.03 | R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model [[📑Paper]](https://arxiv.org/abs/2503.05132)[[🖥️Code]](https://github.com/turningpoint-ai/VisualThinker-R1-Zero) | - | GRPO | CVBench |
| 25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [[📑Paper]](https://arxiv.org/abs/2503.07365)[[🖥️Code]](https://github.com/ModalMinds/MM-EUREKA) | - | RLOO with 54.9k general science/math/chart QA | Math |
| 25.03 | Unified Reward Model for Multimodal Understanding and Generation [[📑Paper]](https://arxiv.org/abs/2503.05236)[[🖥️Code]](https://codegoat24.github.io/UnifiedReward/) | - | DPO | Various VQA & Generation |
| 25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [[🖥️Code]](https://github.com/hiyouga/EasyR1) | - | GRPO | Geometry3K |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [[📑Paper]](https://arxiv.org/abs/2502.10391)[[🖥️Code]](https://mm-rlhf.github.io/) | - | DPO with 120k fine-grained, human-annotated preference comparison pairs | Reward & Various VQA |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [[📑Paper]](https://arxiv.org/abs/2502.18411)[[🖥️Code]](https://github.com/PhoenixZ810/OmniAlign-V) | 200k SFT data | DPO | Alignment & Various VQA |
| 25.02 | Multimodal Open R1 [[🖥️Code]](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | - | GRPO | Mathvista-mini, MMMU |
| 25.02 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [[🖥️Code]](https://github.com/om-ai-lab/VLM-R1) | - | GRPO | Referring Expression Comprehension |
| 25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [[🖥️Code]](https://github.com/Deep-Agent/R1-V) | - | GRPO | Item Counting, Number Related Reasoning and Geometry Reasoning |
| 25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [[📑Paper]](https://arxiv.org/abs/2501.01904)[[🖥️Code]](https://github.com/RUCAIBox/Virgo) | 2k text data from R1/QwQ and visual data from QvQ/SD | - | Math & MMMU |
| 25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [[📑Paper]](https://arxiv.org/abs/2501.12368)[[🖥️Code]](https://github.com/InternLM/InternLM-XComposer) | - | PPO | Reward & Various VQA |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [[📑Paper]](https://arxiv.org/abs/2501.06186)[[🖥️Code]](https://github.com/mbzuai-oryx/LlamaV-o1) | LLaVA-CoT-100k & PixMo subset | - | VRC-Bench & Various VQA |
| 24.12 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [[📑Paper]](https://arxiv.org/abs/2412.18319)[[🖥️Code]](https://github.com/HJYao00/Mulberry) | 260k reasoning and reflection SFT data by collective MCTS | - | Various VQA |
| 24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [[📑Paper]](https://arxiv.org/abs/2411.10440)[[🖥️Code]](https://github.com/PKU-YuanGroup/LLaVA-CoT) | LLaVA-CoT-100k by GPT-4o | - | Various VQA |
| 24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [[📑Paper]](https://arxiv.org/abs/2411.14432)[[🖥️Code]](https://github.com/dongyh20/Insight-V) | SFT for agents | Iterative DPO | Various VQA |
| 24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [[📑Paper]](https://arxiv.org/abs/2411.10442) | - | MPO | Various VQA |
| 24.10 | Improve Vision Language Model Chain-of-thought Reasoning [[📑Paper]](https://arxiv.org/pdf/2410.16198)[[🖥️Code]](https://github.com/RifleZhang/LLaVA-Reasoner-DPO) | 193k CoT SFT data by GPT-4o | DPO | Various VQA |
| 24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [[📑Paper]](https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf)[[🖥️Code]](https://github.com/deepcs233/Visual-CoT) | visual chain-of-thought dataset comprising 438k data items | - | Various VQA |
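GRPO dominates the RL column above, typically paired with R1-style rule-based rewards rather than a learned reward model. For orientation, the sketch below shows those two recurring ingredients in minimal Python; it is not taken from any repository listed here, and the `<think>`/`<answer>` format and reward weights are illustrative assumptions.

```python
import re
import statistics

def rule_based_reward(response: str, ground_truth: str) -> float:
    """R1-style rule-based reward: a format term plus an accuracy term.
    The tag format and the 0.5/1.0 weights are illustrative assumptions."""
    # Format reward: reasoning wrapped in <think> tags, answer in <answer> tags.
    fmt_ok = bool(re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*",
                               response))
    fmt_reward = 0.5 if fmt_ok else 0.0
    # Accuracy reward: exact match of the extracted final answer.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    acc_reward = 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0
    return fmt_reward + acc_reward

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO's critic-free advantage: z-score each reward within the group of
    responses sampled for the same prompt (no learned value baseline)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: four responses sampled for one counting question ("7").
group = [
    "<think>three boxes left, four right</think><answer>7</answer>",
    "<think>rough guess</think><answer>5</answer>",
    "7",  # correct number but wrong format: no answer tag extracted, zero reward
    "<think>counted twice, 7 total</think><answer>7</answer>",
]
rewards = [rule_based_reward(r, "7") for r in group]
print(rewards)                             # [1.5, 0.5, 0.0, 1.5]
print(group_relative_advantages(rewards))  # positive for the 1.5-reward samples
```

The sampled group supplies its own baseline, so no value network is trained; that is much of what makes these recipes cheap enough for the small (2B–8B) models featured above.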
### Video MLLM

| Date | Project | SFT | RL | Task |
| ---- | ------- | --- | -- | ---- |
| 25.04 | VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [[📑Paper]](https://arxiv.org/pdf/2504.06958)[[🖥️Code]](https://github.com/OpenGVLab/VideoChat-R1) | - | GRPO | General video understanding, temporal grounding, object tracking, QA and grounding QA, video captioning, and video quality assessment |
| 25.03 | Open-LLaVA-Video-R1 [[🖥️Code]](https://github.com/Hui-design/Open-LLaVA-Video-R1) | - | GRPO | DVD-counting |
| 25.03 | TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [[📑Paper]](https://arxiv.org/abs/2503.13377)[[🖥️Code]](https://github.com/www-Ye/TimeZero) | - | GRPO | Temporal Grounding |
| 25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [[📑Paper]](https://arxiv.org/abs/2503.05379)[[🖥️Code]](https://github.com/HumanMLLM/R1-Omni) | Cold start | GRPO | Emotion recognition |
| 25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [[📑Paper]](https://arxiv.org/abs/2502.11775) | Cold start | DPO | Various video QA |
| 25.02 | Open-R1-Video [[🖥️Code]](https://github.com/Wang-Xiaodong1899/Open-R1-Video) | - | GRPO | LongVideoBench |
| 25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [[🖥️Code]](https://github.com/tulerfeng/Video-R1) | - | GRPO | DVD-counting |
| 25.01 | Temporal Preference Optimization for Long-Form Video Understanding [[📑Paper]](https://arxiv.org/abs/2501.13919)[[🖥️Code]](https://ruili33.github.io/tpo_website/) | - | DPO | Various video QA |
| 25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [[📑Paper]](https://arxiv.org/abs/2501.07888) | Main training | DPO | Video caption & QA |

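Several entries above (video-SALMONN-o1, Temporal Preference Optimization, Tarsier2) use DPO instead of an on-policy method. Below is a minimal PyTorch sketch of the standard DPO loss over precomputed per-sequence log-probabilities; the tensor values in the example are made up for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: widen the policy's implicit reward margin (log-prob
    ratio against a frozen reference model) in favor of the chosen response.
    Inputs are summed per-sequence log-probs of each response."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Two preference pairs with made-up summed log-probs.
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-12.3, -40.1]),
    policy_logp_rejected=torch.tensor([-15.8, -38.2]),
    ref_logp_chosen=torch.tensor([-13.0, -41.0]),
    ref_logp_rejected=torch.tensor([-14.9, -39.5]),
)
print(loss)  # scalar; decreases as chosen responses gain probability mass
```

Because the loss only needs log-probs from the policy and a frozen reference model on fixed preference pairs, no sampling loop or reward model is required at training time.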
### Audio MLLM

| Date | Project | SFT | RL | Task |
| ---- | ------- | --- | -- | ---- |
| 25.03 | Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [[📑Paper]](https://arxiv.org/abs/2503.11197)[[🖥️Code]](https://github.com/xiaomi-research/r1-aqa) | - | GRPO | AudioQA |

### Image/Video Generation

| Date | Project | Comment |
| ---- | ------- | ------- |
| 25.03 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [[📑Paper]](https://arxiv.org/pdf/2503.10639) | A reasoning-guided framework for generation and editing. |
| 25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [[📑Paper]](https://arxiv.org/pdf/2502.19868) | Computes simple motion vectors with an LLM. |
| 25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [[📑Paper]](https://arxiv.org/pdf/2501.13926) | Potential Assessment Reward Model for autoregressive image generation. |
| 25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [[📑Paper]](https://arxiv.org/pdf/2501.07542) | Visualization-of-Thought. |
| 25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [[📑Paper]](https://arxiv.org/pdf/2501.05452) | Performs visual edits (e.g., drawing boxes) on the input image as intermediate reasoning steps. |
| 24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [[📑Paper]](https://arxiv.org/pdf/2412.10566) | Thinks in text space with a caption model. |

### LLM

| Date | Project | Comment |
| ---- | ------- | ------- |
| 25.04 | VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [[📑Paper]](https://arxiv.org/pdf/2504.05118) | |
| 25.03 | DAPO: An Open-Source LLM Reinforcement Learning System at Scale [[📑Paper]](https://arxiv.org/pdf/2503.14476)[[Project]](https://dapo-sia.github.io/) | DAPO-Math-17K |
| 23.02 | Multimodal Chain-of-Thought Reasoning in Language Models [[📑Paper]](https://arxiv.org/abs/2302.00923) [[🖥️Code]](https://github.com/amazon-science/mm-cot) | |

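Of the techniques DAPO introduces, dynamic sampling is the easiest to state: a prompt whose sampled responses all receive the same reward yields zero group-relative advantage and hence no gradient, so such groups are discarded and resampled until the batch is full. A hedged sketch follows; the `sample_group` interface is an assumption for illustration, not code from the DAPO release.

```python
from typing import Callable, Iterable

def dynamic_sample(prompts: Iterable[str],
                   sample_group: Callable[[str], list[tuple[str, float]]],
                   batch_size: int) -> list[tuple[str, list[tuple[str, float]]]]:
    """Keep prompt groups with non-uniform rewards until the batch is full."""
    batch = []
    for prompt in prompts:
        group = sample_group(prompt)      # [(response, reward), ...]
        rewards = [reward for _, reward in group]
        if max(rewards) > min(rewards):   # uniform rewards => zero advantage
            batch.append((prompt, group))
        if len(batch) == batch_size:
            break
    return batch
```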
## Benchmark

| Date | Project | Task |
| ---- | ------- | ---- |
| 25.03 | Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [[📑Paper]](https://arxiv.org/pdf/2503.24376) | Video understanding (perception and reasoning) |
| 25.03 | Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study [[📑Paper]](https://arxiv.org/pdf/2503.16788)[[Data]](https://github.com/LlamaTouch/VLM-Reasoning-Traces) | Static and dynamic mobile GUI benchmarks (ScreenSpot, AndroidControl, and AndroidWorld) |
| 25.03 | SCIVERSE: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems [[📑Paper]](https://arxiv.org/pdf/2503.10627) | SCIVERSE |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [[📑Paper]](https://arxiv.org/pdf/2503.06232)[[Data]](https://huggingface.co/datasets/Battam/3D-CoT) | 3D-CoT |
| 25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [[📑Paper]](https://arxiv.org/pdf/2502.00698)[[🖥️Code]](https://github.com/AceCHQ/MMIQ) | MM-IQ |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [[📑Paper]](https://arxiv.org/abs/2502.10391) | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |
| 25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [[📑Paper]](https://arxiv.org/pdf/2502.09621)[[🖥️Code]](https://github.com/CaraJ7/MME-CoT) | MME-CoT |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [[📑Paper]](https://arxiv.org/abs/2502.18411)[[🖥️Code]](https://github.com/PhoenixZ810/OmniAlign-V) | MM-AlignBench |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [[📑Paper]](https://arxiv.org/abs/2501.06186)[[🖥️Code]](https://github.com/mbzuai-oryx/LlamaV-o1) | VRC-Bench |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [[📑Paper]](https://arxiv.org/abs/2411.17451) | VLRewardBench |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [[📑Paper]](https://arxiv.org/html/2405.16473v1) | M3CoT |
## Data

| Date | Project | Comment |
| ---- | ------- | ------- |
| 24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [[📑Paper]](https://arxiv.org/abs/2411.14794)[[🖥️Code]](https://github.com/hshjerry/VideoEspresso) | Various video QA |

## Survey

| Date | Project | Comment |
| ---- | ------- | ------- |
| 25.04 | A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [[📑Paper]](https://arxiv.org/abs/2504.09037) | |
| 25.03 | A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [[📑Paper]](https://arxiv.org/abs/2503.21614) | |
| 25.03 | Aligning Multimodal LLM with Human Preference: A Survey [[📑Paper]](https://arxiv.org/abs/2503.14504) | |