├── README.md
├── asset
├── Highlights-1.png
├── Highlights-2.png
├── Highlights-3.png
├── Highlights-4.png
├── name_logo.jpg
├── results_of_question_type.png
├── results_of_question_types_0616.png
├── results_of_various_models.png
├── results_of_video_sub_type.png
├── results_of_video_type.jpg
└── sta.jpg
└── evaluation
└── output_test_template.json
/README.md:
--------------------------------------------------------------------------------
1 | # Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
2 |
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
9 |
10 |
11 |
12 |
13 |
14 | [[🍎 Project Page](https://video-mme.github.io/)] [[📖 arXiv Paper](https://arxiv.org/pdf/2405.21075)] [[📊 Dataset](https://github.com/BradyFU/Video-MME?tab=readme-ov-file#-dataset)] [[📖 MME-Survey](https://arxiv.org/pdf/2411.15296)] [[🏆 Leaderboard](https://video-mme.github.io/home_page.html#leaderboard)]
15 |
16 | Video-MME applies both to **image MLLMs**, i.e., models that generalize to multiple images, and to **video MLLMs**. 🌟
17 |
18 | We are very proud to launch [**MME-Survey**](https://arxiv.org/pdf/2411.15296) (jointly introduced by **MME**, **MMBench**, and **LLaVA** teams), a comprehensive survey on evaluation of Multimodal LLMs! 🔥🔥
19 |
20 |
21 | ---
22 |
23 | ## 🔥 News
24 | * **`2025.05.06`** 🌟 [**Gemini 2.5 Pro**](https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/) uses Video-MME as its benchmark for video understanding: "Gemini 2.5 Pro delivers state-of-the-art video understanding, scoring 84.8% on the VideoMME benchmark".
25 | * **`2025.04.14`** 🌟 Video-MME has been introduced and used by [**OpenAI GPT-4.1**](https://openai.com/index/gpt-4-1/) as an **"industry standard measure"** of long context ability.
26 | * **`2025.02.27`** 🌟 Video-MME has been accepted by CVPR 2025.
27 | * **`2024.06.15`** 🌟 We have refreshed our evaluation: 1) replaced broken or potentially broken video links and re-annotated the affected videos; 2) GPT-4o now samples 384 frames (previously 10, as reported on the website) at 512x512 resolution, boosting overall accuracy to 71.9%.
28 | * **`2024.06.03`** 🌟 We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis!
29 |
30 |
31 |
32 | ## 👀 Video-MME Overview
33 |
34 | In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements, but their potential in processing sequential visual data is still insufficiently explored. We introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. It is designed to comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities. Video-MME comprises **900 videos** totaling 254 hours and **2,700 human-annotated question-answer pairs**. Our work distinguishes itself from existing benchmarks through four key features:
35 | * *Duration in temporal dimension*. Encompassing **short- (< 2min)**, **medium- (4min\~15min)**, and **long-term (30min\~60min)** videos, ranging from **11 seconds to 1 hour**, for robust contextual dynamics;
36 | * *Diversity in video types*. Spanning **6 primary visual domains**, i.e., Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual, with **30 subfields** to ensure broad scenario generalizability;
37 | * *Breadth in data modalities*. Integrating multi-modal inputs besides video frames, including **subtitles and audio**, to assess the all-round capabilities of MLLMs;
38 | * *Quality in annotations*. **All data are newly collected and annotated by humans, not from any existing video dataset**, ensuring diversity and quality.
39 |
40 |
41 |
42 |
43 |
44 |
45 | ## 📐 Dataset Examples
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 | Click to expand more examples
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 |
62 | ## 🔍 Dataset
63 |
64 | **License**:
65 | ```
66 | Video-MME is only used for academic research. Commercial use in any form is prohibited.
67 | The copyright of all videos belongs to the video owners.
68 | If there is any infringement in Video-MME, please email videomme2024@gmail.com and we will remove it immediately.
69 | Without prior approval, you cannot distribute, publish, copy, disseminate, or modify Video-MME in whole or in part.
70 | You must strictly comply with the above restrictions.
71 | ```
72 |
73 | To obtain the dataset, please send an email to **videomme2024@gmail.com**. 🌟
74 |
75 |
76 | ## 🔮 Evaluation Pipeline
77 | 📍 **Extract Frames and Subtitles**:
78 |
79 | There are **900 videos** and **744 subtitle files** in total; all long videos have subtitles.
80 |
81 | For the setting with subtitles, you should use only the subtitles that correspond to the sampled video frames.
82 | For example, if you extract 10 frames per video for evaluation, take only the subtitle lines that correspond to the timestamps of those 10 frames.
83 |
84 | If you have already prepared the video and subtitle files, you can refer to [this script](https://github.com/look4u-ok/video-slicer) to extract the frames and corresponding subtitles.
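The linked slicer script handles this end to end. As a rough illustration of the alignment logic only, the sketch below keeps just the subtitle cues whose time spans cover a sampled frame timestamp (the helper names `parse_srt` and `subtitles_for_frames` are ours, not from the repository):

```python
import re

def parse_srt(srt_text):
    """Parse SRT text into (start_sec, end_sec, text) tuples."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\n(.*?)(?:\n\n|\Z)",
        re.S,
    )
    cues = []
    for m in pattern.finditer(srt_text):
        start = int(m[1]) * 3600 + int(m[2]) * 60 + int(m[3]) + int(m[4]) / 1000
        end = int(m[5]) * 3600 + int(m[6]) * 60 + int(m[7]) + int(m[8]) / 1000
        cues.append((start, end, m[9].replace("\n", " ").strip()))
    return cues

def subtitles_for_frames(cues, frame_times):
    """Keep only the cues whose time span covers a sampled frame timestamp."""
    picked = []
    for start, end, text in cues:
        if any(start <= t <= end for t in frame_times) and text not in picked:
            picked.append(text)
    return picked
```

The picked lines can then be joined and placed in the `[Subtitles]` slot of the prompt described below.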
85 |
86 |
87 | 📍 **Prompt**:
88 |
89 | The common prompt used in our evaluation follows this format:
90 |
91 | ```
92 | This video's subtitles are listed below:
93 | [Subtitles]
94 | Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
95 | [Question]
96 | The best answer is:
97 | ```
98 |
99 | For the subtitles-free setting, you should remove the subtitle content.
100 |
101 |
102 |
103 | Click to expand the prompt examples.
104 |
105 | * With subtitles:
106 |
107 | ```
108 | This video's subtitles are listed below:
109 | Hi guys, I'm going to show you how to perfectly prepare a ...
110 | Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
111 | What is the color of the clothing worn by the persons in the video?
112 | A. Black.
113 | B. Gray.
114 | C. Green.
115 | D. Brown.
116 | The best answer is:
117 | ```
118 |
119 | * Without subtitles:
120 | ```
121 | Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
122 | What is the color of the clothing worn by the persons in the video?
123 | A. Black.
124 | B. Gray.
125 | C. Green.
126 | D. Brown.
127 | The best answer is:
128 | ```
129 |
130 |
131 |
132 | 📍 **Evaluation**:
133 |
134 | To extract the answers and calculate the scores, we add the model responses to a JSON file. Here we provide an example template [output_test_template.json](./evaluation/output_test_template.json). Once you have prepared the model responses in this format, please refer to the evaluation script [eval_your_results.py](https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py), and you will get the accuracy scores across video durations, video domains, video sub-categories, and task types.
135 | The evaluation does not rely on any third-party models, such as ChatGPT.
136 |
137 | ```bash
138 | python eval_your_results.py \
139 | --results_file $YOUR_RESULTS_FILE \
140 | --video_duration_type $VIDEO_DURATION_TYPE \
141 | --return_categories_accuracy \
142 | --return_sub_categories_accuracy \
143 | --return_task_types_accuracy
144 | ```
145 | Please ensure that the `results_file` follows the specified JSON format stated above, and `video_duration_type` is specified as either `short`, `medium`, or `long`. If you wish to assess results across various duration types, you can specify multiple types separated by commas or organize them in a list, for example: `short,medium,long` or `["short","medium","long"]`.
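For reference, the core of the scoring can be sketched as below; this is a simplified illustration written against the template's JSON structure, not the official answer-extraction logic of `eval_your_results.py`:

```python
import re

def extract_choice(response):
    """Pull the first standalone option letter (A-D) out of a model response."""
    m = re.search(r"\b([A-D])\b", response)
    return m.group(1) if m else None

def accuracy(results):
    """Overall accuracy over the nested Video-MME result structure."""
    correct = total = 0
    for video in results:
        for q in video["questions"]:
            total += 1
            if extract_choice(q.get("response", "")) == q["answer"]:
                correct += 1
    return correct / total if total else 0.0
```

The same loop can be grouped by `duration`, `domain`, `sub_category`, or `task_type` to reproduce the per-category breakdowns reported by the official script.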
146 |
147 | 📍 **Leaderboard**:
148 |
149 | If you want to add your model to our [leaderboard](https://video-mme.github.io/home_page.html#leaderboard), please send your model responses to **bradyfu24@gmail.com** in the format of [output_test_template.json](./evaluation/output_test_template.json).
150 |
151 |
152 | ## 📈 Experimental Results
153 | - **Evaluation results of different MLLMs.**
154 |
155 |
156 |
157 |
158 |
159 |
160 | - **Evaluation results of different MLLMs across different task types.**
161 |
162 |
163 |
164 |
165 |
166 | - **Evaluation results of Gemini 1.5 Pro across different video duration types.**
167 |
168 |
169 |
170 |
171 |
172 | - **Evaluation results of Gemini 1.5 Pro across different video sub-types.**
173 |
174 |
175 |
176 |
177 |
178 |
179 | ## :black_nib: Citation
180 |
181 | If you find our work helpful for your research, please consider citing our work.
182 |
183 | ```bibtex
184 | @article{fu2024video,
185 | title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
186 | author={Fu, Chaoyou and Dai, Yuhan and Luo, Yondong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
187 | journal={arXiv preprint arXiv:2405.21075},
188 | year={2024}
189 | }
190 |
191 | @article{fu2023mme,
192 | title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
193 | author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
194 | journal={arXiv preprint arXiv:2306.13394},
195 | year={2023}
196 | }
197 |
198 | @article{fu2024mme,
199 | title={MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs},
200 | author={Fu, Chaoyou and Zhang, Yi-Fan and Yin, Shukang and Li, Bo and Fang, Xinyu and Zhao, Sirui and Duan, Haodong and Sun, Xing and Liu, Ziwei and Wang, Liang and others},
201 | journal={arXiv preprint arXiv:2411.15296},
202 | year={2024}
203 | }
204 |
205 | @article{zhang2024mme,
206 | title={MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?},
207 | author={Zhang, Yi-Fan and Zhang, Huanyu and Tian, Haochen and Fu, Chaoyou and Zhang, Shuangqing and Wu, Junfei and Li, Feng and Wang, Kun and Wen, Qingsong and Zhang, Zhang and others},
208 | journal={arXiv preprint arXiv:2408.13257},
209 | year={2024}
210 | }
211 | ```
212 |
213 | ## 📜 Related Works
214 |
215 | Explore our related research:
216 | - **[MME-Survey]** [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/pdf/2411.15296)
217 | - **[MME]** [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://arxiv.org/pdf/2306.13394)
218 | - **[MME-RealWorld]** [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://arxiv.org/pdf/2408.13257)
219 | - **[Awesome-MLLM]** [A Survey on Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)
220 |
221 |
--------------------------------------------------------------------------------
/asset/Highlights-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-1.png
--------------------------------------------------------------------------------
/asset/Highlights-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-2.png
--------------------------------------------------------------------------------
/asset/Highlights-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-3.png
--------------------------------------------------------------------------------
/asset/Highlights-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-4.png
--------------------------------------------------------------------------------
/asset/name_logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/name_logo.jpg
--------------------------------------------------------------------------------
/asset/results_of_question_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_question_type.png
--------------------------------------------------------------------------------
/asset/results_of_question_types_0616.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_question_types_0616.png
--------------------------------------------------------------------------------
/asset/results_of_various_models.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_various_models.png
--------------------------------------------------------------------------------
/asset/results_of_video_sub_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_video_sub_type.png
--------------------------------------------------------------------------------
/asset/results_of_video_type.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_video_type.jpg
--------------------------------------------------------------------------------
/asset/sta.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/sta.jpg
--------------------------------------------------------------------------------
/evaluation/output_test_template.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "video_id": "001",
4 | "duration": "short",
5 | "domain": "Knowledge",
6 | "sub_category": "Humanity & History",
7 | "questions": [
8 | {
9 | "question_id": "001-1",
10 | "task_type": "Counting Problem",
11 | "question": "When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?",
12 | "options": [
13 | "A. Apples.",
14 | "B. Candles.",
15 | "C. Berries.",
16 | "D. The three kinds are of the same number."
17 | ],
18 | "answer": "C",
19 | "response": "C. Berries.",
20 | },
21 | {
22 | "question_id": "001-2",
23 | "task_type": "Information Synopsis",
24 | "question": "What is the genre of this video?",
25 | "options": [
26 | "A. It is a news report that introduces the history behind Christmas decorations.",
27 | "B. It is a documentary on the evolution of Christmas holiday recipes.",
28 | "C. It is a travel vlog exploring Christmas markets around the world.",
29 | "D. It is a tutorial on DIY Christmas ornament crafting."
30 | ],
31 | "answer": "A",
32 | "response": "D.",
33 | },
34 | {
35 | "question_id": "001-3",
36 | "task_type": "Counting Problem",
37 | "question": "How many red socks are above the fireplace at the end of this video?",
38 | "options": [
39 | "A. 1.",
40 | "B. 4.",
41 | "C. 2.",
42 | "D. 3."
43 | ],
44 | "answer": "D",
45 | "response": "D. 3",
46 | }
47 | ]
48 | },
49 | {
50 | "video_id": "002",
51 | "duration": "short",
52 | "domain": "Knowledge",
53 | "sub_category": "Humanity & History",
54 | "questions": [
55 | {
56 | "question_id": "002-1",
57 | "task_type": "Object Recognition",
58 | "question": "Which of the following features/items is not discussed in the video in relation to the tomb?",
59 | "options": [
60 | "A. Inkstone.",
61 | "B. Niche.",
62 | "C. Jade.",
63 | "D. Sacrificial table."
64 | ],
65 | "answer": "C",
66 | "response": "Answer: C. Jade.",
67 | },
68 | {
69 | "question_id": "002-2",
70 | "task_type": "Action Reasoning",
71 | "question": "Which of the following reasons motivated the archaeologists to excavate the tomb?",
72 | "options": [
73 | "A. Because it's from Ming Dynasty and of specific archaeological significance.",
74 | "B. Because a new railway line will be built nearby.",
75 | "C. Because there were treasures inside the tomb.",
76 | "D. Highway realignment."
77 | ],
78 | "answer": "D",
79 | "response": "D",
80 | },
81 | {
82 | "question_id": "002-3",
83 | "task_type": "Counting Problem",
84 | "question": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?",
85 | "options": [
86 | "A. 4.",
87 | "B. 9.",
88 | "C. 5.",
89 | "D. 13."
90 | ],
91 | "answer": "B",
92 | "response": "D. 13",
93 | }
94 | ]
95 | },
96 | {
97 | "video_id": "003",
98 | "duration": "short",
99 | "domain": "Knowledge",
100 | "sub_category": "Humanity & History",
101 | "questions": [
102 | {
103 | "question_id": "003-1",
104 | "task_type": "Counting Problem",
105 | "question": "How many national flags appear in the video?",
106 | "options": [
107 | "A. 3.",
108 | "B. 4.",
109 | "C. 2.",
110 | "D. 5."
111 | ],
112 | "answer": "B",
113 | "response": "B",
114 | },
115 | {
116 | "question_id": "003-2",
117 | "task_type": "Object Recognition",
118 | "question": "What is the video telling when the burger placed in the upper right corner at the end of the video first appears?",
119 | "options": [
120 | "A. Beef with spices came from Russia to Germany.",
121 | "B. The steak began to be sandwiched between two pieces of bread.",
122 | "C. Steak burgers spread throughout the United States.",
123 | "D. The standardization of hamburgers."
124 | ],
125 | "answer": "C",
126 | "response": "C.",
127 | },
128 | {
129 | "question_id": "003-3",
130 | "task_type": "Object Reasoning",
131 | "question": "In which country is the food featured in the video recognized worldwide?",
132 | "options": [
133 | "A. Mongolia.",
134 | "B. Russia.",
135 | "C. Germany.",
136 | "D. United States."
137 | ],
138 | "answer": "D",
139 | "response": "D. United States.",
140 | }
141 | ]
142 | }
143 | ]
144 |
--------------------------------------------------------------------------------