├── README.md
├── asset
│   ├── Highlights-1.png
│   ├── Highlights-2.png
│   ├── Highlights-3.png
│   ├── Highlights-4.png
│   ├── name_logo.jpg
│   ├── results_of_question_type.png
│   ├── results_of_question_types_0616.png
│   ├── results_of_various_models.png
│   ├── results_of_video_sub_type.png
│   ├── results_of_video_type.jpg
│   └── sta.jpg
└── evaluation
    └── output_test_template.json

/README.md:
--------------------------------------------------------------------------------

# Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

![VideoQA](https://img.shields.io/badge/Task-VideoQA-red)
![Multi-Modal](https://img.shields.io/badge/Task-Multi--Modal-red)
![Video-MME](https://img.shields.io/badge/Dataset-Video--MME-blue)
![Gemini](https://img.shields.io/badge/Model-Gemini-green)
![GPT-4V](https://img.shields.io/badge/Model-GPT--4V-green)
![GPT-4o](https://img.shields.io/badge/Model-GPT--4o-green)

![Video-MME](./asset/name_logo.jpg)

[[🍎 Project Page](https://video-mme.github.io/)] [[📖 arXiv Paper](https://arxiv.org/pdf/2405.21075)] [[📊 Dataset](https://github.com/BradyFU/Video-MME?tab=readme-ov-file#-dataset)] [[📖 MME-Survey](https://arxiv.org/pdf/2411.15296)] [[🏆 Leaderboard](https://video-mme.github.io/home_page.html#leaderboard)]
Video-MME applies to both **image MLLMs**, i.e., models that generalize to multiple images, and **video MLLMs**. 🌟

We are very proud to launch [**MME-Survey**](https://arxiv.org/pdf/2411.15296) (jointly introduced by the **MME**, **MMBench**, and **LLaVA** teams), a comprehensive survey on the evaluation of Multimodal LLMs! 🔥🔥

---

## 🔥 News
* **`2025.05.06`** 🌟 [**Gemini 2.5 Pro**](https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/) uses Video-MME as its benchmark for video understanding: "Gemini 2.5 Pro delivers state-of-the-art video understanding, scoring 84.8% on the VideoMME benchmark".
* **`2025.04.14`** 🌟 Video-MME has been introduced and used by [**OpenAI GPT-4.1**](https://openai.com/index/gpt-4-1/) as an **"industry standard measure"** of long-context ability.
* **`2025.02.27`** 🌟 Video-MME has been accepted by CVPR 2025.
* **`2024.06.15`** 🌟 We have refreshed our evaluation: 1) we replaced broken and potentially broken video links and re-annotated the affected videos; 2) GPT-4o now samples 384 frames (previously 10, as reported on the website) at 512x512 resolution, boosting its overall accuracy to 71.9%.
* **`2024.06.03`** 🌟 We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in video analysis!

## 👀 Video-MME Overview

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point of recent advances, but their potential in processing sequential visual data is still insufficiently explored. We introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in video analysis. It is designed to comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities. Video-MME comprises **900 videos** totaling 254 hours, with **2,700 human-annotated question-answer pairs**. Our work is distinguished from existing benchmarks by four key features:
* *Duration in the temporal dimension*. Encompassing **short- (< 2min)**, **medium- (4min~15min)**, and **long-term (30min~60min)** videos, ranging from **11 seconds to 1 hour**, for robust contextual dynamics;
* *Diversity in video types*. Spanning **6 primary visual domains**, i.e., Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual, with **30 subfields** to ensure broad scenario generalizability;
* *Breadth in data modalities*. Integrating multi-modal inputs besides video frames, including **subtitles and audio**, to assess the all-round capabilities of MLLMs;
* *Quality in annotations*. **All data are newly collected and annotated by humans, not drawn from any existing video dataset**, ensuring diversity and quality.

![Benchmark statistics](./asset/sta.jpg)

## 📐 Dataset Examples

![Dataset examples](./asset/Highlights-1.png)

<details>
<summary>Click to expand more examples</summary>

![More dataset examples](./asset/Highlights-2.png)
![More dataset examples](./asset/Highlights-3.png)
![More dataset examples](./asset/Highlights-4.png)

</details>
## 🔍 Dataset

**License**:
```
Video-MME is only used for academic research. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in Video-MME, please email videomme2024@gmail.com and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify Video-MME in whole or in part.
You must strictly comply with the above restrictions.
```

To access the dataset, please send an email to **videomme2024@gmail.com**. 🌟

## 🔮 Evaluation Pipeline
📍 **Extract Frames and Subtitles**:

There are **900 videos** and **744 subtitle files** in total; all long videos have subtitles.

For the setting with subtitles, you should only use the subtitles corresponding to the sampled video frames.
For example, if you extract 10 frames per video for evaluation, take the subtitle lines that correspond to the timestamps of those 10 frames (a minimal sketch of this alignment is given after the prompt examples below).

If you have already prepared the video and subtitle files, you can refer to [this script](https://github.com/look4u-ok/video-slicer) to extract the frames and corresponding subtitles.

📍 **Prompt**:

The common prompt used in our evaluation follows this format:

```
This video's subtitles are listed below:
[Subtitles]
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
[Question]
The best answer is:
```

For the subtitle-free setting, simply remove the subtitle content.
<details>
<summary>Click to expand the prompt examples.</summary>

* With subtitles:

```
This video's subtitles are listed below:
Hi guys, I'm going to show you how to perfectly prepare a ...
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
```

* Without subtitles:

```
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
```

</details>
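Putting the two steps above together, the frame/subtitle alignment and prompt construction can be sketched in Python as follows. This is a minimal illustration, not the official tooling: the SRT parsing, the uniform sampling strategy, and the helper names `sample_subtitles` and `build_prompt` are assumptions for demonstration only.

```python
import re

def parse_srt(path):
    """Parse an SRT subtitle file into a list of (start_seconds, text) cues."""
    def to_seconds(ts):  # "HH:MM:SS,mmm" -> float seconds
        h, m, s = ts.replace(",", ".").split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    cues = []
    # Each SRT block: an index line, a "start --> end" line, then text lines.
    for block in re.split(r"\n\s*\n", open(path, encoding="utf-8").read().strip()):
        lines = block.strip().splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            start = to_seconds(lines[1].split("-->")[0].strip())
            cues.append((start, " ".join(lines[2:])))
    return cues

def sample_subtitles(cues, video_duration, num_frames=10):
    """Keep one cue per uniformly sampled frame timestamp (nearest start time)."""
    picked = []
    for i in range(num_frames):
        t = video_duration * (i + 0.5) / num_frames  # frame timestamp in seconds
        if cues:
            _, text = min(cues, key=lambda c: abs(c[0] - t))
            if text not in picked:  # avoid duplicates when frames are close together
                picked.append(text)
    return picked

def build_prompt(question, options, subtitles=None):
    """Assemble the evaluation prompt, with or without the subtitle block."""
    header = ""
    if subtitles:
        header = "This video's subtitles are listed below:\n" + "\n".join(subtitles) + "\n"
    return (
        header
        + "Select the best answer to the following multiple-choice question "
        + "based on the video. Respond with only the letter (A, B, C, or D) "
        + "of the correct option.\n"
        + question + "\n" + "\n".join(options) + "\nThe best answer is:"
    )
```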
📍 **Evaluation**:

To extract the answers and calculate the scores, we add the model responses to a JSON file. We provide an example template, [output_test_template.json](./evaluation/output_test_template.json). Once you have prepared the model responses in this format, please refer to the evaluation script [eval_your_results.py](https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py), and you will get the accuracy scores across video durations, video domains, video sub-categories, and task types.
The evaluation does not involve any third-party models, such as ChatGPT.

```bash
python eval_your_results.py \
    --results_file $YOUR_RESULTS_FILE \
    --video_duration_type $VIDEO_DURATION_TYPE \
    --return_categories_accuracy \
    --return_sub_categories_accuracy \
    --return_task_types_accuracy
```
Please ensure that the `results_file` follows the JSON format of the template above, and that `video_duration_type` is one of `short`, `medium`, or `long`. If you wish to assess results across multiple duration types, you can specify several types separated by commas or organize them in a list, for example: `short,medium,long` or `["short","medium","long"]`.
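Before running the official script, you can sanity-check a results file with a rough re-implementation of the scoring logic implied by the template. This sketch is an assumption about how answers are parsed (the real `eval_your_results.py` may extract letters differently) and only reports overall and per-task-type accuracy:

```python
import json
import re
from collections import defaultdict

def extract_choice(response):
    """Pull the first standalone option letter (A-D) out of a model response."""
    m = re.search(r"\b([A-D])\b", response)
    return m.group(1) if m else None

def score(results_file, duration_types=("short", "medium", "long")):
    """Compute overall and per-task-type accuracy from the template-format JSON."""
    data = json.load(open(results_file, encoding="utf-8"))
    correct, total = 0, 0
    by_task = defaultdict(lambda: [0, 0])  # task_type -> [correct, total]
    for video in data:
        if video["duration"] not in duration_types:
            continue
        for q in video["questions"]:
            total += 1
            by_task[q["task_type"]][1] += 1
            if extract_choice(q["response"]) == q["answer"]:
                correct += 1
                by_task[q["task_type"]][0] += 1
    print(f"Overall accuracy: {100 * correct / max(total, 1):.1f}%")
    for task, (c, t) in sorted(by_task.items()):
        print(f"  {task}: {100 * c / t:.1f}% ({c}/{t})")

if __name__ == "__main__":
    score("evaluation/output_test_template.json", duration_types=("short",))
```

The same loop extends naturally to per-domain and per-sub-category accuracy by keying on the `domain` and `sub_category` fields.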

📍 **Leaderboard**:

If you want to add your model to our [leaderboard](https://video-mme.github.io/home_page.html#leaderboard), please send the model responses to **bradyfu24@gmail.com**, following the format of [output_test_template.json](./evaluation/output_test_template.json).

## 📈 Experimental Results
- **Evaluation results of different MLLMs.**

![Results of various models](./asset/results_of_various_models.png)

- **Evaluation results of different MLLMs across different task types.**

![Results across task types](./asset/results_of_question_types_0616.png)

- **Evaluation results of Gemini 1.5 Pro across different video duration types.**

![Results across video duration types](./asset/results_of_video_type.jpg)

- **Evaluation results of Gemini 1.5 Pro across different video sub-types.**

![Results across video sub-types](./asset/results_of_video_sub_type.png)

## :black_nib: Citation

If you find our work helpful for your research, please consider citing it.

```bibtex
@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yondong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{fu2024mme,
  title={MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs},
  author={Fu, Chaoyou and Zhang, Yi-Fan and Yin, Shukang and Li, Bo and Fang, Xinyu and Zhao, Sirui and Duan, Haodong and Sun, Xing and Liu, Ziwei and Wang, Liang and others},
  journal={arXiv preprint arXiv:2411.15296},
  year={2024}
}

@article{zhang2024mme,
  title={MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?},
  author={Zhang, Yi-Fan and Zhang, Huanyu and Tian, Haochen and Fu, Chaoyou and Zhang, Shuangqing and Wu, Junfei and Li, Feng and Wang, Kun and Wen, Qingsong and Zhang, Zhang and others},
  journal={arXiv preprint arXiv:2408.13257},
  year={2024}
}
```

## 📜 Related Works

Explore our related research:
- **[MME-Survey]** [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/pdf/2411.15296)
- **[MME]** [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://arxiv.org/pdf/2306.13394)
- **[MME-RealWorld]** [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https://arxiv.org/pdf/2408.13257)
- **[Awesome-MLLM]** [A Survey on Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)

--------------------------------------------------------------------------------
/asset/Highlights-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-1.png
--------------------------------------------------------------------------------
/asset/Highlights-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-2.png
--------------------------------------------------------------------------------
/asset/Highlights-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-3.png
--------------------------------------------------------------------------------
/asset/Highlights-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/Highlights-4.png
--------------------------------------------------------------------------------
/asset/name_logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/name_logo.jpg
--------------------------------------------------------------------------------
/asset/results_of_question_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_question_type.png
--------------------------------------------------------------------------------
/asset/results_of_question_types_0616.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_question_types_0616.png
--------------------------------------------------------------------------------
/asset/results_of_various_models.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_various_models.png
--------------------------------------------------------------------------------
/asset/results_of_video_sub_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_video_sub_type.png
--------------------------------------------------------------------------------
/asset/results_of_video_type.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/results_of_video_type.jpg
--------------------------------------------------------------------------------
/asset/sta.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MME-Benchmarks/Video-MME/8889f313a4b9e9480611d65daa505aab2ef63808/asset/sta.jpg
--------------------------------------------------------------------------------
/evaluation/output_test_template.json:
--------------------------------------------------------------------------------
[
    {
        "video_id": "001",
        "duration": "short",
        "domain": "Knowledge",
        "sub_category": "Humanity & History",
        "questions": [
            {
                "question_id": "001-1",
                "task_type": "Counting Problem",
                "question": "When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?",
                "options": [
                    "A. Apples.",
                    "B. Candles.",
                    "C. Berries.",
                    "D. The three kinds are of the same number."
                ],
                "answer": "C",
                "response": "C. Berries."
            },
            {
                "question_id": "001-2",
                "task_type": "Information Synopsis",
                "question": "What is the genre of this video?",
                "options": [
                    "A. It is a news report that introduces the history behind Christmas decorations.",
                    "B. It is a documentary on the evolution of Christmas holiday recipes.",
                    "C. It is a travel vlog exploring Christmas markets around the world.",
                    "D. It is a tutorial on DIY Christmas ornament crafting."
                ],
                "answer": "A",
                "response": "D."
            },
            {
                "question_id": "001-3",
                "task_type": "Counting Problem",
                "question": "How many red socks are above the fireplace at the end of this video?",
                "options": [
                    "A. 1.",
                    "B. 4.",
                    "C. 2.",
                    "D. 3."
                ],
                "answer": "D",
                "response": "D. 3"
            }
        ]
    },
    {
        "video_id": "002",
        "duration": "short",
        "domain": "Knowledge",
        "sub_category": "Humanity & History",
        "questions": [
            {
                "question_id": "002-1",
                "task_type": "Object Recognition",
                "question": "Which of the following features/items is not discussed in the video in relation to the tomb?",
                "options": [
                    "A. Inkstone.",
                    "B. Niche.",
                    "C. Jade.",
                    "D. Sacrificial table."
                ],
                "answer": "C",
                "response": "Answer: C. Jade."
            },
            {
                "question_id": "002-2",
                "task_type": "Action Reasoning",
                "question": "Which of the following reasons motivated the archaeologists to excavate the tomb?",
                "options": [
                    "A. Because it's from Ming Dynasty and of specific archaeological significance.",
                    "B. Because a new railway line will be built nearby.",
                    "C. Because there were treasures inside the tomb.",
                    "D. Highway realignment."
                ],
                "answer": "D",
                "response": "D"
            },
            {
                "question_id": "002-3",
                "task_type": "Counting Problem",
                "question": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?",
                "options": [
                    "A. 4.",
                    "B. 9.",
                    "C. 5.",
                    "D. 13."
                ],
                "answer": "B",
                "response": "D. 13"
            }
        ]
    },
    {
        "video_id": "003",
        "duration": "short",
        "domain": "Knowledge",
        "sub_category": "Humanity & History",
        "questions": [
            {
                "question_id": "003-1",
                "task_type": "Counting Problem",
                "question": "How many national flags appear in the video?",
                "options": [
                    "A. 3.",
                    "B. 4.",
                    "C. 2.",
                    "D. 5."
                ],
                "answer": "B",
                "response": "B"
            },
            {
                "question_id": "003-2",
                "task_type": "Object Recognition",
                "question": "What is the video telling when the burger placed in the upper right corner at the end of the video first appears?",
                "options": [
                    "A. Beef with spices came from Russia to Germany.",
                    "B. The steak began to be sandwiched between two pieces of bread.",
                    "C. Steak burgers spread throughout the United States.",
                    "D. The standardization of hamburgers."
                ],
                "answer": "C",
                "response": "C."
            },
            {
                "question_id": "003-3",
                "task_type": "Object Reasoning",
                "question": "In which country is the food featured in the video recognized worldwide?",
                "options": [
                    "A. Mongolia.",
                    "B. Russia.",
                    "C. Germany.",
                    "D. United States."
                ],
                "answer": "D",
                "response": "D. United States."
            }
        ]
    }
]
--------------------------------------------------------------------------------