├── README.md └── eval ├── .gitignore ├── README.md └── close_ended.py /README.md: -------------------------------------------------------------------------------- 1 | # VideoEspresso 2 | \[[Paper](https://arxiv.org/abs/2411.14794)] [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)] [[Train Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)] 3 | 4 | ## News: 5 | [2025/4/12] 🔥 We fixed the video version training annotation. 6 | 7 | [2025/4/4] 🔥 This paper was accepted as an **oral** presentation at CVPR'25! 8 | 9 | [2025/3/29] 🔥 The training set (video version) has been updated! [[Train Set (video)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)] 10 | 11 | [2025/3/24] 🔥 The training set (multi-image version) has been updated! [[Train Set (multi-image)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_multi_image)] 12 | 13 | [2025/2/27] 🔥 This paper has been accepted by CVPR'25! 14 | 15 | [2025/1/16] 🔥 The close-ended Leaderboard has been updated! 16 | 17 | [2024/12/17] 🔥 The close-ended benchmark has been updated! [[Close-Ended Evaluation](https://github.com/hshjerry/VideoEspresso/tree/main/eval)] 18 | 19 | [2024/12/16] 🔥 The test set has been released! Please check our huggingface repo. [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)] 20 | 21 | ## Leaderboard 22 | 23 | | Model | Params | Frames | Overall | Narrative Analysis | Event Dynamic | Preparation Steps | Causal Analysis | Theme Analysis | Contextual Analysis | Influence Analysis | Role Analysis | Interaction Analysis | Behavior Analysis | Emotion Analysis | Cooking Process | Traffic Analysis | Situation Analysis | 24 | |-----------------------|--------|--------|---------|---------------------|------------------------|------------------------------------------|-----------------|----------------|---------------------|--------------------|---------------|----------------------|------------------|------------------|----------------|------------------|--------------------| 25 | | **LLaVA-Video** | 72B | 64 | 66.3% | 68.4% | 66.2% | 74.5% | 62.7% | 62.3% | 71.6% | 62.5% | 63.5% | 67.7% | 63.2% | 60.0% | 75.5% | 76.7% | 74.0% | 26 | | **LLaVA-OneVision** | 72B | 64 | 63.2% | 76.0% | 61.8% | 71.4% | 57.5% | 62.3% | 68.8% | 62.5% | 55.6% | 58.1% | 56.1% | 63.1% | 77.4% | 70.0% | 74.0% | 27 | | **InternVL2.5** | 38B | 16 | 59.9% | 65.8% | 54.1% | 66.3% | 57.3% | 55.7% | 63.3% | 56.9% | 54.0% | 53.2% | 63.2% | 60.0% | 73.6% | 70.0% | 72.0% | 28 | | **gemini-1.5-pro** | - | 128 | 44.2% | 55.7% | 42.0% | 50.0% | 41.3% | 34.4% | 53.2% | 29.2% | 39.7% | 40.3% | 38.6% | 47.7% | 58.5% | 50.0% | 54.0% | 29 | | **Kangaroo** | 8B | 64 | 44.1% | 41.8% | 43.3% | 49.0% | 42.7% | 34.4% | 44.0% | 61.1% | 52.4% | 41.9% | 33.3% | 38.5% | 52.8% | 53.3% | 38.0% | 30 | | **Qwen-Max** | - | 4 | 42.7% | 44.3% | 35.7% | 45.9% | 39.7% | 44.3% | 54.1% | 43.1% | 47.6% | 35.5% | 45.6% | 41.5% | 49.1% | 46.7% | 46.0% | 31 | | **gemini-1.5-flash** | - | 128 | 39.8% | 59.5% | 45.2% | 38.8% | 34.7% | 32.8% | 45.9% | 30.6% | 42.9% | 43.6% | 33.3% | 38.5% | 41.5% | 36.7% | 46.0% | 32 | | **LongVA** | 7B | 128 | 39.7% | 40.5% | 33.8% | 43.9% | 35.9% | 42.6% | 42.2% | 51.4% | 47.6% | 40.3% | 35.1% | 32.3% | 39.6% | 56.7% | 48.0% | 33 | | **Qwen-VL-Chat** | 7B | 24 | 36.2% | 49.4% | 28.7% | 35.7% | 32.4% | 44.3% | 39.5% | 47.2% | 31.8% | 30.7% | 40.4% | 36.9% | 34.0% | 43.3% | 44.0% | 34 | | **VideoChat2-Mistral**| 7B | 16 | 32.1% | 31.7% | 28.7% | 27.6% | 
34.3% | 36.1% | 27.5% | 31.9% | 31.8% | 43.6% | 28.1% | 38.5% | 20.8% | 36.7% | 30.0% | 35 | | **Chat-UniVi-v1.5** | 7B | 64 | 25.5% | 24.1% | 22.9% | 21.4% | 24.2% | 27.9% | 30.3% | 30.6% | 25.4% | 27.4% | 22.8% | 30.8% | 18.9% | 36.7% | 28.0% | 36 | | **SliME** | 8B | 64 | 24.8% | 19.0% | 24.2% | 26.5% | 27.0% | 19.7% | 21.1% | 30.6% | 28.6% | 29.0% | 19.3% | 21.5% | 30.2% | 20.0% | 16.0% | 37 | | **Video-XL** | 7B | 64 | 24.6% | 25.3% | 28.0% | 22.5% | 26.5% | 23.0% | 21.1% | 26.4% | 20.6% | 27.4% | 28.1% | 18.5% | 13.2% | 36.7% | 18.0% | 38 | | **Long-LLava** | 7B | 64 | 13.8% | 8.9% | 16.6% | 19.4% | 13.9% | 16.4% | 12.8% | 13.9% | 14.3% | 12.9% | 1.8% | 29.2% | 7.6% | 3.3% | 8.0% | 39 | | **ShareGPT4Video** | 8B | 16 | 8.0% | 8.9% | 10.8% | 12.2% | 8.0% | 11.5% | 8.3% | 6.9% | 7.9% | 8.1% | 0.0% | 7.7% | 3.8% | 3.3% | 4.0% | 40 | 41 | ### How You Can Participate: 42 | - **Use our benchmark**: Feel free to test your models using our benchmark and share your results. 43 | - **Submit checkpoints**: Alternatively, you can provide your model checkpoints, and we will evaluate them and update the leaderboard for you. 44 | 45 | We look forward to your participation and contributions! 🌟 46 | 47 | 48 | ## Overall View: 49 |


52 | 53 | **Contact Us** 📧 54 | If you have any questions or want to submit your checkpoints, feel free to reach out to us via email: 55 | 56 | - [hshjerry0315@gmail.com](mailto:hshjerry0315@gmail.com) 57 | - [aaron.weihuang@gmail.com](mailto:aaron.weihuang@gmail.com) 58 | 59 | ## Citation: 60 | ``` 61 | @article{han2024videoespresso, 62 | title={VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection}, 63 | author={Han, Songhao and Huang, Wei and Shi, Hairong and Zhuo, Le and Su, Xiu and Zhang, Shifeng and Zhou, Xu and Qi, Xiaojuan and Liao, Yue and Liu, Si}, 64 | journal={arXiv preprint arXiv:2411.14794}, 65 | year={2024} 66 | } 67 | ``` 68 | -------------------------------------------------------------------------------- /eval/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /eval/README.md: -------------------------------------------------------------------------------- 1 | # Close-Ended Evaluation Guidance 2 | 3 | --- 4 | 5 | ## 1. Prerequisites 6 | 7 | Before running the script, ensure the following dependencies are installed: 8 | 9 | ### Required Dependencies 10 | - **Python Version**: 3.8 or higher 11 | - **Libraries**: 12 | - `json` 13 | - `os` 14 | - `tqdm` 15 | - `PIL` (Pillow library) 16 | - `decord` 17 | - `torch` 18 | - `numpy` 19 | - `argparse` 20 | 21 | ### Installation 22 | Use the following command to install most dependencies: 23 | 24 | ```bash 25 | pip install tqdm pillow decord torch numpy 26 | ``` 27 | 28 | ### Model-Specific Setup 29 | The script uses **longva** as an example for model configuration. To use your own model, you must replace the **longva**-specific components with your own model loading and processing code (explained in Section 6). 30 | 31 | --- 32 | 33 | ## 2. Script Overview 34 | 35 | The primary purpose of this script is to: 36 | 1. Load a pretrained video inference model. 37 | 2. Read input data from a specified JSON file. 38 | 3. Perform inference on each video and question. 39 | 4. Save inference results to a new JSON file. 40 | 41 | --- 42 | 43 | ## 3. Running the Script 44 | 45 | ### Command to Run 46 | You need to specify key parameters such as the model path and JSON file path when running the script. Here's an example command: 47 | 48 | ```bash 49 | python close_ended.py -mp /path/to/your_model -sn /path/to/save.json -we True -jp /path/to/bench_final.json 50 | ``` 51 | 52 | ### Parameter Details 53 | 54 | | Parameter | Short Form | Default Value | Type | Description | 55 | |--------------------|------------|------------------------|---------|-----------------------------------------------------------------------| 56 | | `--model_path` | `-mp` | Empty string (Required) | `str` | Path to your model or checkpoint. | 57 | | `--save_path` | `-sn` | Empty string (Required) | `str` | Path to save the inference results. | 58 | | `--with_evidence` | `-we` | `False` | `bool` | Whether to include evidence in the query. | 59 | | `--json_path` | `-jp` | `bench_final.json` | `str` | Path to the input JSON file containing video data and questions. | 60 | 61 | --- 62 | 63 | ## 4. 
Input JSON Structure 64 | 65 | The input JSON file should contain data in the following format: 66 | 67 | ```json 68 | [ 69 | { 70 | "video_path": "path/to/video.mp4", 71 | "question": "What is happening in the video?", 72 | "options": [ 73 | "(A): Option 1", 74 | "(B): Option 2", 75 | "(C): Option 3", 76 | "(D): Option 4" 77 | ], 78 | "correct_answer": "(A)", 79 | "evidence": "Some evidence information", 80 | "task": "Description of the task" 81 | } 82 | ] 83 | ``` 84 | 85 | ### Field Descriptions 86 | - **`video_path`**: Path to the video file. 87 | - **`question`**: Question related to the video. 88 | - **`options`**: Multiple-choice options. 89 | - **`correct_answer`**: Correct answer (optional). 90 | - **`evidence`**: Evidence to assist inference (optional). 91 | - **`task`**: Description of the task. 92 | 93 | --- 94 | 95 | ## 5. Output Results 96 | 97 | The script saves the results to the file specified by `--save_path`. The output JSON will include the model's predictions in the following format: 98 | 99 | ```json 100 | [ 101 | { 102 | "video_path": "path/to/video.mp4", 103 | "question": "What is happening in the video?", 104 | "options": [ 105 | "(A): Option 1", 106 | "(B): Option 2", 107 | "(C): Option 3", 108 | "(D): Option 4" 109 | ], 110 | "correct_answer": "(A)", 111 | "evidence": "Some evidence information", 112 | "task": "Description of the task", 113 | "model_output": "(A)" 114 | } 115 | ] 116 | ``` 117 | 118 | --- 119 | 120 | ## 6. Adapting the Model Configuration 121 | 122 | The provided script uses **longva** as an example for model configuration. If you want to use your own model, follow these steps to adapt the script. 123 | 124 | ### Replace `longva` Model Loading 125 | In the script, the following code is used to load the **longva** model: 126 | 127 | ```python 128 | from longva.model.builder import load_pretrained_model 129 | 130 | tokenizer, model, image_processor, _ = load_pretrained_model( 131 | model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2" 132 | ) 133 | model.eval() 134 | ``` 135 | 136 | To adapt this to your own model: 137 | 1. Import your model loading method (e.g., `from my_model_library import MyModel`). 138 | 2. Replace the `load_pretrained_model` function with your own logic to load the model, tokenizer, and any required preprocessing utilities. 139 | 140 | For example: 141 | 142 | ```python 143 | from my_model_library import MyModel, MyTokenizer, MyImageProcessor 144 | 145 | # Load your model, tokenizer, and image processor 146 | tokenizer = MyTokenizer.from_pretrained(model_path) 147 | model = MyModel.from_pretrained(model_path).to("cuda") # or "cpu" 148 | image_processor = MyImageProcessor() 149 | model.eval() 150 | ``` 151 | 152 | ### Update Video Frame Preprocessing 153 | The **longva** script uses an image processor for frames: 154 | 155 | ```python 156 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16) 157 | ``` 158 | 159 | To adapt this part: 160 | - Replace `image_processor.preprocess` with your model's expected preprocessing steps. 161 | - Ensure the video frames are converted into tensors compatible with your model. 
162 | 163 | For example: 164 | 165 | ```python 166 | video_tensor = torch.tensor(frames).float().permute(0, 3, 1, 2) # Convert frames to tensors 167 | video_tensor = video_tensor.to("cuda" if torch.cuda.is_available() else "cpu") # Move to device 168 | ``` 169 | 170 | ### Update Inference Logic 171 | The **longva** script generates predictions with: 172 | 173 | ```python 174 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs) 175 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() 176 | ``` 177 | 178 | To adapt this: 179 | - Replace `model.generate` with your model's inference method. 180 | - Update how the output is decoded into text. 181 | 182 | For example: 183 | 184 | ```python 185 | # Perform inference using your model 186 | with torch.no_grad(): 187 | outputs = model(video_tensor, input_ids) 188 | 189 | # Decode the output if necessary 190 | decoded_output = tokenizer.decode(outputs, skip_special_tokens=True) 191 | ``` 192 | 193 | --- 194 | 195 | ## 7. Key Functionalities in the Script 196 | 197 | ### 1. Video Frame Sampling 198 | The script uses `decord` to load and uniformly sample frames from the video: 199 | 200 | ```python 201 | vr = VideoReader(video_path, ctx=cpu(0)) 202 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, frame_input, dtype=int) 203 | frames = vr.get_batch(frame_idx).asnumpy() 204 | ``` 205 | 206 | Adapt this part if your model requires a different frame sampling strategy. 207 | 208 | ### 2. Inference Query Construction 209 | The script dynamically constructs inference queries based on the input question, task, and options: 210 | 211 | ```python 212 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 213 | ``` 214 | 215 | Customize this to align with your model's expected input format. 216 | 217 | --- 218 | 219 | ## 8. Notes 220 | 221 | 1. **Device Support**: The model defaults to using a GPU. If no GPU is available, the script will use a CPU, which may be slower. 222 | 2. **Video Format**: Ensure your video files are compatible with the `decord` library or your chosen video processing library. 223 | 3. **Error Handling**: If a video cannot be processed, the script may skip it. Add error-handling logic as needed. 224 | 225 | --- 226 | 227 | ## 9. Example Workflow 228 | 229 | Here’s a general workflow for adapting this script to your own model: 230 | 231 | 1. **Replace Model Loading**: Update the imports and logic to load your custom model, tokenizer, and processor. 232 | 2. **Adapt Preprocessing**: Adjust the preprocessing pipeline for video frames to match your model's requirements. 233 | 3. **Modify Inference Logic**: Update the inference method and result decoding based on your model's API. 234 | 4. **Test the Script**: Run the script on sample data to ensure it works correctly with your model. 235 | 236 | --- 237 | 238 | By following this guide, you can adapt the script to work with your own model and perform Close-Ended Evaluation seamlessly! 
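
---

## 10. Scoring the Results (Optional)

This folder does not include a scoring script: `close_ended.py` only produces the prediction JSON described in Section 5. If you want a quick local estimate before submitting, the sketch below compares the option letter found in `model_output` against `correct_answer` and reports overall and per-task accuracy. The script name, the output path `save.json`, and the simple letter-matching rule are illustrative assumptions and may differ from the official leaderboard scoring.

```python
import json
import re
import sys
from collections import defaultdict

# Minimal scoring sketch (not the official evaluation): compares the option
# letter found in `model_output` with `correct_answer` and prints overall and
# per-task accuracy over the JSON written by close_ended.py.

def extract_letter(text):
    """Return the first option letter A-D in the text, preferring '(A)'-style matches."""
    text = (text or "").strip()
    match = re.search(r"\(([A-D])\)", text) or re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None

def main(result_path):
    with open(result_path, "r") as f:
        results = json.load(f)

    totals, correct = defaultdict(int), defaultdict(int)
    for item in results:
        task = item["task"]
        gold = extract_letter(item["correct_answer"])
        pred = extract_letter(item.get("model_output", ""))
        totals[task] += 1
        correct[task] += int(pred is not None and pred == gold)

    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    print(f"Overall accuracy: {overall:.1%}")
    for task in sorted(totals):
        print(f"{task}: {correct[task] / totals[task]:.1%} ({correct[task]}/{totals[task]})")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python score_results.py /path/to/save.json")
    main(sys.argv[1])
```

Note that answers without a recognizable option letter are counted as incorrect, so this local estimate can be slightly conservative for models that reply in free-form text.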
239 | -------------------------------------------------------------------------------- /eval/close_ended.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from tqdm import tqdm 4 | from longva.model.builder import load_pretrained_model 5 | from longva.mm_utils import tokenizer_image_token, process_images 6 | from longva.constants import IMAGE_TOKEN_INDEX 7 | from PIL import Image 8 | from decord import VideoReader, cpu 9 | import torch 10 | import numpy as np 11 | import argparse 12 | 13 | # fix seed 14 | torch.manual_seed(0) 15 | 16 | import warnings 17 | 18 | # Suppress all UserWarnings 19 | warnings.filterwarnings("ignore", category=UserWarning) 20 | warnings.filterwarnings("ignore", category=FutureWarning) 21 | 22 | parser = argparse.ArgumentParser(description='Evaluate close end task') 23 | parser.add_argument('-mp', '--model_path', default='', type=str, help='path of model') 24 | parser.add_argument('-sn', '--save_path', default='', type=str, help='this will influence the saving path') 25 | parser.add_argument('-we', '--with_evidence', default=False, type=bool, help='with evidence or not') 26 | parser.add_argument('-jp', '--json_path', default='bench_final.json', type=str, help='path of json file') 27 | args = parser.parse_args() 28 | 29 | 30 | # Configure model path and device 31 | model_path = args.model_path 32 | save_path = args.save_path 33 | device = 0 34 | 35 | tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2") 36 | model.eval() 37 | 38 | # Load and parse the JSON file 39 | json_file_path = args.json_path 40 | with open(json_file_path, "r") as f: 41 | data = json.load(f) 42 | 43 | # Define the inference function 44 | def run_inference(video_path, question): 45 | #video input 46 | fps = 24 47 | gen_kwargs = {"do_sample": True, "temperature": 0.8, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 512} 48 | prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\n"+question+"<|im_end|>\n<|im_start|>assistant\n" 49 | input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device) 50 | vr = VideoReader(video_path, ctx=cpu(0)) 51 | f = vr.get_avg_fps() 52 | duration = len(vr) / f 53 | total_frame_num = len(vr) 54 | frame_input = min(128, int(fps * duration)+1) 55 | print("num frames: ", total_frame_num) 56 | print("num input: ", frame_input) 57 | if total_frame_num == 0: 58 | return "" 59 | if duration >= 0.5: 60 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, frame_input, dtype=int) 61 | else: 62 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, fps, dtype=int) 63 | frame_idx = uniform_sampled_frames.tolist() 64 | frames = vr.get_batch(frame_idx).asnumpy() 65 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16) 66 | with torch.inference_mode(): 67 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs) 68 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() 69 | return outputs 70 | 71 | def clean_options(option): 72 | cleaned_option = option.split("):", 1)[-1].strip() 73 | return cleaned_option 74 | 75 | # Process each entry and run inference 76 | output_data = [] 77 | output_file_template = save_path 78 | print(f"Save Path: 
{save_path}") 79 | 80 | for i, item in enumerate(tqdm(data, desc='Evaluated on Video Benchmark')): 81 | item = data[i] 82 | video_path = item["video_path"] 83 | 84 | options = item["options"] 85 | options_prompt = "" 86 | option_list = ["\n(A) ","(B) ","(C) ","(D) "] 87 | for j, opt in enumerate(options): 88 | options_prompt += option_list[j] + clean_options(opt) + "\n" 89 | correct_answer = item["correct_answer"] 90 | evidence = item["evidence"] 91 | task = item['task'] 92 | question = item['question'] 93 | 94 | if args.with_evidence: 95 | final_query = f"Please finish the {task} task. Question: {question}. Your inference evidence is {evidence}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 96 | else: 97 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 98 | 99 | # Call the model for inference 100 | model_output = run_inference(video_path, final_query) 101 | 102 | # Record the model's output options in the item 103 | item["model_output"] = model_output 104 | output_data.append(item) 105 | 106 | # Save data every 10 entries 107 | if (i + 1) % 10 == 0: 108 | with open(output_file_template, "w") as f: 109 | json.dump(output_data, f, ensure_ascii=False, indent=4) 110 | 111 | # Save the remaining data at the end 112 | with open(output_file_template, "w") as f: 113 | json.dump(output_data, f, ensure_ascii=False, indent=4) 114 | --------------------------------------------------------------------------------