├── README.md
└── eval
├── .gitignore
├── README.md
└── close_ended.py
/README.md:
--------------------------------------------------------------------------------
1 | # VideoEspresso
2 | [[Paper](https://arxiv.org/abs/2411.14794)] [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)] [[Train Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)]
3 |
4 | ## News:
5 | [2025/4/12] 🔥 We fixed the training annotations for the video version.
6 |
7 | [2025/4/4] 🔥 This paper was accepted as an **oral** presentation at CVPR'25!
8 |
9 | [2025/3/29] 🔥 The training set (video version) has been updated! [[Train Set (video)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)]
10 |
11 | [2025/3/24] 🔥 The training set (multi-image version) has been updated! [[Train Set (multi-image)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_multi_image)]
12 |
13 | [2025/2/27] 🔥 This paper has been accepted by CVPR'25!
14 |
15 | [2025/1/16] 🔥 The close-ended Leaderboard has been updated!
16 |
17 | [2024/12/17] 🔥 The close-ended benchmark has been updated! [[Close-Ended Evaluation](https://github.com/hshjerry/VideoEspresso/tree/main/eval)]
18 |
19 | [2024/12/16] 🔥 The test set has been released! Please check our Hugging Face repo. [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)]
20 |
21 | ## Leaderboard
22 |
23 | | Model | Params | Frames | Overall | Narrative Analysis | Event Dynamic | Preparation Steps | Causal Analysis | Theme Analysis | Contextual Analysis | Influence Analysis | Role Analysis | Interaction Analysis | Behavior Analysis | Emotion Analysis | Cooking Process | Traffic Analysis | Situation Analysis |
24 | |-----------------------|--------|--------|---------|---------------------|------------------------|------------------------------------------|-----------------|----------------|---------------------|--------------------|---------------|----------------------|------------------|------------------|----------------|------------------|--------------------|
25 | | **LLaVA-Video** | 72B | 64 | 66.3% | 68.4% | 66.2% | 74.5% | 62.7% | 62.3% | 71.6% | 62.5% | 63.5% | 67.7% | 63.2% | 60.0% | 75.5% | 76.7% | 74.0% |
26 | | **LLaVA-OneVision** | 72B | 64 | 63.2% | 76.0% | 61.8% | 71.4% | 57.5% | 62.3% | 68.8% | 62.5% | 55.6% | 58.1% | 56.1% | 63.1% | 77.4% | 70.0% | 74.0% |
27 | | **InternVL2.5** | 38B | 16 | 59.9% | 65.8% | 54.1% | 66.3% | 57.3% | 55.7% | 63.3% | 56.9% | 54.0% | 53.2% | 63.2% | 60.0% | 73.6% | 70.0% | 72.0% |
28 | | **gemini-1.5-pro** | - | 128 | 44.2% | 55.7% | 42.0% | 50.0% | 41.3% | 34.4% | 53.2% | 29.2% | 39.7% | 40.3% | 38.6% | 47.7% | 58.5% | 50.0% | 54.0% |
29 | | **Kangaroo** | 8B | 64 | 44.1% | 41.8% | 43.3% | 49.0% | 42.7% | 34.4% | 44.0% | 61.1% | 52.4% | 41.9% | 33.3% | 38.5% | 52.8% | 53.3% | 38.0% |
30 | | **Qwen-Max** | - | 4 | 42.7% | 44.3% | 35.7% | 45.9% | 39.7% | 44.3% | 54.1% | 43.1% | 47.6% | 35.5% | 45.6% | 41.5% | 49.1% | 46.7% | 46.0% |
31 | | **gemini-1.5-flash** | - | 128 | 39.8% | 59.5% | 45.2% | 38.8% | 34.7% | 32.8% | 45.9% | 30.6% | 42.9% | 43.6% | 33.3% | 38.5% | 41.5% | 36.7% | 46.0% |
32 | | **LongVA** | 7B | 128 | 39.7% | 40.5% | 33.8% | 43.9% | 35.9% | 42.6% | 42.2% | 51.4% | 47.6% | 40.3% | 35.1% | 32.3% | 39.6% | 56.7% | 48.0% |
33 | | **Qwen-VL-Chat** | 7B | 24 | 36.2% | 49.4% | 28.7% | 35.7% | 32.4% | 44.3% | 39.5% | 47.2% | 31.8% | 30.7% | 40.4% | 36.9% | 34.0% | 43.3% | 44.0% |
34 | | **VideoChat2-Mistral**| 7B | 16 | 32.1% | 31.7% | 28.7% | 27.6% | 34.3% | 36.1% | 27.5% | 31.9% | 31.8% | 43.6% | 28.1% | 38.5% | 20.8% | 36.7% | 30.0% |
35 | | **Chat-UniVi-v1.5** | 7B | 64 | 25.5% | 24.1% | 22.9% | 21.4% | 24.2% | 27.9% | 30.3% | 30.6% | 25.4% | 27.4% | 22.8% | 30.8% | 18.9% | 36.7% | 28.0% |
36 | | **SliME** | 8B | 64 | 24.8% | 19.0% | 24.2% | 26.5% | 27.0% | 19.7% | 21.1% | 30.6% | 28.6% | 29.0% | 19.3% | 21.5% | 30.2% | 20.0% | 16.0% |
37 | | **Video-XL** | 7B | 64 | 24.6% | 25.3% | 28.0% | 22.5% | 26.5% | 23.0% | 21.1% | 26.4% | 20.6% | 27.4% | 28.1% | 18.5% | 13.2% | 36.7% | 18.0% |
38 | | **Long-LLava** | 7B | 64 | 13.8% | 8.9% | 16.6% | 19.4% | 13.9% | 16.4% | 12.8% | 13.9% | 14.3% | 12.9% | 1.8% | 29.2% | 7.6% | 3.3% | 8.0% |
39 | | **ShareGPT4Video** | 8B | 16 | 8.0% | 8.9% | 10.8% | 12.2% | 8.0% | 11.5% | 8.3% | 6.9% | 7.9% | 8.1% | 0.0% | 7.7% | 3.8% | 3.3% | 4.0% |
40 |
41 | ### How You Can Participate:
42 | - **Use our benchmark**: Feel free to test your models using our benchmark and share your results.
43 | - **Submit checkpoints**: Alternatively, you can provide your model checkpoints, and we will evaluate them and update the leaderboard for you.
44 |
45 | We look forward to your participation and contributions! 🌟
46 |
47 |
48 | ## Overall View:
49 |
50 |
51 |
52 |
53 | **Contact Us** 📧
54 | If you have any questions or want to submit your checkpoints, feel free to reach out to us via email:
55 |
56 | - [hshjerry0315@gmail.com](mailto:hshjerry0315@gmail.com)
57 | - [aaron.weihuang@gmail.com](mailto:aaron.weihuang@gmail.com)
58 |
59 | ## Citation:
60 | ```bibtex
61 | @article{han2024videoespresso,
62 | title={VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection},
63 | author={Han, Songhao and Huang, Wei and Shi, Hairong and Zhuo, Le and Su, Xiu and Zhang, Shifeng and Zhou, Xu and Qi, Xiaojuan and Liao, Yue and Liu, Si},
64 | journal={arXiv preprint arXiv:2411.14794},
65 | year={2024}
66 | }
67 | ```
68 |
--------------------------------------------------------------------------------
/eval/.gitignore:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/eval/README.md:
--------------------------------------------------------------------------------
1 | # Close-Ended Evaluation Guidance
2 |
3 | ---
4 |
5 | ## 1. Prerequisites
6 |
7 | Before running the script, ensure the following dependencies are installed:
8 |
9 | ### Required Dependencies
10 | - **Python Version**: 3.8 or higher
11 | - **Libraries**:
12 |   - `tqdm`
13 |   - `PIL` (the Pillow package)
14 |   - `decord`
15 |   - `torch`
16 |   - `numpy`
17 |   - `json` (standard library, no installation needed)
18 |   - `os` (standard library, no installation needed)
19 |   - `argparse` (standard library, no installation needed)
20 |
21 | ### Installation
22 | Use the following command to install the third-party dependencies:
23 |
24 | ```bash
25 | pip install tqdm pillow decord torch numpy
26 | ```
27 |
28 | ### Model-Specific Setup
29 | The script uses **longva** as an example for model configuration. To run it as-is, you also need the `longva` package installed (it loads the model with `attn_implementation="flash_attention_2"`, so flash-attention must be available). To use your own model, replace the **longva**-specific components with your own model loading and processing code (explained in Section 6).
30 |
31 | ---
32 |
33 | ## 2. Script Overview
34 |
35 | The primary purpose of this script is to:
36 | 1. Load a pretrained video inference model.
37 | 2. Read input data from a specified JSON file.
38 | 3. Perform inference on each video and question.
39 | 4. Save inference results to a new JSON file.
40 |
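At a high level, `close_ended.py` follows the flow below. This is a condensed sketch, not a drop-in replacement: `args` stands for the parsed command-line arguments, and `load_model`, `build_query`, and `run_inference` are placeholders for the longva-specific code described in Sections 6 and 7.

```python
import json
from tqdm import tqdm

# 1. Load a pretrained video inference model (longva in the reference script).
tokenizer, model, image_processor = load_model(args.model_path)      # placeholder

# 2. Read input data from the specified JSON file.
with open(args.json_path) as f:
    data = json.load(f)

# 3. Perform inference on each video and question.
results = []
for item in tqdm(data):
    query = build_query(item)                                        # question + options (+ evidence)
    item["model_output"] = run_inference(item["video_path"], query)  # placeholder
    results.append(item)

# 4. Save inference results to a new JSON file.
with open(args.save_path, "w") as f:
    json.dump(results, f, ensure_ascii=False, indent=4)
```
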
41 | ---
42 |
43 | ## 3. Running the Script
44 |
45 | ### Command to Run
46 | You need to specify key parameters such as the model path and JSON file path when running the script. Here's an example command:
47 |
48 | ```bash
49 | python close_ended.py -mp /path/to/your_model -sn /path/to/save.json -jp /path/to/bench_final.json -we
50 | ```
51 |
52 | ### Parameter Details
53 |
54 | | Parameter | Short Form | Default Value | Type | Description |
55 | |--------------------|------------|------------------------|---------|-----------------------------------------------------------------------|
56 | | `--model_path` | `-mp` | Empty string (Required) | `str` | Path to your model or checkpoint. |
57 | | `--save_path` | `-sn` | Empty string (Required) | `str` | Path to save the inference results. |
58 | | `--with_evidence`  | `-we`      | Off                    | flag    | Pass this flag to include the evidence field in the query.            |
59 | | `--json_path` | `-jp` | `bench_final.json` | `str` | Path to the input JSON file containing video data and questions. |
60 |
61 | ---
62 |
63 | ## 4. Input JSON Structure
64 |
65 | The input JSON file should contain data in the following format:
66 |
67 | ```json
68 | [
69 | {
70 | "video_path": "path/to/video.mp4",
71 | "question": "What is happening in the video?",
72 | "options": [
73 | "(A): Option 1",
74 | "(B): Option 2",
75 | "(C): Option 3",
76 | "(D): Option 4"
77 | ],
78 | "correct_answer": "(A)",
79 | "evidence": "Some evidence information",
80 | "task": "Description of the task"
81 | }
82 | ]
83 | ```
84 |
85 | ### Field Descriptions
86 | - **`video_path`**: Path to the video file.
87 | - **`question`**: Question related to the video.
88 | - **`options`**: Multiple-choice options.
89 | - **`correct_answer`**: Correct answer (optional).
90 | - **`evidence`**: Evidence to assist inference (optional).
91 | - **`task`**: Description of the task.
92 |
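If you prepare your own input file, a quick sanity check like the one below can catch missing fields before a long evaluation run. This is only a sketch; the key names match the fields above, and the reference script assumes at most four options labelled (A)–(D).

```python
import json

REQUIRED = ["video_path", "question", "options", "task"]

with open("bench_final.json") as f:          # or your own input file
    items = json.load(f)

for i, item in enumerate(items):
    missing = [k for k in REQUIRED if k not in item]
    if missing:
        print(f"item {i}: missing fields {missing}")
    if not 1 <= len(item.get("options", [])) <= 4:
        print(f"item {i}: expected 1-4 options, got {len(item.get('options', []))}")
```
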
93 | ---
94 |
95 | ## 5. Output Results
96 |
97 | The script saves the results to the file specified by `--save_path`. The output JSON will include the model's predictions in the following format:
98 |
99 | ```json
100 | [
101 | {
102 | "video_path": "path/to/video.mp4",
103 | "question": "What is happening in the video?",
104 | "options": [
105 | "(A): Option 1",
106 | "(B): Option 2",
107 | "(C): Option 3",
108 | "(D): Option 4"
109 | ],
110 | "correct_answer": "(A)",
111 | "evidence": "Some evidence information",
112 | "task": "Description of the task",
113 | "model_output": "(A)"
114 | }
115 | ]
116 | ```
117 |
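The script only records predictions; it does not compute accuracy. A minimal post-processing sketch (assuming the model answers with an option letter such as `(A)`, and that `correct_answer` is present) could look like the following. Treat it as a quick sanity check rather than the official leaderboard scoring:

```python
import json
import re
from collections import defaultdict

def extract_letter(text):
    # Prefer "(A)"-style answers, fall back to a standalone letter.
    m = re.search(r"\(([ABCD])\)", text) or re.search(r"\b([ABCD])\b", text)
    return m.group(1) if m else None

with open("save.json") as f:                 # the file written via --save_path
    results = json.load(f)

per_task = defaultdict(lambda: [0, 0])       # task -> [correct, total]
for item in results:
    gt = extract_letter(item["correct_answer"])
    pred = extract_letter(item.get("model_output", ""))
    per_task[item["task"]][0] += int(pred is not None and pred == gt)
    per_task[item["task"]][1] += 1

correct = sum(c for c, _ in per_task.values())
total = sum(t for _, t in per_task.values())
print(f"Overall accuracy: {correct / total:.1%}")
for task, (c, t) in sorted(per_task.items()):
    print(f"{task}: {c / t:.1%} ({c}/{t})")
```
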
118 | ---
119 |
120 | ## 6. Adapting the Model Configuration
121 |
122 | The provided script uses **longva** as an example for model configuration. If you want to use your own model, follow these steps to adapt the script.
123 |
124 | ### Replace `longva` Model Loading
125 | In the script, the following code is used to load the **longva** model:
126 |
127 | ```python
128 | from longva.model.builder import load_pretrained_model
129 |
130 | tokenizer, model, image_processor, _ = load_pretrained_model(
131 | model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2"
132 | )
133 | model.eval()
134 | ```
135 |
136 | To adapt this to your own model:
137 | 1. Import your model loading method (e.g., `from my_model_library import MyModel`).
138 | 2. Replace the `load_pretrained_model` function with your own logic to load the model, tokenizer, and any required preprocessing utilities.
139 |
140 | For example:
141 |
142 | ```python
143 | from my_model_library import MyModel, MyTokenizer, MyImageProcessor
144 |
145 | # Load your model, tokenizer, and image processor
146 | tokenizer = MyTokenizer.from_pretrained(model_path)
147 | model = MyModel.from_pretrained(model_path).to("cuda") # or "cpu"
148 | image_processor = MyImageProcessor()
149 | model.eval()
150 | ```
151 |
152 | ### Update Video Frame Preprocessing
153 | The **longva** script uses an image processor for frames:
154 |
155 | ```python
156 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)
157 | ```
158 |
159 | To adapt this part:
160 | - Replace `image_processor.preprocess` with your model's expected preprocessing steps.
161 | - Ensure the video frames are converted into tensors compatible with your model.
162 |
163 | For example:
164 |
165 | ```python
166 | video_tensor = torch.tensor(frames).float().permute(0, 3, 1, 2) # Convert frames to tensors
167 | video_tensor = video_tensor.to("cuda" if torch.cuda.is_available() else "cpu") # Move to device
168 | ```
169 |
170 | ### Update Inference Logic
171 | The **longva** script generates predictions with:
172 |
173 | ```python
174 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
175 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
176 | ```
177 |
178 | To adapt this:
179 | - Replace `model.generate` with your model's inference method.
180 | - Update how the output is decoded into text.
181 |
182 | For example:
183 |
184 | ```python
185 | # Perform inference using your model
186 | with torch.no_grad():
187 | outputs = model(video_tensor, input_ids)
188 |
189 | # Decode the output if necessary
190 | decoded_output = tokenizer.decode(outputs, skip_special_tokens=True)
191 | ```
192 |
193 | ---
194 |
195 | ## 7. Key Functionalities in the Script
196 |
197 | ### 1. Video Frame Sampling
198 | The script uses `decord` to load and uniformly sample frames from the video:
199 |
200 | ```python
201 | vr = VideoReader(video_path, ctx=cpu(0))
202 | uniform_sampled_frames = np.linspace(0, len(vr) - 1, frame_input, dtype=int)
203 | frames = vr.get_batch(uniform_sampled_frames.tolist()).asnumpy()
204 | ```
205 |
206 | Adapt this part if your model requires a different frame sampling strategy.
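
For example, if your model expects a fixed number of frames regardless of video length, a variant of the snippet above (with a hypothetical budget of 32 frames) would be:

```python
num_frames = 32  # hypothetical fixed budget; use whatever your model expects
frame_idx = np.linspace(0, len(vr) - 1, num=min(num_frames, len(vr)), dtype=int).tolist()
frames = vr.get_batch(frame_idx).asnumpy()
```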
207 |
208 | ### 2. Inference Query Construction
209 | The script dynamically constructs inference queries based on the input question, task, and options:
210 |
211 | ```python
212 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters."
213 | ```
214 |
215 | Customize this to align with your model's expected input format.
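
For reference, `options_prompt` is built from the `options` field by stripping the `(X):` prefixes and re-labelling the choices, as in `close_ended.py`:

```python
def clean_options(option):
    # "(A): Option 1" -> "Option 1"
    return option.split("):", 1)[-1].strip()

option_list = ["\n(A) ", "(B) ", "(C) ", "(D) "]
options_prompt = ""
for j, opt in enumerate(item["options"]):
    options_prompt += option_list[j] + clean_options(opt) + "\n"
```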
216 |
217 | ---
218 |
219 | ## 8. Notes
220 |
221 | 1. **Device Support**: The reference **longva** setup loads the model with `device_map="auto"` and flash-attention, which effectively requires a GPU. If you need CPU inference, adapt the model-loading code accordingly and expect it to be much slower.
222 | 2. **Video Format**: Ensure your video files are compatible with the `decord` library or your chosen video processing library.
223 | 3. **Error Handling**: The script does not catch decoding failures, so an unreadable video will stop the run. Add error-handling logic as needed; a minimal pattern is sketched below.
224 |
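A minimal pattern, wrapping the inference call inside the main loop of `close_ended.py`, keeps one bad file from ending the whole run:

```python
# Inside the main loop, after final_query has been built:
try:
    model_output = run_inference(video_path, final_query)
except Exception as e:  # e.g. a corrupt or unreadable video file
    print(f"Skipping {video_path}: {e}")
    model_output = ""
item["model_output"] = model_output
output_data.append(item)
```
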
225 | ---
226 |
227 | ## 9. Example Workflow
228 |
229 | Here’s a general workflow for adapting this script to your own model:
230 |
231 | 1. **Replace Model Loading**: Update the imports and logic to load your custom model, tokenizer, and processor.
232 | 2. **Adapt Preprocessing**: Adjust the preprocessing pipeline for video frames to match your model's requirements.
233 | 3. **Modify Inference Logic**: Update the inference method and result decoding based on your model's API.
234 | 4. **Test the Script**: Run the script on sample data to ensure it works correctly with your model.
235 |
236 | ---
237 |
238 | By following this guide, you can adapt the script to work with your own model and perform Close-Ended Evaluation seamlessly!
239 |
--------------------------------------------------------------------------------
/eval/close_ended.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from tqdm import tqdm
4 | from longva.model.builder import load_pretrained_model
5 | from longva.mm_utils import tokenizer_image_token, process_images
6 | from longva.constants import IMAGE_TOKEN_INDEX
7 | from PIL import Image
8 | from decord import VideoReader, cpu
9 | import torch
10 | import numpy as np
11 | import argparse
12 |
13 | # fix seed
14 | torch.manual_seed(0)
15 |
16 | import warnings
17 |
18 | # Suppress all UserWarnings
19 | warnings.filterwarnings("ignore", category=UserWarning)
20 | warnings.filterwarnings("ignore", category=FutureWarning)
21 |
22 | parser = argparse.ArgumentParser(description='Evaluate close end task')
23 | parser.add_argument('-mp', '--model_path', default='', type=str, help='path of model')
24 | parser.add_argument('-sn', '--save_path', default='', type=str, help='path to save the inference results')
25 | parser.add_argument('-we', '--with_evidence', action='store_true', help='include evidence in the query')
26 | parser.add_argument('-jp', '--json_path', default='bench_final.json', type=str, help='path of json file')
27 | args = parser.parse_args()
28 |
29 |
30 | # Configure model path and device
31 | model_path = args.model_path
32 | save_path = args.save_path
33 | device = 0
34 |
35 | tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2")
36 | model.eval()
37 |
38 | # Load and parse the JSON file
39 | json_file_path = args.json_path
40 | with open(json_file_path, "r") as f:
41 | data = json.load(f)
42 |
43 | # Define the inference function
44 | def run_inference(video_path, question):
45 | #video input
46 | fps = 24
47 | gen_kwargs = {"do_sample": True, "temperature": 0.8, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 512}
48 |     prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\n"+question+"<|im_end|>\n<|im_start|>assistant\n"
49 | input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
50 | vr = VideoReader(video_path, ctx=cpu(0))
51 | f = vr.get_avg_fps()
52 | duration = len(vr) / f
53 | total_frame_num = len(vr)
54 | frame_input = min(128, int(fps * duration)+1)
55 | print("num frames: ", total_frame_num)
56 | print("num input: ", frame_input)
57 | if total_frame_num == 0:
58 | return ""
59 | if duration >= 0.5:
60 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, frame_input, dtype=int)
61 | else:
62 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, fps, dtype=int)
63 | frame_idx = uniform_sampled_frames.tolist()
64 | frames = vr.get_batch(frame_idx).asnumpy()
65 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)
66 | with torch.inference_mode():
67 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
68 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
69 | return outputs
70 |
71 | def clean_options(option):
72 | cleaned_option = option.split("):", 1)[-1].strip()
73 | return cleaned_option
74 |
75 | # Process each entry and run inference
76 | output_data = []
77 | output_file_template = save_path
78 | print(f"Save Path: {save_path}")
79 |
80 | for i, item in enumerate(tqdm(data, desc='Evaluated on Video Benchmark')):
82 | video_path = item["video_path"]
83 |
84 | options = item["options"]
85 | options_prompt = ""
86 | option_list = ["\n(A) ","(B) ","(C) ","(D) "]
87 | for j, opt in enumerate(options):
88 | options_prompt += option_list[j] + clean_options(opt) + "\n"
89 | correct_answer = item["correct_answer"]
90 | evidence = item["evidence"]
91 | task = item['task']
92 | question = item['question']
93 |
94 | if args.with_evidence:
95 | final_query = f"Please finish the {task} task. Question: {question}. Your inference evidence is {evidence}. You have the following options: {options_prompt}. Select the answer and only give the option letters."
96 | else:
97 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters."
98 |
99 | # Call the model for inference
100 | model_output = run_inference(video_path, final_query)
101 |
102 | # Record the model's output options in the item
103 | item["model_output"] = model_output
104 | output_data.append(item)
105 |
106 | # Save data every 10 entries
107 | if (i + 1) % 10 == 0:
108 | with open(output_file_template, "w") as f:
109 | json.dump(output_data, f, ensure_ascii=False, indent=4)
110 |
111 | # Save the remaining data at the end
112 | with open(output_file_template, "w") as f:
113 | json.dump(output_data, f, ensure_ascii=False, indent=4)
114 |
--------------------------------------------------------------------------------