├── README.md └── eval ├── .gitignore ├── README.md └── close_ended.py /README.md: -------------------------------------------------------------------------------- 1 | # VideoEspresso 2 | \[[Paper](https://arxiv.org/abs/2411.14794)] [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)] [[Train Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)] 3 | 4 | ## News: 5 | [2025/4/12] 🔥 We fixed the video version training annotation. 6 | 7 | [2025/4/4] 🔥 This paper was accepted as an **oral** presentation at CVPR'25! 8 | 9 | [2025/3/29] 🔥 The training set (video version) has been updated! [[Train Set (video)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_video)] 10 | 11 | [2025/3/24] 🔥 The training set (multi-image version) has been updated! [[Train Set (multi-image)](https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_multi_image)] 12 | 13 | [2025/2/27] 🔥 This paper has been accepted by CVPR'25! 14 | 15 | [2025/1/16] 🔥 The close-ended Leaderboard has been updated! 16 | 17 | [2024/12/17] 🔥 The close-ended benchmark has been updated! [[Close-Ended Evaluation](https://github.com/hshjerry/VideoEspresso/tree/main/eval)] 18 | 19 | [2024/12/16] 🔥 The test set has been released! Please check our huggingface repo. [[Test Set](https://huggingface.co/datasets/hshjerry0315/VideoEspresso-Test)] 20 | 21 | ## Leaderboard 22 | 23 | | Model | Params | Frames | Overall | Narrative Analysis | Event Dynamic | Preparation Steps | Causal Analysis | Theme Analysis | Contextual Analysis | Influence Analysis | Role Analysis | Interaction Analysis | Behavior Analysis | Emotion Analysis | Cooking Process | Traffic Analysis | Situation Analysis | 24 | |-----------------------|--------|--------|---------|---------------------|------------------------|------------------------------------------|-----------------|----------------|---------------------|--------------------|---------------|----------------------|------------------|------------------|----------------|------------------|--------------------| 25 | | **LLaVA-Video** | 72B | 64 | 66.3% | 68.4% | 66.2% | 74.5% | 62.7% | 62.3% | 71.6% | 62.5% | 63.5% | 67.7% | 63.2% | 60.0% | 75.5% | 76.7% | 74.0% | 26 | | **LLaVA-OneVision** | 72B | 64 | 63.2% | 76.0% | 61.8% | 71.4% | 57.5% | 62.3% | 68.8% | 62.5% | 55.6% | 58.1% | 56.1% | 63.1% | 77.4% | 70.0% | 74.0% | 27 | | **InternVL2.5** | 38B | 16 | 59.9% | 65.8% | 54.1% | 66.3% | 57.3% | 55.7% | 63.3% | 56.9% | 54.0% | 53.2% | 63.2% | 60.0% | 73.6% | 70.0% | 72.0% | 28 | | **gemini-1.5-pro** | - | 128 | 44.2% | 55.7% | 42.0% | 50.0% | 41.3% | 34.4% | 53.2% | 29.2% | 39.7% | 40.3% | 38.6% | 47.7% | 58.5% | 50.0% | 54.0% | 29 | | **Kangaroo** | 8B | 64 | 44.1% | 41.8% | 43.3% | 49.0% | 42.7% | 34.4% | 44.0% | 61.1% | 52.4% | 41.9% | 33.3% | 38.5% | 52.8% | 53.3% | 38.0% | 30 | | **Qwen-Max** | - | 4 | 42.7% | 44.3% | 35.7% | 45.9% | 39.7% | 44.3% | 54.1% | 43.1% | 47.6% | 35.5% | 45.6% | 41.5% | 49.1% | 46.7% | 46.0% | 31 | | **gemini-1.5-flash** | - | 128 | 39.8% | 59.5% | 45.2% | 38.8% | 34.7% | 32.8% | 45.9% | 30.6% | 42.9% | 43.6% | 33.3% | 38.5% | 41.5% | 36.7% | 46.0% | 32 | | **LongVA** | 7B | 128 | 39.7% | 40.5% | 33.8% | 43.9% | 35.9% | 42.6% | 42.2% | 51.4% | 47.6% | 40.3% | 35.1% | 32.3% | 39.6% | 56.7% | 48.0% | 33 | | **Qwen-VL-Chat** | 7B | 24 | 36.2% | 49.4% | 28.7% | 35.7% | 32.4% | 44.3% | 39.5% | 47.2% | 31.8% | 30.7% | 40.4% | 36.9% | 34.0% | 43.3% | 44.0% | 34 | | **VideoChat2-Mistral**| 7B | 16 | 32.1% | 31.7% | 28.7% | 27.6% | 
34.3% | 36.1% | 27.5% | 31.9% | 31.8% | 43.6% | 28.1% | 38.5% | 20.8% | 36.7% | 30.0% | 35 | | **Chat-UniVi-v1.5** | 7B | 64 | 25.5% | 24.1% | 22.9% | 21.4% | 24.2% | 27.9% | 30.3% | 30.6% | 25.4% | 27.4% | 22.8% | 30.8% | 18.9% | 36.7% | 28.0% | 36 | | **SliME** | 8B | 64 | 24.8% | 19.0% | 24.2% | 26.5% | 27.0% | 19.7% | 21.1% | 30.6% | 28.6% | 29.0% | 19.3% | 21.5% | 30.2% | 20.0% | 16.0% | 37 | | **Video-XL** | 7B | 64 | 24.6% | 25.3% | 28.0% | 22.5% | 26.5% | 23.0% | 21.1% | 26.4% | 20.6% | 27.4% | 28.1% | 18.5% | 13.2% | 36.7% | 18.0% | 38 | | **Long-LLava** | 7B | 64 | 13.8% | 8.9% | 16.6% | 19.4% | 13.9% | 16.4% | 12.8% | 13.9% | 14.3% | 12.9% | 1.8% | 29.2% | 7.6% | 3.3% | 8.0% | 39 | | **ShareGPT4Video** | 8B | 16 | 8.0% | 8.9% | 10.8% | 12.2% | 8.0% | 11.5% | 8.3% | 6.9% | 7.9% | 8.1% | 0.0% | 7.7% | 3.8% | 3.3% | 4.0% | 40 | 41 | ### How You Can Participate: 42 | - **Use our benchmark**: Feel free to test your models using our benchmark and share your results. 43 | - **Submit checkpoints**: Alternatively, you can provide your model checkpoints, and we will evaluate them and update the leaderboard for you. 44 | 45 | We look forward to your participation and contributions! 🌟 46 | 47 | 48 | ## Overall View: 49 |


52 | 53 | **Contact Us** 📧 54 | If you have any questions or want to submit your checkpoints, feel free to reach out to us via email: 55 | 56 | - [hshjerry0315@gmail.com](mailto:hshjerry0315@gmail.com) 57 | - [aaron.weihuang@gmail.com](mailto:aaron.weihuang@gmail.com) 58 | 59 | ## Citation: 60 | ``` 61 | @article{han2024videoespresso, 62 | title={VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection}, 63 | author={Han, Songhao and Huang, Wei and Shi, Hairong and Zhuo, Le and Su, Xiu and Zhang, Shifeng and Zhou, Xu and Qi, Xiaojuan and Liao, Yue and Liu, Si}, 64 | journal={arXiv preprint arXiv:2411.14794}, 65 | year={2024} 66 | } 67 | ``` 68 | -------------------------------------------------------------------------------- /eval/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /eval/README.md: -------------------------------------------------------------------------------- 1 | # Close-Ended Evaluation Guidance 2 | 3 | --- 4 | 5 | ## 1. Prerequisites 6 | 7 | Before running the script, ensure the following dependencies are installed: 8 | 9 | ### Required Dependencies 10 | - **Python Version**: 3.8 or higher 11 | - **Libraries**: 12 | - `json` 13 | - `os` 14 | - `tqdm` 15 | - `PIL` (Pillow library) 16 | - `decord` 17 | - `torch` 18 | - `numpy` 19 | - `argparse` 20 | 21 | ### Installation 22 | Use the following command to install most dependencies: 23 | 24 | ```bash 25 | pip install tqdm pillow decord torch numpy 26 | ``` 27 | 28 | ### Model-Specific Setup 29 | The script uses **longva** as an example for model configuration. To use your own model, you must replace the **longva**-specific components with your own model loading and processing code (explained in Section 6). 30 | 31 | --- 32 | 33 | ## 2. Script Overview 34 | 35 | The primary purpose of this script is to: 36 | 1. Load a pretrained video inference model. 37 | 2. Read input data from a specified JSON file. 38 | 3. Perform inference on each video and question. 39 | 4. Save inference results to a new JSON file. 40 | 41 | --- 42 | 43 | ## 3. Running the Script 44 | 45 | ### Command to Run 46 | You need to specify key parameters such as the model path and JSON file path when running the script. Here's an example command: 47 | 48 | ```bash 49 | python close_ended.py -mp /path/to/your_model -sn /path/to/save.json -we True -jp /path/to/bench_final.json 50 | ``` 51 | 52 | ### Parameter Details 53 | 54 | | Parameter | Short Form | Default Value | Type | Description | 55 | |--------------------|------------|------------------------|---------|-----------------------------------------------------------------------| 56 | | `--model_path` | `-mp` | Empty string (Required) | `str` | Path to your model or checkpoint. | 57 | | `--save_path` | `-sn` | Empty string (Required) | `str` | Path to save the inference results. | 58 | | `--with_evidence` | `-we` | `False` | `bool` | Whether to include evidence in the query. | 59 | | `--json_path` | `-jp` | `bench_final.json` | `str` | Path to the input JSON file containing video data and questions. | 60 | 61 | --- 62 | 63 | ## 4. 
Input JSON Structure 64 | 65 | The input JSON file should contain data in the following format: 66 | 67 | ```json 68 | [ 69 | { 70 | "video_path": "path/to/video.mp4", 71 | "question": "What is happening in the video?", 72 | "options": [ 73 | "(A): Option 1", 74 | "(B): Option 2", 75 | "(C): Option 3", 76 | "(D): Option 4" 77 | ], 78 | "correct_answer": "(A)", 79 | "evidence": "Some evidence information", 80 | "task": "Description of the task" 81 | } 82 | ] 83 | ``` 84 | 85 | ### Field Descriptions 86 | - **`video_path`**: Path to the video file. 87 | - **`question`**: Question related to the video. 88 | - **`options`**: Multiple-choice options. 89 | - **`correct_answer`**: Correct answer (optional). 90 | - **`evidence`**: Evidence to assist inference (optional). 91 | - **`task`**: Description of the task. 92 | 93 | --- 94 | 95 | ## 5. Output Results 96 | 97 | The script saves the results to the file specified by `--save_path`. The output JSON will include the model's predictions in the following format: 98 | 99 | ```json 100 | [ 101 | { 102 | "video_path": "path/to/video.mp4", 103 | "question": "What is happening in the video?", 104 | "options": [ 105 | "(A): Option 1", 106 | "(B): Option 2", 107 | "(C): Option 3", 108 | "(D): Option 4" 109 | ], 110 | "correct_answer": "(A)", 111 | "evidence": "Some evidence information", 112 | "task": "Description of the task", 113 | "model_output": "(A)" 114 | } 115 | ] 116 | ``` 117 | 118 | --- 119 | 120 | ## 6. Adapting the Model Configuration 121 | 122 | The provided script uses **longva** as an example for model configuration. If you want to use your own model, follow these steps to adapt the script. 123 | 124 | ### Replace `longva` Model Loading 125 | In the script, the following code is used to load the **longva** model: 126 | 127 | ```python 128 | from longva.model.builder import load_pretrained_model 129 | 130 | tokenizer, model, image_processor, _ = load_pretrained_model( 131 | model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2" 132 | ) 133 | model.eval() 134 | ``` 135 | 136 | To adapt this to your own model: 137 | 1. Import your model loading method (e.g., `from my_model_library import MyModel`). 138 | 2. Replace the `load_pretrained_model` function with your own logic to load the model, tokenizer, and any required preprocessing utilities. 139 | 140 | For example: 141 | 142 | ```python 143 | from my_model_library import MyModel, MyTokenizer, MyImageProcessor 144 | 145 | # Load your model, tokenizer, and image processor 146 | tokenizer = MyTokenizer.from_pretrained(model_path) 147 | model = MyModel.from_pretrained(model_path).to("cuda") # or "cpu" 148 | image_processor = MyImageProcessor() 149 | model.eval() 150 | ``` 151 | 152 | ### Update Video Frame Preprocessing 153 | The **longva** script uses an image processor for frames: 154 | 155 | ```python 156 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16) 157 | ``` 158 | 159 | To adapt this part: 160 | - Replace `image_processor.preprocess` with your model's expected preprocessing steps. 161 | - Ensure the video frames are converted into tensors compatible with your model. 
162 | 163 | For example: 164 | 165 | ```python 166 | video_tensor = torch.tensor(frames).float().permute(0, 3, 1, 2) # Convert frames to tensors 167 | video_tensor = video_tensor.to("cuda" if torch.cuda.is_available() else "cpu") # Move to device 168 | ``` 169 | 170 | ### Update Inference Logic 171 | The **longva** script generates predictions with: 172 | 173 | ```python 174 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs) 175 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() 176 | ``` 177 | 178 | To adapt this: 179 | - Replace `model.generate` with your model's inference method. 180 | - Update how the output is decoded into text. 181 | 182 | For example: 183 | 184 | ```python 185 | # Perform inference using your model 186 | with torch.no_grad(): 187 | outputs = model(video_tensor, input_ids) 188 | 189 | # Decode the output if necessary 190 | decoded_output = tokenizer.decode(outputs, skip_special_tokens=True) 191 | ``` 192 | 193 | --- 194 | 195 | ## 7. Key Functionalities in the Script 196 | 197 | ### 1. Video Frame Sampling 198 | The script uses `decord` to load and uniformly sample frames from the video: 199 | 200 | ```python 201 | vr = VideoReader(video_path, ctx=cpu(0)) 202 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, frame_input, dtype=int) 203 | frames = vr.get_batch(frame_idx).asnumpy() 204 | ``` 205 | 206 | Adapt this part if your model requires a different frame sampling strategy. 207 | 208 | ### 2. Inference Query Construction 209 | The script dynamically constructs inference queries based on the input question, task, and options: 210 | 211 | ```python 212 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 213 | ``` 214 | 215 | Customize this to align with your model's expected input format. 216 | 217 | --- 218 | 219 | ## 8. Notes 220 | 221 | 1. **Device Support**: The model defaults to using a GPU. If no GPU is available, the script will use a CPU, which may be slower. 222 | 2. **Video Format**: Ensure your video files are compatible with the `decord` library or your chosen video processing library. 223 | 3. **Error Handling**: If a video cannot be processed, the script may skip it. Add error-handling logic as needed. 224 | 225 | --- 226 | 227 | ## 9. Example Workflow 228 | 229 | Here’s a general workflow for adapting this script to your own model: 230 | 231 | 1. **Replace Model Loading**: Update the imports and logic to load your custom model, tokenizer, and processor. 232 | 2. **Adapt Preprocessing**: Adjust the preprocessing pipeline for video frames to match your model's requirements. 233 | 3. **Modify Inference Logic**: Update the inference method and result decoding based on your model's API. 234 | 4. **Test the Script**: Run the script on sample data to ensure it works correctly with your model. 235 | 236 | --- 237 | 238 | By following this guide, you can adapt the script to work with your own model and perform Close-Ended Evaluation seamlessly! 
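
---

## 10. Scoring the Results (Optional)

This folder does not include a scoring script: `close_ended.py` only produces the prediction JSON described in Section 5. If you want a quick local estimate before submitting, the sketch below compares the option letter found in `model_output` against `correct_answer` and reports overall and per-task accuracy. The script name, the output path `save.json`, and the simple letter-matching rule are illustrative assumptions and may differ from the official leaderboard scoring.

```python
import json
import re
import sys
from collections import defaultdict

# Minimal scoring sketch (not the official evaluation): compares the option
# letter found in `model_output` with `correct_answer` and prints overall and
# per-task accuracy over the JSON written by close_ended.py.

def extract_letter(text):
    """Return the first option letter A-D in the text, preferring '(A)'-style matches."""
    text = (text or "").strip()
    match = re.search(r"\(([A-D])\)", text) or re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None

def main(result_path):
    with open(result_path, "r") as f:
        results = json.load(f)

    totals, correct = defaultdict(int), defaultdict(int)
    for item in results:
        task = item["task"]
        gold = extract_letter(item["correct_answer"])
        pred = extract_letter(item.get("model_output", ""))
        totals[task] += 1
        correct[task] += int(pred is not None and pred == gold)

    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    print(f"Overall accuracy: {overall:.1%}")
    for task in sorted(totals):
        print(f"{task}: {correct[task] / totals[task]:.1%} ({correct[task]}/{totals[task]})")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python score_results.py /path/to/save.json")
    main(sys.argv[1])
```

Note that answers without a recognizable option letter are counted as incorrect, so this local estimate can be slightly conservative for models that reply in free-form text.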
239 | -------------------------------------------------------------------------------- /eval/close_ended.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from tqdm import tqdm 4 | from longva.model.builder import load_pretrained_model 5 | from longva.mm_utils import tokenizer_image_token, process_images 6 | from longva.constants import IMAGE_TOKEN_INDEX 7 | from PIL import Image 8 | from decord import VideoReader, cpu 9 | import torch 10 | import numpy as np 11 | import argparse 12 | 13 | # fix seed 14 | torch.manual_seed(0) 15 | 16 | import warnings 17 | 18 | # Suppress all UserWarnings 19 | warnings.filterwarnings("ignore", category=UserWarning) 20 | warnings.filterwarnings("ignore", category=FutureWarning) 21 | 22 | parser = argparse.ArgumentParser(description='Evaluate close end task') 23 | parser.add_argument('-mp', '--model_path', default='', type=str, help='path of model') 24 | parser.add_argument('-sn', '--save_path', default='', type=str, help='this will influence the saving path') 25 | parser.add_argument('-we', '--with_evidence', default=False, type=bool, help='with evidence or not') 26 | parser.add_argument('-jp', '--json_path', default='bench_final.json', type=str, help='path of json file') 27 | args = parser.parse_args() 28 | 29 | 30 | # Configure model path and device 31 | model_path = args.model_path 32 | save_path = args.save_path 33 | device = 0 34 | 35 | tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="auto", attn_implementation="flash_attention_2") 36 | model.eval() 37 | 38 | # Load and parse the JSON file 39 | json_file_path = args.json_path 40 | with open(json_file_path, "r") as f: 41 | data = json.load(f) 42 | 43 | # Define the inference function 44 | def run_inference(video_path, question): 45 | #video input 46 | fps = 24 47 | gen_kwargs = {"do_sample": True, "temperature": 0.8, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 512} 48 | prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\n"+question+"<|im_end|>\n<|im_start|>assistant\n" 49 | input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device) 50 | vr = VideoReader(video_path, ctx=cpu(0)) 51 | f = vr.get_avg_fps() 52 | duration = len(vr) / f 53 | total_frame_num = len(vr) 54 | frame_input = min(128, int(fps * duration)+1) 55 | print("num frames: ", total_frame_num) 56 | print("num input: ", frame_input) 57 | if total_frame_num == 0: 58 | return "" 59 | if duration >= 0.5: 60 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, frame_input, dtype=int) 61 | else: 62 | uniform_sampled_frames = np.linspace(0, total_frame_num - 1, fps, dtype=int) 63 | frame_idx = uniform_sampled_frames.tolist() 64 | frames = vr.get_batch(frame_idx).asnumpy() 65 | video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16) 66 | with torch.inference_mode(): 67 | output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs) 68 | outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() 69 | return outputs 70 | 71 | def clean_options(option): 72 | cleaned_option = option.split("):", 1)[-1].strip() 73 | return cleaned_option 74 | 75 | # Process each entry and run inference 76 | output_data = [] 77 | output_file_template = save_path 78 | print(f"Save Path: 
{save_path}") 79 | 80 | for i, item in enumerate(tqdm(data, desc='Evaluated on Video Benchmark')): 81 | item = data[i] 82 | video_path = item["video_path"] 83 | 84 | options = item["options"] 85 | options_prompt = "" 86 | option_list = ["\n(A) ","(B) ","(C) ","(D) "] 87 | for j, opt in enumerate(options): 88 | options_prompt += option_list[j] + clean_options(opt) + "\n" 89 | correct_answer = item["correct_answer"] 90 | evidence = item["evidence"] 91 | task = item['task'] 92 | question = item['question'] 93 | 94 | if args.with_evidence: 95 | final_query = f"Please finish the {task} task. Question: {question}. Your inference evidence is {evidence}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 96 | else: 97 | final_query = f"Please finish the {task} task. Question: {question}. You have the following options: {options_prompt}. Select the answer and only give the option letters." 98 | 99 | # Call the model for inference 100 | model_output = run_inference(video_path, final_query) 101 | 102 | # Record the model's output options in the item 103 | item["model_output"] = model_output 104 | output_data.append(item) 105 | 106 | # Save data every 10 entries 107 | if (i + 1) % 10 == 0: 108 | with open(output_file_template, "w") as f: 109 | json.dump(output_data, f, ensure_ascii=False, indent=4) 110 | 111 | # Save the remaining data at the end 112 | with open(output_file_template, "w") as f: 113 | json.dump(output_data, f, ensure_ascii=False, indent=4) 114 | --------------------------------------------------------------------------------