├── .gitignore
├── README.md
├── assets
│   ├── radar.png
│   ├── v_star_logo.png
│   ├── vis.png
│   └── vis_v2.png
├── data
│   ├── README.md
│   └── V_STaR_test.json
├── eval.py
└── inference_demo.py

/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ![v_star_logo](https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/main/assets/v_star_logo.png)
2 | 
3 | 
4 | [![V-STaR Paper](https://img.shields.io/badge/ArXiV%202025-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/pdf/2503.11495)
5 | [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue)](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard)
6 | [![Dataset Download](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset%20Download-blue)](https://huggingface.co/datasets/V-STaR-Bench/V-STaR)
7 | [![Project Page](https://img.shields.io/badge/VSTaR-Website-green?logo=googlechrome&logoColor=green)](https://v-star-bench.github.io/)
8 | [![Project Page](https://img.shields.io/badge/Collection-MLLM%20Reasoning-blue?logo=github&logoColor=)](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection)
9 | 
10 | 
11 | 
12 | This repository contains the implementation of the following paper and a series of related works in progress. We evaluate Video-LLMs!
13 | > **V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning**

14 | Zixu Cheng,   15 | Jian Hu+,   16 | Ziquan Liu,   17 | Chenyang Si,   18 | Wei Li,   19 | Shaogang Gong 20 |

21 | 
22 | 
23 | 
24 | ### Table of Contents
25 | - [Updates](#updates)
26 | - [Overview](#overview)
27 | - [Evaluation Results](#evaluation_results)
28 | - [Usage](#usage)
29 | - [Citation and Acknowledgement](#citation_and_acknowledgement)
30 | 
31 | 
32 | ## :mega: Updates
33 | - [03/2025] **MLLM reasoning collection released:** We’re excited to announce the release of the [Awesome-MLLM-Reasoning-Collection](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection) repository! 🚀 This collection curates papers, code, datasets, benchmarks, and resources dedicated to MLLM reasoning, providing a valuable resource for researchers and practitioners in the field. Check it out and contribute! 🔍📚💡
34 | - [03/2025] **Dataset released:** Our evaluation dataset is released.
35 | - [03/2025] **Code and leaderboard released:** Our V-STaR benchmark is released.
36 | 
37 | ## :mega: Overview
38 | ![Abstract](assets/vis.png)
39 | Overview of V-STaR. Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence and neglect relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on a pre-trained "memory" of co-occurrences as a bias when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the requirements for robust and consistent spatio-temporal reasoning.
40 | 
41 | 
42 | 
43 | ## :mortar_board: Evaluation Results
44 | 
45 | ***See our leaderboard for the most up-to-date ranking and numerical results (with models like GPT-4o, Gemini-2-Flash and Qwen2.5-VL)***. [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue)](https://v-star-bench-v-star-leaderboard.hf.space/)
46 | 
47 | We visualize the evaluation results of the 6 most recent top-performing Video-LLMs across 9 V-STaR domains.
48 | 
49 | We propose using the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. We calculate AM and LGM from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding and the "m_vIoU" of spatial grounding, and we report the mean AM (mAM) and mean LGM (mLGM) over the results of our two proposed RSTR question chains.
50 | 
51 | <p align="center">

52 | 53 |

54 | 
55 | ## :trophy: Leaderboard
56 | 
57 | See numerical results at our [Leaderboard](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard) :1st_place_medal::2nd_place_medal::3rd_place_medal:
58 | 
59 | #### **How to join the V-STaR Leaderboard?**
60 | 
61 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
62 | 
63 | #### Evaluation Criterion
64 | 
65 | To evaluate the open-ended *"what"* question, we use Qwen2.5-72B-Instruct to score answers from 0 to 3, denoting entirely incorrect, largely incorrect, largely correct, and entirely correct, respectively. Answers scoring 2 or above are considered correct, allowing us to compute accuracy.
66 | 
67 | For the *"when"* question, we follow the commonly used temporal grounding metrics R1@IoU=m, which refers to the percentage of top-1 predictions with a temporal IoU score larger than m, and the mean temporal IoU score (m\_tIoU).
68 | 
69 | For the *"where"* question, we use the Average Precision score (AP@vIoU=m) and the mean visual Intersection over Union (m\_vIoU) of every annotated frame. We use the proposed AM and LGM to measure a model's spatio-temporal reasoning ability. A higher LGM indicates better overall spatio-temporal reasoning ability, and a higher AM indicates better average performance across the three metrics. A short sketch of how these scores are aggregated is given below, just before the Citation section.
70 | 
71 | 
72 | ## Usage
73 | ##### Download the V-STaR dataset from HuggingFace
74 | 
75 | ```
76 | git lfs install
77 | git clone https://huggingface.co/datasets/V-STaR-Bench/V-STaR
78 | ```
79 | 
80 | ##### Use V-STaR to evaluate Video-LLMs
81 | 
82 | We provide our inference_demo.py script to test Qwen2.5-VL-7B with:
83 | 
84 | ```
85 | python inference_demo.py
86 | ```
87 | You can adapt the provided scripts to run your own Video-LLM on V-STaR and test its spatio-temporal reasoning ability.
88 | 
89 | To evaluate the results, update your result file path in the eval.py script and run:
90 | 
91 | ```
92 | python eval.py
93 | ```
94 | Note: you need at least 2 NVIDIA A100 80GB GPUs to run Qwen2.5-72B for evaluation.
95 | 
96 | ## Submit to Leaderboard
97 | 
98 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
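For reference, the aggregate scores reported on the leaderboard are computed from VQA accuracy and the per-chain m\_tIoU / m\_vIoU as in eval.py. Below is a minimal sketch of that aggregation; the helper name `aggregate_scores` is ours for illustration, and all inputs are assumed to lie in [0, 1).

```python
import math

def aggregate_scores(acc, m_tiou_1, m_viou_1, m_tiou_2, m_viou_2):
    """Combine VQA accuracy with each chain's temporal/spatial mean IoU (all in [0, 1))."""
    def am(a, t, v):
        # Arithmetic Mean of the three metrics
        return (a + t + v) / 3

    def lgm(a, t, v):
        # modified logarithmic Geometric Mean used in eval.py
        return -(math.log(1 - a) + math.log(1 - t) + math.log(1 - v)) / 3

    am1, am2 = am(acc, m_tiou_1, m_viou_1), am(acc, m_tiou_2, m_viou_2)
    lgm1, lgm2 = lgm(acc, m_tiou_1, m_viou_1), lgm(acc, m_tiou_2, m_viou_2)
    return {"mAM": (am1 + am2) / 2, "mLGM": (lgm1 + lgm2) / 2}

# Example with made-up metric values
print(aggregate_scores(acc=0.55, m_tiou_1=0.32, m_viou_1=0.21, m_tiou_2=0.28, m_viou_2=0.19))
```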
99 | 
100 | 
101 | 
102 | ## :black_nib: Citation
103 | 
104 | If you find our repo useful for your research, please consider citing our paper:
105 | 
106 | ```bibtex
107 | @misc{cheng2025vstarbenchmarkingvideollmsvideo,
108 |       title={V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning},
109 |       author={Zixu Cheng and Jian Hu and Ziquan Liu and Chenyang Si and Wei Li and Shaogang Gong},
110 |       year={2025},
111 |       eprint={2503.11495},
112 |       archivePrefix={arXiv},
113 |       primaryClass={cs.CV},
114 |       url={https://arxiv.org/abs/2503.11495},
115 | }
116 | ```
117 | 
--------------------------------------------------------------------------------
/assets/radar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/radar.png
--------------------------------------------------------------------------------
/assets/v_star_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/v_star_logo.png
--------------------------------------------------------------------------------
/assets/vis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis.png
--------------------------------------------------------------------------------
/assets/vis_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis_v2.png
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | # **Download the Dataset**
2 | 
3 | Video: Please download the video data from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
4 | 
5 | Annotation: You can download the annotation from this folder (`V_STaR_test.json`) or from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
6 | 
7 | ## Annotation Structure
8 | 
9 | Each annotation is organized as follows:
10 | 
11 | ```
12 | {
13 | "vid": ..., # Video ID
14 | "domain": ...,
15 | "fps": ...,
16 | "width": ...,
17 | "height": ...,
18 | "frame_count": ..., # total frame number of the video
19 | "question": ..., # VQA question
20 | "chain": "...", # spatio-temporal thinking chain
21 | "object": ..., # object of the boxes
22 | "answer": ...,
23 | "temporal_question": ..., # temporal grounding question
24 | "timestamps": [..., ...],
25 | "spatial_question": ..., # Chain 1 spatial grounding question
26 | "spatial_question_2": ..., # Chain 2 spatial grounding question
27 | "bboxes": [
28 | {"timestamp":...,
29 | "frame_index":...,
30 | "xmin": ...,
31 | "ymin": ...,
32 | "xmax": ...,
33 | "ymax": ...},
34 | ...
35 | ] 36 | } 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | import json 2 | import numpy as np 3 | from tqdm import tqdm 4 | import math 5 | 6 | from transformers import AutoModelForCausalLM, AutoTokenizer 7 | import ast 8 | 9 | model_name = "Qwen/Qwen2.5-72B-Instruct" 10 | 11 | model = AutoModelForCausalLM.from_pretrained( 12 | model_name, 13 | torch_dtype="auto", 14 | device_map="auto" 15 | ) 16 | tokenizer = AutoTokenizer.from_pretrained(model_name) 17 | 18 | 19 | system_prompt = """ 20 | As an AI assistant, your task is to evaluate a candidate answer in comparison to a given correct answer. 21 | The question itself, the correct 'groundtruth' answer, and the candidate answer will be provided to you. 22 | Your assessment should range from 0 to 3, \ 23 | based solely on the semantic similarity between the groundtruth and the candidate answer, \ 24 | disregarding any grammatical differences. 25 | A rating of 0 suggests no similarity, implying the candidate answer is entirely incorrect. 26 | A rating of 1 suggests low similarity, meaning the candidate answer is largely incorrect. 27 | A rating of 2 suggests high similarity, meaning the candidate answer is largely correct. 28 | Lastly, a rating of 3 indicates complete similarity, which means the candidate answer is entirely correct. 29 | Your response should be a single integer from 0, 1, 2, or 3. 30 | """ 31 | 32 | # tmpl = 'Groundtruth answer: {}\nCandidate answer: {}\nYour response: ' 33 | tmpl = 'Question: {}\nGroundtruth answer: {}\nCandidate answer: {}\nYour response: ' 34 | 35 | def qwen2_5_evaluation(question, gt, candidate): 36 | user_prompt=tmpl.format(question, gt, candidate) 37 | messages = [ 38 | {"role": "system", "content": system_prompt}, 39 | {"role": "user", "content": user_prompt} 40 | ] 41 | text = tokenizer.apply_chat_template( 42 | messages, 43 | tokenize=False, 44 | add_generation_prompt=True 45 | ) 46 | model_inputs = tokenizer([text], return_tensors="pt").to(model.device) 47 | 48 | generated_ids = model.generate( 49 | **model_inputs, 50 | max_new_tokens=512 51 | ) 52 | generated_ids = [ 53 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) 54 | ] 55 | 56 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 57 | score = response 58 | # print(score) 59 | # breakpoint() 60 | try: 61 | score = int(score) 62 | except (ValueError, TypeError): 63 | score = -1 64 | return score 65 | 66 | 67 | def calculate_temporal_iou(gt_range, pred_range): 68 | """ calculate Temporal IoU""" 69 | if not pred_range: 70 | return 0.0 71 | 72 | 73 | if isinstance(pred_range, str): 74 | try: 75 | pred_range = ast.literal_eval(pred_range) 76 | except (ValueError, SyntaxError): 77 | return 0.0 78 | 79 | if not isinstance(pred_range, (list, tuple)) or len(pred_range) != 2 or \ 80 | not all(isinstance(x, (int, float)) for x in pred_range): 81 | return 0.0 82 | 83 | gt_start, gt_end = gt_range 84 | pred_start, pred_end = pred_range 85 | intersection = max(0, min(gt_end, pred_end) - max(gt_start, pred_start)) 86 | union = max(gt_end, pred_end) - min(gt_start, pred_start) 87 | return intersection / union if union > 0 else 0.0 88 | 89 | 90 | def compute_iou(gt_bbox, pred_bbox): 91 | """calculate 2 bbox IoU""" 92 | if not isinstance(pred_bbox, (list, tuple)) or len(pred_bbox) != 4: 93 | return 0.0 94 | 95 | # GT bbox 96 | gt_xmin, 
gt_ymin, gt_xmax, gt_ymax = gt_bbox['xmin'], gt_bbox['ymin'], gt_bbox['xmax'], gt_bbox['ymax']
97 |     pred_xmin, pred_ymin, pred_xmax, pred_ymax = pred_bbox
98 | 
99 |     # Intersection
100 |     x1 = max(gt_xmin, pred_xmin)
101 |     y1 = max(gt_ymin, pred_ymin)
102 |     x2 = min(gt_xmax, pred_xmax)
103 |     y2 = min(gt_ymax, pred_ymax)
104 |     intersection = max(0, x2 - x1) * max(0, y2 - y1)
105 | 
106 |     # Union
107 |     gt_area = (gt_xmax - gt_xmin) * (gt_ymax - gt_ymin)
108 |     pred_area = (pred_xmax - pred_xmin) * (pred_ymax - pred_ymin)
109 |     union = gt_area + pred_area - intersection
110 | 
111 |     return intersection / union if union > 0 else 0.0
112 | 
113 | def calculate_bbox_iou(gt_bbox, pred_bboxes):
114 |     """Calculate the IoU for a single GT bbox, supporting multiple predicted boxes by taking the maximum IoU"""
115 |     try:
116 |         if not pred_bboxes:
117 |             return 0.0
118 | 
119 |         if isinstance(pred_bboxes[0], (int, float)) and len(pred_bboxes) == 4:
120 |             pred_bboxes = [pred_bboxes]
121 | 
122 |         return max([compute_iou(gt_bbox, pred_bbox) for pred_bbox in pred_bboxes])
123 |     except Exception:
124 |         return 0.0
125 | 
126 | def calculate_spatial_metrics(gt_bboxes, pred_bboxes):
127 |     """calculate Spatial IoU and mAP"""
128 |     if not pred_bboxes:
129 |         return [0.0] * 5, 0.0
130 | 
131 |     iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
132 |     ious = []
133 |     aps = []
134 |     for box in gt_bboxes:
135 |         frame_id = str(box["timestamp"])
136 |         if frame_id in pred_bboxes:
137 |             pred_bbox = pred_bboxes[frame_id]
138 |             gt_bbox = {
139 |                 "xmin": box["xmin"],
140 |                 "ymin": box["ymin"],
141 |                 "xmax": box["xmax"],
142 |                 "ymax": box["ymax"]
143 |             }
144 |             iou = calculate_bbox_iou(gt_bbox, pred_bbox)
145 |             ious.append(iou)
146 |         else:
147 |             ious.append(0.0)
148 |     mIoU = np.mean(ious) if ious else 0.0
149 | 
150 |     for threshold in iou_thresholds:
151 |         scores = [1 if iou >= threshold else 0 for iou in ious]
152 |         if len(ious) > 0:
153 |             aps.append(np.mean(scores))
154 |         else:
155 |             aps.append(0.0)
156 |     return aps, mIoU
157 | 
158 | def calculate_spatial_random(gt_bboxes, w, h):
159 |     """calculate Spatial IoU and mAP for a full-frame baseline box"""
160 |     pred_bbox = [0, 0, w, h]
161 |     iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
162 |     ious = []
163 |     aps = []
164 |     for gt_bbox_entry in gt_bboxes:
165 |         for frame_id, gt_bbox in gt_bbox_entry.items():
166 |             iou = calculate_bbox_iou(gt_bbox, pred_bbox)
167 |             ious.append(iou)
168 |     mIoU = np.mean(ious) if ious else 0.0
169 | 
170 |     for threshold in iou_thresholds:
171 |         scores = [1 if iou >= threshold else 0 for iou in ious]
172 |         if len(ious) > 0:
173 |             aps.append(np.mean(scores))
174 |         else:
175 |             aps.append(0.0)
176 |     return aps, mIoU
177 | 
178 | # evaluate the json file
179 | def evaluate_json(file_path):
180 |     with open(file_path, 'r') as f:
181 |         data = json.load(f)
182 | 
183 |     model_name = file_path.split("/")[-1].split("_")[0]
184 |     domains = {}
185 |     durations = {}
186 |     overall_stats = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
187 |                      "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
188 |                      "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
189 |                      "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
190 | 
191 | 
192 |     for idx, item in enumerate(tqdm(data, desc=f"Evaluating {model_name} results", unit="item")):
193 |         video_length =
round(item['frame_count']/item['fps'], 2) 194 | w, h = item['width'], item['height'] 195 | domain = item.get("domain", "unknown") 196 | if domain not in domains: 197 | domains[domain] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)], 198 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[], 199 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[], 200 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]} 201 | 202 | if video_length < 60: 203 | duration = "Short" 204 | elif 60 <= video_length < 180: 205 | duration = "Medium" 206 | else: 207 | duration = "Long" 208 | if duration not in durations: 209 | durations[duration] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)], 210 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[], 211 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[], 212 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]} 213 | 214 | if 'answer_vqa' in item and item['answer_vqa']: 215 | score = qwen2_5_evaluation(item['question'], item['answer'], item['answer_vqa']) 216 | else: 217 | continue 218 | overall_stats["all_rating"].append(score if score != -1 else 0) 219 | domains[domain]["all_rating"].append(score if score != -1 else 0) 220 | durations[duration]["all_rating"].append(score if score != -1 else 0) 221 | if score != -1: 222 | overall_stats["valid_rating"].append(score) 223 | domains[domain]["valid_rating"].append(score) 224 | durations[duration]["valid_rating"].append(score) 225 | if score >= 2: 226 | overall_stats["correct_num"] += 1 227 | domains[domain]["correct_num"] += 1 228 | durations[duration]["correct_num"] += 1 229 | data[idx]["VQA_score"] = score 230 | # answer_temporal 231 | if 'answer_temporal' in item and item['answer_temporal']: 232 | temporal_iou = calculate_temporal_iou(item['timestamps'], item['answer_temporal']) 233 | else: 234 | temporal_iou = 0.0 235 | 236 | overall_stats["temporal_ious"].append(temporal_iou) 237 | domains[domain]["temporal_ious"].append(temporal_iou) 238 | durations[duration]["temporal_ious"].append(temporal_iou) 239 | data[idx]["temporal_IoU"] = temporal_iou 240 | 241 | # answer_temporal_2 242 | if 'answer_temporal_2' in item and item['answer_temporal_2']: 243 | temporal_iou_2 = calculate_temporal_iou(item['timestamps'], item['answer_temporal_2']) 244 | else: 245 | temporal_iou_2 = 0.0 246 | 247 | overall_stats["temporal_ious_2"].append(temporal_iou_2) 248 | domains[domain]["temporal_ious_2"].append(temporal_iou_2) 249 | durations[duration]["temporal_ious_2"].append(temporal_iou_2) 250 | data[idx]["temporal_IoU_2"] = temporal_iou_2 251 | 252 | random_iou = calculate_temporal_iou(item['timestamps'],[0, video_length]) 253 | overall_stats["random_tious"].append(random_iou) 254 | domains[domain]["random_tious"].append(random_iou) 255 | durations[duration]["random_tious"].append(random_iou) 256 | 257 | # answer_spatial 258 | if 'answer_spatial' in item and item['answer_spatial']: 259 | aps, mIoU = calculate_spatial_metrics(item['bboxes'], 
item['answer_spatial']) 260 | else: 261 | aps, mIoU = [0.0] * 5, 0.0 262 | for i, ap in enumerate(aps): 263 | domains[domain]["spatial_aps"][i].append(ap) 264 | durations[duration]["spatial_aps"][i].append(ap) 265 | overall_stats["spatial_aps"][i].append(ap) 266 | domains[domain]["spatial_mious"].append(mIoU) 267 | durations[duration]["spatial_mious"].append(mIoU) 268 | overall_stats["spatial_mious"].append(mIoU) 269 | data[idx]["AP1@0.1:0.9"] = aps 270 | data[idx]["spatial_mIoU"] = mIoU 271 | 272 | # answer_spatial_2 273 | if 'answer_spatial_2' in item and item['answer_spatial_2']: 274 | aps_2, mIoU_2 = calculate_spatial_metrics(item['bboxes'], item['answer_spatial_2']) 275 | else: 276 | aps_2, mIoU_2 = [0.0] * 5, 0.0 277 | for i, ap in enumerate(aps_2): 278 | domains[domain]["spatial_aps_2"][i].append(ap) 279 | durations[duration]["spatial_aps_2"][i].append(ap) 280 | overall_stats["spatial_aps_2"][i].append(ap) 281 | domains[domain]["spatial_mious_2"].append(mIoU_2) 282 | durations[duration]["spatial_mious_2"].append(mIoU_2) 283 | overall_stats["spatial_mious_2"].append(mIoU_2) 284 | data[idx]["AP2@0.1:0.9"] = aps_2 285 | data[idx]["spatial_mIoU_2"] = mIoU_2 286 | 287 | 288 | random_aps, random_mIoU = calculate_spatial_random(item['bboxes'], w, h) 289 | for i, ap in enumerate(random_aps): 290 | domains[domain]["random_aps"][i].append(ap) 291 | durations[duration]["random_aps"][i].append(ap) 292 | overall_stats["random_aps"][i].append(ap) 293 | domains[domain]["random_vious"].append(random_mIoU) 294 | durations[duration]["random_vious"].append(random_mIoU) 295 | overall_stats["random_vious"].append(random_mIoU) 296 | 297 | with open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f: 298 | json.dump(data, f, indent=4) 299 | 300 | if score >= 2 and temporal_iou >= 0.3: 301 | domains[domain]["vqa_temporal_idx"].append(idx) 302 | durations[duration]["vqa_temporal_idx"].append(idx) 303 | overall_stats["vqa_temporal_idx"].append(idx) 304 | if score >= 2 and temporal_iou_2 >= 0.3: 305 | domains[domain]["vqa_temporal_idx_2"].append(idx) 306 | durations[duration]["vqa_temporal_idx_2"].append(idx) 307 | overall_stats["vqa_temporal_idx_2"].append(idx) 308 | if score >= 2 and mIoU >= 0.1: 309 | domains[domain]["vqa_spatial_idx"].append(idx) 310 | durations[duration]["vqa_spatial_idx"].append(idx) 311 | overall_stats["vqa_spatial_idx"].append(idx) 312 | if score >= 2 and mIoU_2 >= 0.1: 313 | domains[domain]["vqa_spatial_idx_2"].append(idx) 314 | durations[duration]["vqa_spatial_idx_2"].append(idx) 315 | overall_stats["vqa_spatial_idx_2"].append(idx) 316 | if temporal_iou >= 0.3 and mIoU >= 0.1: 317 | domains[domain]["temporal_spatial_idx"].append(idx) 318 | durations[duration]["temporal_spatial_idx"].append(idx) 319 | overall_stats["temporal_spatial_idx"].append(idx) 320 | if temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1: 321 | domains[domain]["temporal_spatial_idx_2"].append(idx) 322 | durations[duration]["temporal_spatial_idx_2"].append(idx) 323 | overall_stats["temporal_spatial_idx_2"].append(idx) 324 | if score >= 2 and temporal_iou >= 0.3 and mIoU >= 0.1: 325 | domains[domain]["vqa_temp_spatial_idx"].append(idx) 326 | durations[duration]["vqa_temp_spatial_idx"].append(idx) 327 | overall_stats["vqa_temp_spatial_idx"].append(idx) 328 | if score >= 2 and temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1: 329 | domains[domain]["vqa_temp_spatial_idx_2"].append(idx) 330 | durations[duration]["vqa_temp_spatial_idx_2"].append(idx) 331 | overall_stats["vqa_temp_spatial_idx_2"].append(idx) 332 | 333 | with 
open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f: 334 | json.dump(data, f, indent=4) 335 | 336 | def print_stats(label, stats, total_samples): 337 | avg_all_score = np.mean(stats["all_rating"]) 338 | avg_valid_score = np.mean(stats["valid_rating"]) if stats["valid_rating"] else 0 339 | acc_vqa = stats["correct_num"] / total_samples 340 | 341 | r1_iou30 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious"]]) 342 | r1_iou50 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious"]]) 343 | r1_iou70 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious"]]) 344 | mean_temporal_iou = np.mean(stats["temporal_ious"]) 345 | 346 | r1_iou30_2 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious_2"]]) 347 | r1_iou50_2 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious_2"]]) 348 | r1_iou70_2 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious_2"]]) 349 | mean_temporal_iou_2 = np.mean(stats["temporal_ious_2"]) 350 | 351 | mean_aps = [np.mean(ar_list) for ar_list in stats["spatial_aps"]] 352 | mean_miou = np.mean(stats["spatial_mious"]) 353 | 354 | mean_aps_2 = [np.mean(ar_list) for ar_list in stats["spatial_aps_2"]] 355 | mean_miou_2 = np.mean(stats["spatial_mious_2"]) 356 | 357 | 358 | vqa_temp = len(stats["vqa_temporal_idx"]) / total_samples 359 | vqa_temp_2 = len(stats["vqa_temporal_idx_2"]) / total_samples 360 | vqa_spat = len(stats["vqa_spatial_idx"]) / total_samples 361 | vqa_spat_2 = len(stats["vqa_spatial_idx_2"]) / total_samples 362 | temp_spat = len(stats["temporal_spatial_idx"]) / total_samples 363 | temp_spat_2 = len(stats["temporal_spatial_idx_2"]) / total_samples 364 | vqa_temp_spat = len(stats["vqa_temp_spatial_idx"]) / total_samples 365 | vqa_temp_spat_2 = len(stats["vqa_temp_spatial_idx_2"]) / total_samples 366 | 367 | print(f"{label}:") 368 | print(f"VQA: Avg All Score: {avg_all_score:.4f}, Avg Valid Score: {avg_valid_score:.4f}, Accuracy: {acc_vqa:.4f}") 369 | print("Chain 1:") 370 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30:.4f}, R1@IoU=0.5: {r1_iou50:.4f}, R1@IoU=0.7: {r1_iou70:.4f}, Mean IoU: {mean_temporal_iou:.4f}") 371 | print(f"Spatial Answer: mAP@0.1: {mean_aps[0]:.4f}, mAP@0.3: {mean_aps[1]:.4f}, mAP@0.5: {mean_aps[2]:.4f}, mAP@0.7: {mean_aps[3]:.4f}, mAP@0.9: {mean_aps[4]:.4f}, Mean mIoU: {mean_miou:.4f}") 372 | print("\n") 373 | print("Chain 2:") 374 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30_2:.4f}, R1@IoU=0.5: {r1_iou50_2:.4f}, R1@IoU=0.7: {r1_iou70_2:.4f}, Mean IoU: {mean_temporal_iou_2:.4f}") 375 | print(f"Spatial Answer: mAP@0.1: {mean_aps_2[0]:.4f}, mAP@0.3: {mean_aps_2[1]:.4f}, mAP@0.5: {mean_aps_2[2]:.4f}, mAP@0.7: {mean_aps_2[3]:.4f}, mAP@0.9: {mean_aps_2[4]:.4f}, Mean mIoU: {mean_miou_2:.4f}") 376 | print("\n") 377 | 378 | AM = (acc_vqa + mean_temporal_iou + mean_miou)/3 379 | AM2 = (acc_vqa + mean_temporal_iou_2 + mean_miou_2)/3 380 | mAM = (AM + AM2) / 2 381 | 382 | LGM = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou) + math.log(1 - mean_miou)) / 3 383 | LGM2 = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou_2) + math.log(1 - mean_miou_2)) / 3 384 | mLGM = (LGM + LGM2) / 2 385 | 386 | print(f"AM1:{AM:.4f}, AM2:{AM2:.4f}, mAM:{mAM:.4f}") 387 | print(f"LGM1:{LGM:.4f}, LGM2:{LGM2:.4f}, mLGM:{mLGM:.4f}\n") 388 | 389 | print("Combined resutls:") 390 | print(f"VQA & Temp: Chain 1: {vqa_temp:.4f}, Chain 2: {vqa_temp_2:.4f}") 391 | print(f"VQA & Spat: Chain 1: {vqa_spat:.4f} Chain 2: {vqa_spat_2:.4f}") 392 | print(f"Temp & Spat: Chain 1: 
{temp_spat:.4f} Chain 2: {temp_spat_2:.4f}") 393 | print(f"VQA & Temp & Spat: Chain 1:{vqa_temp_spat:.4f} Chain 2: {vqa_temp_spat_2:.4f}") 394 | print(f"VQA & Temp list: \n Chain 1:{stats['vqa_temporal_idx']} \nChain 2:{stats['vqa_temporal_idx_2']}") 395 | print(f"VQA & Spat list: \n Chain 1:{stats['vqa_spatial_idx']} \n Chain 2: {stats['vqa_spatial_idx_2']}") 396 | print(f"Temp & Spat list: \n Chain 1:{stats['temporal_spatial_idx']} \n Chain 2: {stats['temporal_spatial_idx_2']}") 397 | print(f"VQA & Temp & Spat list: \n Chain 1:{stats['vqa_temp_spatial_idx']} \n Chain 2:{stats['vqa_temp_spatial_idx_2']}\n") 398 | 399 | print_stats("Overall Statistics", overall_stats, len(data)) 400 | for duration, stats in durations.items(): 401 | print_stats(f"Video Length: {duration}", stats, len(stats["all_rating"])) 402 | for domain, stats in domains.items(): 403 | print_stats(f"Domain: {domain}", stats, len(stats["all_rating"])) 404 | 405 | 406 | # print("\nEvaluating GPT-4o:\n") 407 | # evaluate_json('results/gpt4o/gpt4o_answer_update_merged.json') 408 | 409 | # print("\nEvaluating Gemini-2-Flash:\n") 410 | # evaluate_json('results/gemini2/gemini2_answer_merged.json') 411 | 412 | # print("\nEvaluating Video-Llama3:\n") 413 | # evaluate_json('results/videollama3/videollama3_answer_merged.json') 414 | 415 | print("\nEvaluating Qwen2.5-VL:") 416 | evaluate_json('results/qwen2_5/qwen2-5-32B_answer_update.json') 417 | 418 | # print("\nEvaluating InternVL-2.5-38B:") 419 | # evaluate_json('results/internvl2_5/internvl2-5-38B_answer.json') 420 | 421 | # print("\nEvaluating InternVL-2.5:") 422 | # evaluate_json('results/internvl2_5/internvl2_5_answer_merged.json') 423 | 424 | # print("\nEvaluating Llava-Video:") 425 | # evaluate_json('results/llava-video/llavavideo_answer_merged.json') 426 | 427 | # print("\nEvaluating Qwen2-VL:") 428 | # evaluate_json('results/qwen2/qwen2vl_answer_merged.json') 429 | 430 | # print("\nEvaluating VideoChat2:") 431 | # evaluate_json('results/videochat2/videochat2_answer_merged.json') 432 | 433 | # print("\nEvaluating Oryx-1.5:") 434 | # evaluate_json('results/oryx-1.5/oryx15_answer_merged.json') 435 | 436 | # print("\nEvaluating VideoCCAM:") 437 | # evaluate_json('results/videoccam/videoccam12_answer_merged.json') 438 | 439 | # print("\nEvaluating TimeChat:") 440 | # evaluate_json('results/timechat/timechat_answer_merged.json') 441 | 442 | # print("\nEvaluating VTimeLLM:") 443 | # evaluate_json('results/vtimellm/vtimellm_answer_merged.json') 444 | 445 | # print("\nEvaluating Trace:") 446 | # evaluate_json('results/trace/trace_answer_merged.json') 447 | 448 | # print("\nEvaluating Sa2VA:") 449 | # evaluate_json('results/sa2va/sa2va_answer_merged.json') 450 | 451 | 452 | -------------------------------------------------------------------------------- /inference_demo.py: -------------------------------------------------------------------------------- 1 | import cv2 # We're using OpenCV to read video, to install !pip install opencv-python 2 | import base64 3 | import time 4 | import os 5 | import requests 6 | from tqdm import tqdm 7 | import json 8 | import re 9 | import torch 10 | import math 11 | 12 | from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor 13 | from qwen_vl_utils import process_vision_info 14 | 15 | model_path = "Qwen/Qwen2.5-VL-7B-Instruct" 16 | 17 | model = Qwen2_5_VLForConditionalGeneration.from_pretrained( 18 | model_path, 19 | torch_dtype=torch.bfloat16, 20 | attn_implementation="flash_attention_2", 21 | device_map="auto" 22 | ) 23 | 
processor = AutoProcessor.from_pretrained(model_path)
24 | 
25 | def inference(video_path, prompt, max_new_tokens=2048, total_pixels=20480 * 28 * 28, min_pixels=16 * 28 * 28):
26 |     messages = [
27 |         {"role": "system", "content": "You are a helpful assistant."},
28 |         {"role": "user", "content": [
29 |             {"type": "text", "text": prompt},
30 |             {"type": "video", "video": video_path, "total_pixels": total_pixels, "min_pixels": min_pixels, "fps": 1.0},
31 |         ]
32 |         },
33 |     ]
34 |     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
35 |     image_inputs, video_inputs, video_kwargs = process_vision_info([messages], return_video_kwargs=True)
36 |     fps_inputs = video_kwargs['fps']
37 |     print("video input:", video_inputs[0].shape)
38 |     num_frames, _, resized_height, resized_width = video_inputs[0].shape
39 |     # print("num of video tokens:", int(num_frames / 2 * resized_height / 28 * resized_width / 28))
40 |     inputs = processor(text=[text], images=image_inputs, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt")
41 |     inputs = inputs.to('cuda')
42 | 
43 |     output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
44 |     generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
45 |     output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
46 |     return output_text[0], resized_height, resized_width
47 | 
48 | def read_anno(anno_file):
49 |     with open(anno_file, 'r') as f:
50 |         data = json.load(f)
51 |     return data
52 | 
53 | def find_video(video_folder, vid):
54 |     """
55 |     Finds the vid.mp4 file in the video_folder and its subfolders.
56 | 
57 |     Args:
58 |         video_folder (str): path of the folder to search.
59 |         vid (str): the filename of the video (without extension).
60 | 
61 |     Returns:
62 |         str: absolute path to the vid.mp4 file, or None if not found.
63 |     """
64 |     target_filename = f"{vid}.mp4"
65 |     for root, _, files in os.walk(video_folder):
66 |         if target_filename in files:
67 |             return os.path.join(root, target_filename)
68 |     return None
69 | 
70 | def get_answer_vqa(data, video_path):
71 |     prompt = f"Answer the question about the video: {data['question']} \n (If the answer is a person, you don't need to identify the person.)"
72 |     answer_vqa, _, _ = inference(video_path, prompt)
73 |     return answer_vqa
74 | 
75 | def get_answer_temporal(data, video_path):
76 |     video_length = round(data['frame_count']/data['fps'], 2)
77 |     temporal_question = data['temporal_question']
78 |     prompt = f"This video is {video_length} seconds long. Answer the question about the video: {temporal_question} \n Output the start and end moment timestamps."
79 |     answer_temporal, _, _ = inference(video_path, prompt)
80 |     return answer_temporal
81 | 
82 | def get_answer_temporal_2(data, video_path, bboxes):
83 |     video_length = round(data['frame_count']/data['fps'], 2)
84 |     temporal_question = data['temporal_question']
85 |     w, h = data['width'], data['height']
86 |     prompt = f"This video is {video_length} seconds long with a resolution of {w}x{h} (width x height). Answer the question about the video: {temporal_question} \n There are {len(bboxes)} bounding boxes of the key object related to the question in the video without knowing the time, which are:{bboxes}.
Output the start and end moment timestamps."
87 |     answer_temporal, _, _ = inference(video_path, prompt)
88 |     return answer_temporal
89 | 
90 | def get_answer_spatial(data, video_path):
91 |     video_length = round(data['frame_count']/data['fps'], 2)
92 |     st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
93 |     time_range = list(range(st, et + 1))
94 |     w, h = data['width'], data['height']
95 |     spatial_question = data['spatial_question']
96 |     prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
97 | For each whole second within the time range {time_range} provided (inclusive of the boundaries), output a series of bounding boxes of the object in JSON format. The keys should be the whole seconds (as strings), and the values should be the box in [x1, y1, x2, y2] format.
98 | Example output: {{"{time_range[0]}": [x1, y1, x2, y2],...}}
99 | """
100 |     answer_spatial, input_height, input_width = inference(video_path, prompt)
101 |     return answer_spatial, input_height, input_width
102 | 
103 | def get_answer_spatial_2(data, video_path, bboxes):
104 |     video_length = round(data['frame_count']/data['fps'], 2)
105 |     st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
106 |     time_range = list(range(st, et + 1))
107 |     w, h = data['width'], data['height']
108 |     spatial_question = data['spatial_question_2']
109 |     prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
110 | For each whole second that may be related to the question, output a series of bounding boxes of the object in JSON format. You only need to output {len(bboxes)} bbox(es). You need to determine which frame is related to the question, and you don't need to output the bbox for the frames not related to the question.
111 | The keys should be the whole seconds (as strings), and the values should be the bounding box in [x0,y0,x1,y1] format.
112 | \n Example output: 113 | {{"0": [x0,y0,x1,y1], "1":..., ..., "{len(bboxes)}":...}} (if the frames at 0~{len(bboxes)} second are related to the questions) 114 | """ 115 | answer_spatial, input_height, input_width = inference(video_path, prompt) 116 | return answer_spatial, input_height, input_width 117 | 118 | def extract_timestamps(result): 119 | """extract timestamps from the answer""" 120 | match = re.findall(r"\b\d+(?:\.\d+)?\b", result) 121 | return [float(match[0]), float(match[1])] if len(match) == 2 else [] 122 | 123 | 124 | def fix_incomplete_json(json_str): 125 | """ 126 | fix the incomplete brackets of the json 127 | """ 128 | # Counting left and right brackets 129 | open_square = json_str.count('[') 130 | close_square = json_str.count(']') 131 | open_curly = json_str.count('{') 132 | close_curly = json_str.count('}') 133 | 134 | # Complete the square brackets 135 | if open_square > close_square: 136 | json_str += ']' * (open_square - close_square) 137 | elif close_square > open_square: 138 | json_str = '[' * (close_square - open_square) + json_str 139 | 140 | # Complete the curly brackets 141 | if open_curly > close_curly: 142 | json_str += '}' * (open_curly - close_curly) 143 | elif close_curly > open_curly: 144 | json_str = '{' * (close_curly - open_curly) + json_str 145 | 146 | return json_str 147 | 148 | 149 | def extract_bounding_boxes(answer_spatial, data, input_height, input_width): 150 | """ 151 | Extract bounding boxes from the input answer_spatial and denormalize the coordinates using the width and height from the data. 152 | """ 153 | w, h = data['width'], data['height'] 154 | 155 | def denormalize_bbox(bbox): 156 | """ 157 | denormalize the coordinates of bbox 158 | """ 159 | try: 160 | if len(bbox) == 1: 161 | bbox = bbox[0] 162 | if len(bbox) == 2: 163 | bbox = bbox[1] 164 | x_min = int(bbox[0] / input_width * w) 165 | y_min = int(bbox[1] / input_height * h) 166 | x_max = int(bbox[2] / input_width * w) 167 | y_max = int(bbox[3] / input_height * h) 168 | return [x_min, y_min, x_max, y_max] 169 | except Exception as e: 170 | print(f"Processing {bbox} occurs Error {e}") 171 | return bbox 172 | 173 | # match markdown json 174 | markdown_pattern = r'```json\s*\n(\[.*?\]|\{.*?\})\s*\n```' 175 | match = re.search(markdown_pattern, answer_spatial, re.DOTALL) 176 | if not match: 177 | # If there is no Markdown wrapper, then try to match the JSON format directly 178 | json_pattern = r'(\[[\s\S]*\]|\{[\s\S]*\})' 179 | match = re.search(json_pattern, answer_spatial, re.DOTALL) 180 | if match: 181 | # match bbox in JSON 182 | bounding_boxes_str = match.group(1).strip() 183 | # Replace single quotes with double quotes to conform to the JSON specification 184 | bounding_boxes_str = bounding_boxes_str.replace("'", '"') 185 | try: 186 | # Convert strings to dictionary or list format 187 | bounding_boxes = json.loads(bounding_boxes_str) 188 | # If it's a list and contains a dictionary inside, expand it to a single dictionary 189 | if isinstance(bounding_boxes, list) and all(isinstance(item, dict) for item in bounding_boxes): 190 | combined_dict = {} 191 | for item in bounding_boxes: 192 | combined_dict.update(item) 193 | bounding_boxes = combined_dict 194 | # Determine if the extracted JSON is a dictionary or a list. 
195 | if isinstance(bounding_boxes, list): 196 | # bounding boxes in list 197 | return {str(box[0]): box[1] for box in bounding_boxes} 198 | elif isinstance(bounding_boxes, dict): 199 | # bounding boxes in dictionary 200 | return {key: value for key, value in bounding_boxes.items()} 201 | except Exception as e: 202 | # if failed, try to fix it. 203 | fixed_bounding_boxes_str = fix_incomplete_json(bounding_boxes_str) 204 | try: 205 | bounding_boxes = json.loads(fixed_bounding_boxes_str) 206 | if isinstance(bounding_boxes, list): 207 | return [box for box in bounding_boxes] 208 | elif isinstance(bounding_boxes, dict): 209 | return {key: value for key, value in bounding_boxes.items()} 210 | except Exception as e: 211 | print(f"Failed after fixing: {e}\nExtracted JSON: {fixed_bounding_boxes_str}") 212 | return None 213 | else: 214 | print("No match found for the bounding box JSON.") 215 | return None 216 | 217 | def test_qwen2_5vl(video_folder, anno_file, result_file): 218 | anno = read_anno(anno_file) 219 | 220 | for idx, data in enumerate(tqdm(anno, desc="Processing videos", unit="video")): 221 | try: 222 | vid = data['vid'] 223 | timestamps = data['timestamps'] 224 | video_length = round(data['frame_count']/data['fps'], 1) 225 | boxes = [[box_data["xmin"], box_data["ymin"], box_data["xmax"], box_data["ymax"]] \ 226 | for box in data["bboxes"] for box_data in box.values()] 227 | video_path = find_video(video_folder, vid) 228 | answer_vqa = get_answer_vqa(data, video_path) 229 | # chain one 230 | answer_temporal = get_answer_temporal(data, video_path) 231 | answer_temporal_post = extract_timestamps(answer_temporal) 232 | 233 | answer_spatial, input_height, input_width = get_answer_spatial(data, video_path) 234 | answer_spatial_post = extract_bounding_boxes(answer_spatial, data, input_height, input_width) 235 | 236 | # chain two 237 | answer_spatial_2, input_height, input_width = get_answer_spatial_2(data, video_path, boxes) 238 | answer_spatial_post_2 = extract_bounding_boxes(answer_spatial_2, data, input_height, input_width) 239 | 240 | answer_temporal_2 = get_answer_temporal_2(data, video_path, boxes) 241 | answer_temporal_post_2 = extract_timestamps(answer_temporal_2) 242 | 243 | # update data 244 | data['answer_vqa'] = answer_vqa 245 | data['answer_temporal_pre'] = answer_temporal 246 | data['answer_temporal'] = answer_temporal_post 247 | data['answer_spatial_pre'] = answer_spatial 248 | data['answer_spatial'] = answer_spatial_post 249 | 250 | data['answer_spatial_pre_2'] = answer_spatial_2 251 | data['answer_spatial_2'] = answer_spatial_post_2 252 | data['answer_temporal_pre_2'] = answer_temporal_2 253 | data['answer_temporal_2'] = answer_temporal_post_2 254 | # update result_file 255 | with open(result_file, 'w') as f: 256 | json.dump(anno, f, indent=4) 257 | except Exception as e: 258 | print("") 259 | print(f"ERROR in data {idx}: {e}.") 260 | continue 261 | 262 | 263 | if __name__ == "__main__": 264 | video_folder = "/Path/to/video/folder" 265 | anno_file = "/path/to/anno/file.json" 266 | result_file = "/path/to/result/file.json" 267 | test_qwen2_5vl(video_folder, anno_file, result_file) 268 | --------------------------------------------------------------------------------
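A note on how the two scripts fit together: inference_demo.py appends the model's raw and parsed answers to each annotation entry and rewrites the result JSON after every item, and eval.py then reads those fields. The sketch below shows the shape of one result entry; the field names come from the scripts above, while the concrete values are invented for illustration.

```python
# Shape of one entry in the result file written by inference_demo.py (illustrative values only).
result_entry = {
    # All original annotation fields ("vid", "question", "timestamps", "bboxes", ...) are kept as-is.
    "answer_vqa": "The person picks up the red cup.",        # free-form answer, scored 0-3 by Qwen2.5-72B-Instruct in eval.py
    "answer_temporal_pre": "The event spans 3.0 to 7.5 seconds.",  # raw model output for the "when" question
    "answer_temporal": [3.0, 7.5],                           # parsed [start, end], compared with "timestamps" via temporal IoU
    "answer_spatial_pre": '```json\n{"3": [120, 80, 260, 300]}\n```',  # raw model output for the "where" question
    "answer_spatial": {"3": [120, 80, 260, 300]},            # parsed {second: [x1, y1, x2, y2]}, matched against "bboxes"
    # Chain 2 mirrors chain 1 with a *_2 suffix (answer_temporal_2, answer_spatial_2, ...).
    "answer_temporal_2": [3.0, 7.5],
    "answer_spatial_2": {"3": [118, 82, 255, 298]},
}
```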