├── .gitignore
├── README.md
├── assets
│   ├── radar.png
│   ├── v_star_logo.png
│   ├── vis.png
│   └── vis_v2.png
├── data
│   ├── README.md
│   └── V_STaR_test.json
├── eval.py
└── inference_demo.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 |
4 | [Paper](https://arxiv.org/pdf/2503.11495)
5 | [Leaderboard](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard)
6 | [Dataset](https://huggingface.co/datasets/V-STaR-Bench/V-STaR)
7 | [Project Page](https://v-star-bench.github.io/)
8 | [Awesome MLLM Reasoning Collection](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection)
9 |
10 |
11 |
12 | This repository contains the implementation of the following paper and its related follow-up works in progress. We evaluate Video-LLMs!
13 | > **V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning**
14 | Zixu Cheng,
15 | Jian Hu+,
16 | Ziquan Liu,
17 | Chenyang Si,
18 | Wei Li,
19 | Shaogang Gong
20 |
21 |
22 |
23 |
24 | ### Table of Contents
25 | - [Updates](#updates)
26 | - [Overview](#overview)
27 | - [Evaluation Results](#evaluation_results)
28 | - [Usage](#usage)
29 | - [Citation and Acknowledgement](#citation_and_acknowledgement)
30 |
31 |
32 | ## :mega: Updates
33 | - [03/2025] **MLLM reasoning collections released:** We’re excited to announce the release of the [Awesome-MLLM-Reasoning-Collection](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection) repository! 🚀 This collection curates papers, code, datasets, benchmarks, and resources dedicated to MLLM reasoning, providing a valuable resource for researchers and practitioners in the field. Check it out and contribute! 🔍📚💡
34 | - [03/2025] **Dataset released:** Our evaluation dataset is released.
35 | - [03/2025] **Code and leaderboard released:** Our V-STaR benchmark is released.
36 |
37 | ## :mega: Overview
38 | 
39 | Overview of V-STaR. Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as a bias when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the requirements for robust and consistent spatio-temporal reasoning.
40 |
41 |
42 |
43 | ## :mortar_board: Evaluation Results
44 |
45 | ***See our [leaderboard](https://v-star-bench-v-star-leaderboard.hf.space/) for the most up-to-date ranking and numerical results (with models such as GPT-4o, Gemini-2-Flash and Qwen2.5-VL).***
46 |
47 | We visualize the evaluation results of the 6 most recent top-performing Video-LLMs across 9 V-STaR domains.
48 |
49 | We propose using the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. We calculate AM and LGM from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding and the "m_vIoU" of spatial grounding, and report the mean AM (mAM) and mean LGM (mLGM) over our two proposed RSTR question chains.
50 |
51 |
52 |
53 |
54 |
55 | ## :trophy: Leaderboard
56 |
57 | See numeric values at our [Leaderboard](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard) :1st_place_medal::2nd_place_medal::3rd_place_medal:
58 |
59 | #### **How to join the V-STaR Leaderboard?**
60 |
61 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
62 |
63 | #### Evaluation Criterion
64 |
65 | To evaluate the open-ended *"what"* question, we use Qwen2.5-72B-Instruct to score answers from 0 to 3, denoting entirely incorrect, largely incorrect, largely correct, and entirely correct, respectively. Answers scoring 2 or above are considered correct, allowing us to compute accuracy.
66 |
67 | For the *"when"* question, we follow the commonly used temporal grounding metrics: R@n,IoU=m, which refers to the percentage of top-n predictions with a temporal IoU score larger than m (we report R1, i.e. n=1, at m = 0.3, 0.5, 0.7), and the mean temporal IoU score (m\_tIoU).
68 |
69 | For the *"where"* question, we use the Average Precision score (AP@vIoU=m) and the mean visual Intersection over Union (m\_vIoU) over every annotated frame. We then use the proposed AM and LGM to measure a model's spatio-temporal reasoning ability: a higher LGM indicates better overall spatio-temporal reasoning, and a higher AM indicates a better average performance across the three metrics. The exact combination used by `eval.py` is given below.
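Concretely, with Accuracy, m\_tIoU and m\_vIoU expressed as fractions in [0, 1), `eval.py` computes for each chain:

$$
\mathrm{AM} = \frac{\text{Acc} + \text{m\_tIoU} + \text{m\_vIoU}}{3}, \qquad
\mathrm{LGM} = -\frac{1}{3}\Big(\ln(1-\text{Acc}) + \ln(1-\text{m\_tIoU}) + \ln(1-\text{m\_vIoU})\Big)
$$

mAM and mLGM are the averages of AM and LGM over the two RSTR question chains.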
70 |
71 |
72 | ## Usage
73 | ##### Download the V-STaR dataset from Hugging Face
74 |
75 | ```bash
76 | git lfs install
77 | git clone https://huggingface.co/datasets/V-STaR-Bench/V-STaR
78 | ```
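Alternatively, here is a minimal sketch of downloading the dataset with the `huggingface_hub` Python API (assuming `huggingface_hub` is installed; the local path is up to you):

```python
# Sketch: fetch the V-STaR dataset repository via the Hugging Face Hub client.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="V-STaR-Bench/V-STaR",  # dataset repo linked above
    repo_type="dataset",
)
print("V-STaR downloaded to:", local_dir)
```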
79 |
80 | ##### Use V-STaR to evaluate Video-LLMs
81 |
82 | We provide the `inference_demo.py` script to test Qwen2.5-VL-7B on V-STaR:
83 |
84 | ```bash
85 | python inference_demo.py
86 | ```
87 | You can adapt the provided script to run your own Video-LLM on V-STaR and test its spatio-temporal reasoning ability; the sketch below shows the answer fields that `eval.py` expects in the result file.
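A minimal sketch of one result item: it is the original annotation entry with the model's answers added (field names follow `inference_demo.py`; the values below are placeholders):

```python
# Sketch of one item in the result JSON consumed by eval.py (placeholder values).
result_item = {
    # ... all original annotation fields (vid, question, timestamps, bboxes, ...) ...
    "answer_vqa": "The man opens the door.",           # open-ended answer, judged by Qwen2.5-72B-Instruct
    "answer_temporal": [12.0, 18.5],                   # Chain 1 temporal answer: [start, end] in seconds
    "answer_spatial": {"13": [220, 140, 380, 420]},    # Chain 1 spatial answer: {second: [x1, y1, x2, y2]}
    "answer_temporal_2": [12.0, 18.5],                 # Chain 2 temporal answer
    "answer_spatial_2": {"13": [220, 140, 380, 420]},  # Chain 2 spatial answer
}
```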
88 |
89 | To evaluate the results, update your result file path in the eval.py script and run:
90 |
92 | ```bash
92 | python eval.py
93 | ```
94 | Note: you need at least 2 NVIDIA A100 80G GPUs to run Qwen2.5-72B-Instruct for evaluation.
95 |
96 | ## Submit to Leaderboard
97 |
98 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
99 |
100 |
101 |
102 | ## :black_nib: Citation
103 |
104 | If you find our repo useful for your research, please consider citing our paper:
105 |
106 | ```bibtex
107 | @misc{cheng2025vstarbenchmarkingvideollmsvideo,
108 | title={V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning},
109 | author={Zixu Cheng and Jian Hu and Ziquan Liu and Chenyang Si and Wei Li and Shaogang Gong},
110 | year={2025},
111 | eprint={2503.11495},
112 | archivePrefix={arXiv},
113 | primaryClass={cs.CV},
114 | url={https://arxiv.org/abs/2503.11495},
115 | }
116 | ```
117 |
--------------------------------------------------------------------------------
/assets/radar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/radar.png
--------------------------------------------------------------------------------
/assets/v_star_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/v_star_logo.png
--------------------------------------------------------------------------------
/assets/vis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis.png
--------------------------------------------------------------------------------
/assets/vis_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis_v2.png
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | # **Download the Dataset**
2 |
3 | Video: Please download the video data from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
4 |
5 | Annotation: You can download the annotation (`V_STaR_test.json`) from this folder or from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
6 |
7 | ## Annotation Structure
8 |
9 | Each annotation is organized as follows:
10 |
11 | ```
12 | {
13 | "vid": ..., # Video ID
14 | "domain": ...,
15 | "fps": ...,
16 | "width": ...,
17 | "height": ...,
18 | "frame_count": ..., # total frame number of the video
19 | "question": ..., # VQA question
20 | "chain": "...", # spatio-temporal thinking chain
21 | "object": ..., # object of the boxes
22 | "answer": ...,
23 | "temporal_question": ..., # temporal grounding question
24 | "timestamps": [..., ...],
25 | "spatial_question": ..., # Chain 1 spatial grounding question
26 | "spatial_question_2": ..., # Chain 2 spatial grounding question
27 | "bboxes": [
28 | {"timestamp":...,
29 | "frame_index":...,
30 | "xmin": ...,
31 | "ymin": ...,
32 | "xmax": ...,
33 | "ymax": ...},
34 | ...
35 | ]
36 | }
37 | ```
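A minimal sketch of loading and inspecting the annotation file (assuming it is saved as `V_STaR_test.json` in this folder):

```python
import json

# The annotation file is a JSON list of items with the structure shown above.
with open("V_STaR_test.json", "r") as f:
    annotations = json.load(f)

item = annotations[0]
print(item["vid"], item["domain"], item["question"])
print("temporal window:", item["timestamps"])
print("number of annotated boxes:", len(item["bboxes"]))
```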
38 |
39 |
--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | import json
2 | import numpy as np
3 | from tqdm import tqdm
4 | import math
5 |
6 | from transformers import AutoModelForCausalLM, AutoTokenizer
7 | import ast
8 |
9 | model_name = "Qwen/Qwen2.5-72B-Instruct"
10 |
11 | model = AutoModelForCausalLM.from_pretrained(
12 | model_name,
13 | torch_dtype="auto",
14 | device_map="auto"
15 | )
16 | tokenizer = AutoTokenizer.from_pretrained(model_name)
17 |
18 |
19 | system_prompt = """
20 | As an AI assistant, your task is to evaluate a candidate answer in comparison to a given correct answer.
21 | The question itself, the correct 'groundtruth' answer, and the candidate answer will be provided to you.
22 | Your assessment should range from 0 to 3, \
23 | based solely on the semantic similarity between the groundtruth and the candidate answer, \
24 | disregarding any grammatical differences.
25 | A rating of 0 suggests no similarity, implying the candidate answer is entirely incorrect.
26 | A rating of 1 suggests low similarity, meaning the candidate answer is largely incorrect.
27 | A rating of 2 suggests high similarity, meaning the candidate answer is largely correct.
28 | Lastly, a rating of 3 indicates complete similarity, which means the candidate answer is entirely correct.
29 | Your response should be a single integer from 0, 1, 2, or 3.
30 | """
31 |
32 | # tmpl = 'Groundtruth answer: {}\nCandidate answer: {}\nYour response: '
33 | tmpl = 'Question: {}\nGroundtruth answer: {}\nCandidate answer: {}\nYour response: '
34 |
35 | def qwen2_5_evaluation(question, gt, candidate):
36 | user_prompt=tmpl.format(question, gt, candidate)
37 | messages = [
38 | {"role": "system", "content": system_prompt},
39 | {"role": "user", "content": user_prompt}
40 | ]
41 | text = tokenizer.apply_chat_template(
42 | messages,
43 | tokenize=False,
44 | add_generation_prompt=True
45 | )
46 | model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
47 |
48 | generated_ids = model.generate(
49 | **model_inputs,
50 | max_new_tokens=512
51 | )
52 | generated_ids = [
53 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
54 | ]
55 |
56 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
57 | score = response
58 | # print(score)
59 | # breakpoint()
60 | try:
61 | score = int(score)
62 | except (ValueError, TypeError):
63 | score = -1
64 | return score
65 |
66 |
67 | def calculate_temporal_iou(gt_range, pred_range):
68 | """ calculate Temporal IoU"""
69 | if not pred_range:
70 | return 0.0
71 |
72 |
73 | if isinstance(pred_range, str):
74 | try:
75 | pred_range = ast.literal_eval(pred_range)
76 | except (ValueError, SyntaxError):
77 | return 0.0
78 |
79 | if not isinstance(pred_range, (list, tuple)) or len(pred_range) != 2 or \
80 | not all(isinstance(x, (int, float)) for x in pred_range):
81 | return 0.0
82 |
83 | gt_start, gt_end = gt_range
84 | pred_start, pred_end = pred_range
85 | intersection = max(0, min(gt_end, pred_end) - max(gt_start, pred_start))
86 | union = max(gt_end, pred_end) - min(gt_start, pred_start)
87 | return intersection / union if union > 0 else 0.0
88 |
89 |
90 | def compute_iou(gt_bbox, pred_bbox):
91 | """calculate 2 bbox IoU"""
92 | if not isinstance(pred_bbox, (list, tuple)) or len(pred_bbox) != 4:
93 | return 0.0
94 |
95 | # GT bbox
96 | gt_xmin, gt_ymin, gt_xmax, gt_ymax = gt_bbox['xmin'], gt_bbox['ymin'], gt_bbox['xmax'], gt_bbox['ymax']
97 | pred_xmin, pred_ymin, pred_xmax, pred_ymax = pred_bbox
98 |
99 | # Intersection
100 | x1 = max(gt_xmin, pred_xmin)
101 | y1 = max(gt_ymin, pred_ymin)
102 | x2 = min(gt_xmax, pred_xmax)
103 | y2 = min(gt_ymax, pred_ymax)
104 | intersection = max(0, x2 - x1) * max(0, y2 - y1)
105 |
106 | # Union
107 | gt_area = (gt_xmax - gt_xmin) * (gt_ymax - gt_ymin)
108 | pred_area = (pred_xmax - pred_xmin) * (pred_ymax - pred_ymin)
109 | union = gt_area + pred_area - intersection
110 |
111 | return intersection / union if union > 0 else 0.0
112 |
113 | def calculate_bbox_iou(gt_bbox, pred_bboxes):
114 | """Calculate single BBox IoU, support multiple prediction frames to get maximum IoU"""
115 | try:
116 | if not pred_bboxes:
117 | return 0.0
118 |
119 | if isinstance(pred_bboxes[0], (int, float)) and len(pred_bboxes) == 4:
120 | pred_bboxes = [pred_bboxes]
121 |
122 | return max([compute_iou(gt_bbox, pred_bbox) for pred_bbox in pred_bboxes])
123 | except:
124 | return 0.0
125 |
126 | def calculate_spatial_metrics(gt_bboxes, pred_bboxes):
127 | """calculate Spatial IoU and mAP"""
128 | if not pred_bboxes:
129 | return [0.0] * 5, 0.0
130 |
131 | iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
132 | ious = []
133 | aps = []
134 | for box in gt_bboxes:
135 | frame_id = str(box["timestamp"])
136 | if frame_id in pred_bboxes:
137 | pred_bbox = pred_bboxes[frame_id]
138 | gt_bbox = {
139 | "xmin": box["xmin"],
140 | "ymin": box["ymin"],
141 | "xmax": box["xmax"],
142 | "ymax": box["ymax"]
143 | }
144 | iou = calculate_bbox_iou(gt_bbox, pred_bbox)
145 | ious.append(iou)
146 | else:
147 | ious.append(0.0)
148 | mIoU = np.mean(ious) if ious else 0.0
149 |
150 | for threshold in iou_thresholds:
151 | scores = [1 if iou >= threshold else 0 for iou in ious]
152 | if len(ious) > 0:
153 | aps.append(np.mean(scores))
154 | else:
155 | aps.append(0.0)
156 | return aps, mIoU
157 |
158 | def calculate_spatial_random(gt_bboxes, w, h):
159 | """calculate Spatial IoU and mAP"""
160 | pred_bbox = [0, 0, w, h]
161 | iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
162 | ious = []
163 | aps = []
164 | for gt_bbox_entry in gt_bboxes:
165 | for frame_id, gt_bbox in gt_bbox_entry.items():
166 | iou = calculate_bbox_iou(gt_bbox, pred_bbox)
167 | ious.append(iou)
168 | mIoU = np.mean(ious) if ious else 0.0
169 |
170 | for threshold in iou_thresholds:
171 | scores = [1 if iou >= threshold else 0 for iou in ious]
172 | if len(ious) > 0:
173 | aps.append(np.mean(scores))
174 | else:
175 | aps.append(0.0)
176 | return aps, mIoU
177 |
178 | # evaluate the json file
179 | def evaluate_json(file_path):
180 | with open(file_path, 'r') as f:
181 | data = json.load(f)
182 |
183 | model_name = file_path.split("/")[-1].split("_")[0]
184 | domains = {}
185 | durations = {}
186 | overall_stats = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
187 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
188 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
189 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
190 |
191 |
192 | for idx, item in enumerate(tqdm(data, desc=f"Evaluating {model_name} results", unit="item")):
193 | video_length = round(item['frame_count']/item['fps'], 2)
194 | w, h = item['width'], item['height']
195 | domain = item.get("domain", "unknown")
196 | if domain not in domains:
197 | domains[domain] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
198 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
199 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
200 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
201 |
202 | if video_length < 60:
203 | duration = "Short"
204 | elif 60 <= video_length < 180:
205 | duration = "Medium"
206 | else:
207 | duration = "Long"
208 | if duration not in durations:
209 | durations[duration] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
210 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
211 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
212 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
213 |
214 | if 'answer_vqa' in item and item['answer_vqa']:
215 | score = qwen2_5_evaluation(item['question'], item['answer'], item['answer_vqa'])
216 | else:
217 | continue
218 | overall_stats["all_rating"].append(score if score != -1 else 0)
219 | domains[domain]["all_rating"].append(score if score != -1 else 0)
220 | durations[duration]["all_rating"].append(score if score != -1 else 0)
221 | if score != -1:
222 | overall_stats["valid_rating"].append(score)
223 | domains[domain]["valid_rating"].append(score)
224 | durations[duration]["valid_rating"].append(score)
225 | if score >= 2:
226 | overall_stats["correct_num"] += 1
227 | domains[domain]["correct_num"] += 1
228 | durations[duration]["correct_num"] += 1
229 | data[idx]["VQA_score"] = score
230 | # answer_temporal
231 | if 'answer_temporal' in item and item['answer_temporal']:
232 | temporal_iou = calculate_temporal_iou(item['timestamps'], item['answer_temporal'])
233 | else:
234 | temporal_iou = 0.0
235 |
236 | overall_stats["temporal_ious"].append(temporal_iou)
237 | domains[domain]["temporal_ious"].append(temporal_iou)
238 | durations[duration]["temporal_ious"].append(temporal_iou)
239 | data[idx]["temporal_IoU"] = temporal_iou
240 |
241 | # answer_temporal_2
242 | if 'answer_temporal_2' in item and item['answer_temporal_2']:
243 | temporal_iou_2 = calculate_temporal_iou(item['timestamps'], item['answer_temporal_2'])
244 | else:
245 | temporal_iou_2 = 0.0
246 |
247 | overall_stats["temporal_ious_2"].append(temporal_iou_2)
248 | domains[domain]["temporal_ious_2"].append(temporal_iou_2)
249 | durations[duration]["temporal_ious_2"].append(temporal_iou_2)
250 | data[idx]["temporal_IoU_2"] = temporal_iou_2
251 |
252 | random_iou = calculate_temporal_iou(item['timestamps'],[0, video_length])
253 | overall_stats["random_tious"].append(random_iou)
254 | domains[domain]["random_tious"].append(random_iou)
255 | durations[duration]["random_tious"].append(random_iou)
256 |
257 | # answer_spatial
258 | if 'answer_spatial' in item and item['answer_spatial']:
259 | aps, mIoU = calculate_spatial_metrics(item['bboxes'], item['answer_spatial'])
260 | else:
261 | aps, mIoU = [0.0] * 5, 0.0
262 | for i, ap in enumerate(aps):
263 | domains[domain]["spatial_aps"][i].append(ap)
264 | durations[duration]["spatial_aps"][i].append(ap)
265 | overall_stats["spatial_aps"][i].append(ap)
266 | domains[domain]["spatial_mious"].append(mIoU)
267 | durations[duration]["spatial_mious"].append(mIoU)
268 | overall_stats["spatial_mious"].append(mIoU)
269 | data[idx]["AP1@0.1:0.9"] = aps
270 | data[idx]["spatial_mIoU"] = mIoU
271 |
272 | # answer_spatial_2
273 | if 'answer_spatial_2' in item and item['answer_spatial_2']:
274 | aps_2, mIoU_2 = calculate_spatial_metrics(item['bboxes'], item['answer_spatial_2'])
275 | else:
276 | aps_2, mIoU_2 = [0.0] * 5, 0.0
277 | for i, ap in enumerate(aps_2):
278 | domains[domain]["spatial_aps_2"][i].append(ap)
279 | durations[duration]["spatial_aps_2"][i].append(ap)
280 | overall_stats["spatial_aps_2"][i].append(ap)
281 | domains[domain]["spatial_mious_2"].append(mIoU_2)
282 | durations[duration]["spatial_mious_2"].append(mIoU_2)
283 | overall_stats["spatial_mious_2"].append(mIoU_2)
284 | data[idx]["AP2@0.1:0.9"] = aps_2
285 | data[idx]["spatial_mIoU_2"] = mIoU_2
286 |
287 |
288 | random_aps, random_mIoU = calculate_spatial_random(item['bboxes'], w, h)
289 | for i, ap in enumerate(random_aps):
290 | domains[domain]["random_aps"][i].append(ap)
291 | durations[duration]["random_aps"][i].append(ap)
292 | overall_stats["random_aps"][i].append(ap)
293 | domains[domain]["random_vious"].append(random_mIoU)
294 | durations[duration]["random_vious"].append(random_mIoU)
295 | overall_stats["random_vious"].append(random_mIoU)
296 |
297 | with open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f:
298 | json.dump(data, f, indent=4)
299 |
300 | if score >= 2 and temporal_iou >= 0.3:
301 | domains[domain]["vqa_temporal_idx"].append(idx)
302 | durations[duration]["vqa_temporal_idx"].append(idx)
303 | overall_stats["vqa_temporal_idx"].append(idx)
304 | if score >= 2 and temporal_iou_2 >= 0.3:
305 | domains[domain]["vqa_temporal_idx_2"].append(idx)
306 | durations[duration]["vqa_temporal_idx_2"].append(idx)
307 | overall_stats["vqa_temporal_idx_2"].append(idx)
308 | if score >= 2 and mIoU >= 0.1:
309 | domains[domain]["vqa_spatial_idx"].append(idx)
310 | durations[duration]["vqa_spatial_idx"].append(idx)
311 | overall_stats["vqa_spatial_idx"].append(idx)
312 | if score >= 2 and mIoU_2 >= 0.1:
313 | domains[domain]["vqa_spatial_idx_2"].append(idx)
314 | durations[duration]["vqa_spatial_idx_2"].append(idx)
315 | overall_stats["vqa_spatial_idx_2"].append(idx)
316 | if temporal_iou >= 0.3 and mIoU >= 0.1:
317 | domains[domain]["temporal_spatial_idx"].append(idx)
318 | durations[duration]["temporal_spatial_idx"].append(idx)
319 | overall_stats["temporal_spatial_idx"].append(idx)
320 | if temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1:
321 | domains[domain]["temporal_spatial_idx_2"].append(idx)
322 | durations[duration]["temporal_spatial_idx_2"].append(idx)
323 | overall_stats["temporal_spatial_idx_2"].append(idx)
324 | if score >= 2 and temporal_iou >= 0.3 and mIoU >= 0.1:
325 | domains[domain]["vqa_temp_spatial_idx"].append(idx)
326 | durations[duration]["vqa_temp_spatial_idx"].append(idx)
327 | overall_stats["vqa_temp_spatial_idx"].append(idx)
328 | if score >= 2 and temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1:
329 | domains[domain]["vqa_temp_spatial_idx_2"].append(idx)
330 | durations[duration]["vqa_temp_spatial_idx_2"].append(idx)
331 | overall_stats["vqa_temp_spatial_idx_2"].append(idx)
332 |
333 | with open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f:
334 | json.dump(data, f, indent=4)
335 |
336 | def print_stats(label, stats, total_samples):
337 | avg_all_score = np.mean(stats["all_rating"])
338 | avg_valid_score = np.mean(stats["valid_rating"]) if stats["valid_rating"] else 0
339 | acc_vqa = stats["correct_num"] / total_samples
340 |
341 | r1_iou30 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious"]])
342 | r1_iou50 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious"]])
343 | r1_iou70 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious"]])
344 | mean_temporal_iou = np.mean(stats["temporal_ious"])
345 |
346 | r1_iou30_2 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious_2"]])
347 | r1_iou50_2 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious_2"]])
348 | r1_iou70_2 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious_2"]])
349 | mean_temporal_iou_2 = np.mean(stats["temporal_ious_2"])
350 |
351 | mean_aps = [np.mean(ar_list) for ar_list in stats["spatial_aps"]]
352 | mean_miou = np.mean(stats["spatial_mious"])
353 |
354 | mean_aps_2 = [np.mean(ar_list) for ar_list in stats["spatial_aps_2"]]
355 | mean_miou_2 = np.mean(stats["spatial_mious_2"])
356 |
357 |
358 | vqa_temp = len(stats["vqa_temporal_idx"]) / total_samples
359 | vqa_temp_2 = len(stats["vqa_temporal_idx_2"]) / total_samples
360 | vqa_spat = len(stats["vqa_spatial_idx"]) / total_samples
361 | vqa_spat_2 = len(stats["vqa_spatial_idx_2"]) / total_samples
362 | temp_spat = len(stats["temporal_spatial_idx"]) / total_samples
363 | temp_spat_2 = len(stats["temporal_spatial_idx_2"]) / total_samples
364 | vqa_temp_spat = len(stats["vqa_temp_spatial_idx"]) / total_samples
365 | vqa_temp_spat_2 = len(stats["vqa_temp_spatial_idx_2"]) / total_samples
366 |
367 | print(f"{label}:")
368 | print(f"VQA: Avg All Score: {avg_all_score:.4f}, Avg Valid Score: {avg_valid_score:.4f}, Accuracy: {acc_vqa:.4f}")
369 | print("Chain 1:")
370 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30:.4f}, R1@IoU=0.5: {r1_iou50:.4f}, R1@IoU=0.7: {r1_iou70:.4f}, Mean IoU: {mean_temporal_iou:.4f}")
371 | print(f"Spatial Answer: mAP@0.1: {mean_aps[0]:.4f}, mAP@0.3: {mean_aps[1]:.4f}, mAP@0.5: {mean_aps[2]:.4f}, mAP@0.7: {mean_aps[3]:.4f}, mAP@0.9: {mean_aps[4]:.4f}, Mean mIoU: {mean_miou:.4f}")
372 | print("\n")
373 | print("Chain 2:")
374 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30_2:.4f}, R1@IoU=0.5: {r1_iou50_2:.4f}, R1@IoU=0.7: {r1_iou70_2:.4f}, Mean IoU: {mean_temporal_iou_2:.4f}")
375 | print(f"Spatial Answer: mAP@0.1: {mean_aps_2[0]:.4f}, mAP@0.3: {mean_aps_2[1]:.4f}, mAP@0.5: {mean_aps_2[2]:.4f}, mAP@0.7: {mean_aps_2[3]:.4f}, mAP@0.9: {mean_aps_2[4]:.4f}, Mean mIoU: {mean_miou_2:.4f}")
376 | print("\n")
377 |
378 | AM = (acc_vqa + mean_temporal_iou + mean_miou)/3
379 | AM2 = (acc_vqa + mean_temporal_iou_2 + mean_miou_2)/3
380 | mAM = (AM + AM2) / 2
381 |
382 | LGM = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou) + math.log(1 - mean_miou)) / 3
383 | LGM2 = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou_2) + math.log(1 - mean_miou_2)) / 3
384 | mLGM = (LGM + LGM2) / 2
385 |
386 | print(f"AM1:{AM:.4f}, AM2:{AM2:.4f}, mAM:{mAM:.4f}")
387 | print(f"LGM1:{LGM:.4f}, LGM2:{LGM2:.4f}, mLGM:{mLGM:.4f}\n")
388 |
389 | print("Combined results:")
390 | print(f"VQA & Temp: Chain 1: {vqa_temp:.4f}, Chain 2: {vqa_temp_2:.4f}")
391 | print(f"VQA & Spat: Chain 1: {vqa_spat:.4f} Chain 2: {vqa_spat_2:.4f}")
392 | print(f"Temp & Spat: Chain 1: {temp_spat:.4f} Chain 2: {temp_spat_2:.4f}")
393 | print(f"VQA & Temp & Spat: Chain 1:{vqa_temp_spat:.4f} Chain 2: {vqa_temp_spat_2:.4f}")
394 | print(f"VQA & Temp list: \n Chain 1:{stats['vqa_temporal_idx']} \nChain 2:{stats['vqa_temporal_idx_2']}")
395 | print(f"VQA & Spat list: \n Chain 1:{stats['vqa_spatial_idx']} \n Chain 2: {stats['vqa_spatial_idx_2']}")
396 | print(f"Temp & Spat list: \n Chain 1:{stats['temporal_spatial_idx']} \n Chain 2: {stats['temporal_spatial_idx_2']}")
397 | print(f"VQA & Temp & Spat list: \n Chain 1:{stats['vqa_temp_spatial_idx']} \n Chain 2:{stats['vqa_temp_spatial_idx_2']}\n")
398 |
399 | print_stats("Overall Statistics", overall_stats, len(data))
400 | for duration, stats in durations.items():
401 | print_stats(f"Video Length: {duration}", stats, len(stats["all_rating"]))
402 | for domain, stats in domains.items():
403 | print_stats(f"Domain: {domain}", stats, len(stats["all_rating"]))
404 |
405 |
406 | # print("\nEvaluating GPT-4o:\n")
407 | # evaluate_json('results/gpt4o/gpt4o_answer_update_merged.json')
408 |
409 | # print("\nEvaluating Gemini-2-Flash:\n")
410 | # evaluate_json('results/gemini2/gemini2_answer_merged.json')
411 |
412 | # print("\nEvaluating Video-Llama3:\n")
413 | # evaluate_json('results/videollama3/videollama3_answer_merged.json')
414 |
415 | print("\nEvaluating Qwen2.5-VL:")
416 | evaluate_json('results/qwen2_5/qwen2-5-32B_answer_update.json')
417 |
418 | # print("\nEvaluating InternVL-2.5-38B:")
419 | # evaluate_json('results/internvl2_5/internvl2-5-38B_answer.json')
420 |
421 | # print("\nEvaluating InternVL-2.5:")
422 | # evaluate_json('results/internvl2_5/internvl2_5_answer_merged.json')
423 |
424 | # print("\nEvaluating Llava-Video:")
425 | # evaluate_json('results/llava-video/llavavideo_answer_merged.json')
426 |
427 | # print("\nEvaluating Qwen2-VL:")
428 | # evaluate_json('results/qwen2/qwen2vl_answer_merged.json')
429 |
430 | # print("\nEvaluating VideoChat2:")
431 | # evaluate_json('results/videochat2/videochat2_answer_merged.json')
432 |
433 | # print("\nEvaluating Oryx-1.5:")
434 | # evaluate_json('results/oryx-1.5/oryx15_answer_merged.json')
435 |
436 | # print("\nEvaluating VideoCCAM:")
437 | # evaluate_json('results/videoccam/videoccam12_answer_merged.json')
438 |
439 | # print("\nEvaluating TimeChat:")
440 | # evaluate_json('results/timechat/timechat_answer_merged.json')
441 |
442 | # print("\nEvaluating VTimeLLM:")
443 | # evaluate_json('results/vtimellm/vtimellm_answer_merged.json')
444 |
445 | # print("\nEvaluating Trace:")
446 | # evaluate_json('results/trace/trace_answer_merged.json')
447 |
448 | # print("\nEvaluating Sa2VA:")
449 | # evaluate_json('results/sa2va/sa2va_answer_merged.json')
450 |
451 |
452 |
--------------------------------------------------------------------------------
/inference_demo.py:
--------------------------------------------------------------------------------
1 | import cv2 # We're using OpenCV to read video, to install !pip install opencv-python
2 | import base64
3 | import time
4 | import os
5 | import requests
6 | from tqdm import tqdm
7 | import json
8 | import re
9 | import torch
10 | import math
11 |
12 | from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
13 | from qwen_vl_utils import process_vision_info
14 |
15 | model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
16 |
17 | model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
18 | model_path,
19 | torch_dtype=torch.bfloat16,
20 | attn_implementation="flash_attention_2",
21 | device_map="auto"
22 | )
23 | processor = AutoProcessor.from_pretrained(model_path)
24 |
25 | def inference(video_path, prompt, max_new_tokens=2048, total_pixels=20480 * 28 * 28, min_pixels=16 * 28 * 28):
26 | messages = [
27 | {"role": "system", "content": "You are a helpful assistant."},
28 | {"role": "user", "content": [
29 | {"type": "text", "text": prompt},
30 | {"type": "video", "video": video_path, "total_pixels": total_pixels, "min_pixels": min_pixels, "fps": 1.0},
31 | ]
32 | },
33 | ]
34 | text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
35 | image_inputs, video_inputs, video_kwargs = process_vision_info([messages], return_video_kwargs=True)
36 | fps_inputs = video_kwargs['fps']
37 | print("video input:", video_inputs[0].shape)
38 | num_frames, _, resized_height, resized_width = video_inputs[0].shape
39 | # print("num of video tokens:", int(num_frames / 2 * resized_height / 28 * resized_width / 28))
40 | inputs = processor(text=[text], images=image_inputs, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt")
41 | inputs = inputs.to('cuda')
42 |
43 | output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
44 | generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
45 | output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
46 | return output_text[0], resized_height, resized_width
47 |
48 | def read_anno(anno_file):
49 | with open(anno_file, 'r') as f:
50 | data = json.load(f)
51 | return data
52 |
53 | def find_video(video_folder, vid):
54 | """
55 | Finds the vid.mp4 file in the video_folder and its subfolders.
56 |
57 | Args:
58 | video_folder (str): path of the folder to search.
59 | vid (str): the filename of the video (without extension).
60 |
61 | Returns:
62 | str: absolute path to the vid.mp4 file, or None if not found.
63 | """
64 | target_filename = f"{vid}.mp4"
65 | for root, _, files in os.walk(video_folder):
66 | if target_filename in files:
67 | return os.path.join(root, target_filename)
68 | return None
69 |
70 | def get_answer_vqa(data, video_path):
71 | prompt = f"Answer the question about the video: {data['question']} \n (If the answer is a person, you don't need to identify the person.)"
72 | answer_vqa, _, _ = inference(video_path, prompt)
73 | return answer_vqa
74 |
75 | def get_answer_temporal(data, video_path):
76 | video_length = round(data['frame_count']/data['fps'], 2)
77 | temporal_question = data['temporal_question']
78 | prompt = f"This video is {video_length} seconds long. Answer the question about the video: {temporal_question} \n Output the start and end moment timestamps."
79 | answer_temporal, _, _ = inference(video_path, prompt)
80 | return answer_temporal
81 |
82 | def get_answer_temporal_2(data, video_path, bboxes):
83 | video_length = round(data['frame_count']/data['fps'], 2)
84 | temporal_question = data['temporal_question']
85 | w, h = data['width'], data['height']
86 | prompt = f"This video is {video_length} seconds long with a resolution of {w}x{h} (width x height). Answer the question about the video: {temporal_question} \n There are {len(bboxes)} bounding boxes of the key object related to the question in the video without knowing the time, which are:{bboxes}. Output the start and end moment timestamps."
87 | answer_temporal, _, _ = inference(video_path, prompt)
88 | return answer_temporal
89 |
90 | def get_answer_spatial(data, video_path):
91 | video_length = round(data['frame_count']/data['fps'], 2)
92 | st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
93 | time_range = list(range(st, et + 1))
94 | w, h = data['width'], data['height']
95 | spatial_question = data['spatial_question']
96 | prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
97 | For each whole second within the time range {time_range} provided (inclusive of the boundaries), output a series of bounding boxes of the object in JSON format. The keys should be the whole seconds (as strings), and the values should be the box in [x1, y1, x2, y2] format.
98 | Example output: {{"{time_range[0]}": [x1, y1, x2, y2],...}}
99 | """
100 | answer_spatial, input_height, input_width = inference(video_path, prompt)
101 | return answer_spatial, input_height, input_width
102 |
103 | def get_answer_spatial_2(data, video_path, bboxes):
104 | video_length = round(data['frame_count']/data['fps'], 2)
105 | st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
106 | time_range = list(range(st, et + 1))
107 | w, h = data['width'], data['height']
108 | spatial_question = data['spatial_question_2']
109 | prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
110 | For each whole second that may be related to the question, output a series of bounding boxes of the object in JSON format. You only need to output {len(bboxes)} bbox(es). You need to determine which frame is related to the question, and you don't need to output the bbox for the frames not related to the question.
111 | The keys should be the whole seconds (as strings), and the values should be the bounding box in [x0,y0,x1,y1] format.
112 | \n Example output:
113 | {{"0": [x0,y0,x1,y1], "1":..., ..., "{len(bboxes)}":...}} (if the frames at 0~{len(bboxes)} second are related to the questions)
114 | """
115 | answer_spatial, input_height, input_width = inference(video_path, prompt)
116 | return answer_spatial, input_height, input_width
117 |
118 | def extract_timestamps(result):
119 | """extract timestamps from the answer"""
120 | match = re.findall(r"\b\d+(?:\.\d+)?\b", result)
121 | return [float(match[0]), float(match[1])] if len(match) == 2 else []
122 |
123 |
124 | def fix_incomplete_json(json_str):
125 | """
126 | fix the incomplete brackets of the json
127 | """
128 | # Counting left and right brackets
129 | open_square = json_str.count('[')
130 | close_square = json_str.count(']')
131 | open_curly = json_str.count('{')
132 | close_curly = json_str.count('}')
133 |
134 | # Complete the square brackets
135 | if open_square > close_square:
136 | json_str += ']' * (open_square - close_square)
137 | elif close_square > open_square:
138 | json_str = '[' * (close_square - open_square) + json_str
139 |
140 | # Complete the curly brackets
141 | if open_curly > close_curly:
142 | json_str += '}' * (open_curly - close_curly)
143 | elif close_curly > open_curly:
144 | json_str = '{' * (close_curly - open_curly) + json_str
145 |
146 | return json_str
147 |
148 |
149 | def extract_bounding_boxes(answer_spatial, data, input_height, input_width):
150 | """
151 | Extract bounding boxes from the input answer_spatial and denormalize the coordinates using the width and height from the data.
152 | """
153 | w, h = data['width'], data['height']
154 |
155 | def denormalize_bbox(bbox):
156 | """
157 | denormalize the coordinates of bbox
158 | """
159 | try:
160 | if len(bbox) == 1:
161 | bbox = bbox[0]
162 | if len(bbox) == 2:
163 | bbox = bbox[1]
164 | x_min = int(bbox[0] / input_width * w)
165 | y_min = int(bbox[1] / input_height * h)
166 | x_max = int(bbox[2] / input_width * w)
167 | y_max = int(bbox[3] / input_height * h)
168 | return [x_min, y_min, x_max, y_max]
169 | except Exception as e:
170 | print(f"Error {e} while processing bbox {bbox}")
171 | return bbox
172 |
173 | # match markdown json
174 | markdown_pattern = r'```json\s*\n(\[.*?\]|\{.*?\})\s*\n```'
175 | match = re.search(markdown_pattern, answer_spatial, re.DOTALL)
176 | if not match:
177 | # If there is no Markdown wrapper, then try to match the JSON format directly
178 | json_pattern = r'(\[[\s\S]*\]|\{[\s\S]*\})'
179 | match = re.search(json_pattern, answer_spatial, re.DOTALL)
180 | if match:
181 | # match bbox in JSON
182 | bounding_boxes_str = match.group(1).strip()
183 | # Replace single quotes with double quotes to conform to the JSON specification
184 | bounding_boxes_str = bounding_boxes_str.replace("'", '"')
185 | try:
186 | # Convert strings to dictionary or list format
187 | bounding_boxes = json.loads(bounding_boxes_str)
188 | # If it's a list and contains a dictionary inside, expand it to a single dictionary
189 | if isinstance(bounding_boxes, list) and all(isinstance(item, dict) for item in bounding_boxes):
190 | combined_dict = {}
191 | for item in bounding_boxes:
192 | combined_dict.update(item)
193 | bounding_boxes = combined_dict
194 | # Determine if the extracted JSON is a dictionary or a list.
195 | if isinstance(bounding_boxes, list):
196 | # bounding boxes in list
197 | return {str(box[0]): denormalize_bbox(box[1]) for box in bounding_boxes}
198 | elif isinstance(bounding_boxes, dict):
199 | # bounding boxes in dictionary
200 | return {key: denormalize_bbox(value) for key, value in bounding_boxes.items()}
201 | except Exception as e:
202 | # if failed, try to fix it.
203 | fixed_bounding_boxes_str = fix_incomplete_json(bounding_boxes_str)
204 | try:
205 | bounding_boxes = json.loads(fixed_bounding_boxes_str)
206 | if isinstance(bounding_boxes, list):
207 | return [denormalize_bbox(box) for box in bounding_boxes]
208 | elif isinstance(bounding_boxes, dict):
209 | return {key: denormalize_bbox(value) for key, value in bounding_boxes.items()}
210 | except Exception as e:
211 | print(f"Failed after fixing: {e}\nExtracted JSON: {fixed_bounding_boxes_str}")
212 | return None
213 | else:
214 | print("No match found for the bounding box JSON.")
215 | return None
216 |
217 | def test_qwen2_5vl(video_folder, anno_file, result_file):
218 | anno = read_anno(anno_file)
219 |
220 | for idx, data in enumerate(tqdm(anno, desc="Processing videos", unit="video")):
221 | try:
222 | vid = data['vid']
223 | timestamps = data['timestamps']
224 | video_length = round(data['frame_count']/data['fps'], 1)
225 | boxes = [[box_data["xmin"], box_data["ymin"], box_data["xmax"], box_data["ymax"]] \
226 | for box in data["bboxes"] for box_data in box.values()]
227 | video_path = find_video(video_folder, vid)
228 | answer_vqa = get_answer_vqa(data, video_path)
229 | # chain one
230 | answer_temporal = get_answer_temporal(data, video_path)
231 | answer_temporal_post = extract_timestamps(answer_temporal)
232 |
233 | answer_spatial, input_height, input_width = get_answer_spatial(data, video_path)
234 | answer_spatial_post = extract_bounding_boxes(answer_spatial, data, input_height, input_width)
235 |
236 | # chain two
237 | answer_spatial_2, input_height, input_width = get_answer_spatial_2(data, video_path, boxes)
238 | answer_spatial_post_2 = extract_bounding_boxes(answer_spatial_2, data, input_height, input_width)
239 |
240 | answer_temporal_2 = get_answer_temporal_2(data, video_path, boxes)
241 | answer_temporal_post_2 = extract_timestamps(answer_temporal_2)
242 |
243 | # update data
244 | data['answer_vqa'] = answer_vqa
245 | data['answer_temporal_pre'] = answer_temporal
246 | data['answer_temporal'] = answer_temporal_post
247 | data['answer_spatial_pre'] = answer_spatial
248 | data['answer_spatial'] = answer_spatial_post
249 |
250 | data['answer_spatial_pre_2'] = answer_spatial_2
251 | data['answer_spatial_2'] = answer_spatial_post_2
252 | data['answer_temporal_pre_2'] = answer_temporal_2
253 | data['answer_temporal_2'] = answer_temporal_post_2
254 | # update result_file
255 | with open(result_file, 'w') as f:
256 | json.dump(anno, f, indent=4)
257 | except Exception as e:
258 | print("")
259 | print(f"ERROR in data {idx}: {e}.")
260 | continue
261 |
262 |
263 | if __name__ == "__main__":
264 | video_folder = "/Path/to/video/folder"
265 | anno_file = "/path/to/anno/file.json"
266 | result_file = "/path/to/result/file.json"
267 | test_qwen2_5vl(video_folder, anno_file, result_file)
268 |
--------------------------------------------------------------------------------