├── .gitignore
├── README.md
├── assets
│   ├── radar.png
│   ├── v_star_logo.png
│   ├── vis.png
│   └── vis_v2.png
├── data
│   ├── README.md
│   └── V_STaR_test.json
├── eval.py
└── inference_demo.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 |
4 | [Paper](https://arxiv.org/pdf/2503.11495)
5 | [Leaderboard](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard)
6 | [Dataset](https://huggingface.co/datasets/V-STaR-Bench/V-STaR)
7 | [Project Page](https://v-star-bench.github.io/)
8 | [Awesome MLLM Reasoning Collection](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection)
9 |
10 |
11 |
12 | This repository contains the implementation of the following paper and its related follow-up works in progress. We evaluate Video-LLMs!
13 | > **V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning**
14 | Zixu Cheng,
15 | Jian Hu+,
16 | Ziquan Liu,
17 | Chenyang Si,
18 | Wei Li,
19 | Shaogang Gong
20 |
21 |
22 |
23 |
24 | ### Table of Contents
25 | - [Updates](#updates)
26 | - [Overview](#overview)
27 | - [Evaluation Results](#evaluation_results)
28 | - [Usage](#usage)
29 | - [Citation and Acknowledgement](#citation_and_acknowledgement)
30 |
31 |
32 | ## :mega: Updates
33 | - [03/2025] **MLLM reasoning collections released:** We’re excited to announce the release of the [Awesome-MLLM-Reasoning-Collection](https://github.com/lwpyh/Awesome-MLLM-Reasoning-Collection) repository! 🚀 This collection curates papers, code, datasets, benchmarks, and resources dedicated to MLLM reasoning, providing a valuable resource for researchers and practitioners in the field. Check it out and contribute! 🔍📚💡
34 | - [03/2025] **Dataset released:** Our evaluation dataset is released.
35 | - [03/2025] **Code and leaderboard released:** Our V-STaR benchmark is released.
36 |
37 | ## :mega: Overview
38 | 
39 | Overview of V-STaR. Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as a bias when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the requirements for robust and consistent spatio-temporal reasoning.
40 |
41 |
42 |
43 | ## :mortar_board: Evaluation Results
44 |
45 | ***See our [leaderboard](https://v-star-bench-v-star-leaderboard.hf.space/) for the most up-to-date ranking and numerical results (with models such as GPT-4o, Gemini-2-Flash and Qwen2.5-VL).***
46 |
47 | We visualize the evaluation results of the 6 most recent top-performing Video-LLMs across 9 V-STaR domains.
48 |
49 | We propose using the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. We calculate AM and LGM from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding and the "m_vIoU" of spatial grounding, and report the mean AM (mAM) and mean LGM (mLGM) over our two proposed RSTR question chains.
50 |
51 |
52 |
53 |
54 |
55 | ## :trophy: Leaderboard
56 |
57 | See numeric values at our [Leaderboard](https://huggingface.co/spaces/V-STaR-Bench/V-STaR-LeaderBoard) :1st_place_medal::2nd_place_medal::3rd_place_medal:
58 |
59 | #### **How to join the V-STaR Leaderboard?**
60 |
61 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
62 |
63 | #### Evaluation Criterion
64 |
65 | To evaluate the open-ended *"what"* question, we use Qwen2.5-72B-Instruct to score answers from 0 to 3, denoting entirely incorrect, largely incorrect, largely correct, and entirely correct, respectively. Answers scoring 2 or above are considered correct, allowing us to compute accuracy.
66 |
67 | For the *"when"* question, we follow the commonly used temporal grounding metrics: R@n,IoU=m, which refers to the percentage of top-n predictions with a temporal IoU score larger than m (we report R1, i.e. n=1, at m = 0.3, 0.5, 0.7), and the mean temporal IoU score (m\_tIoU).
68 |
69 | For the *"where"* question, we use the Average Precision score (AP@vIoU=m) and the mean visual Intersection over Union (m\_vIoU) over every annotated frame. We then use the proposed AM and LGM to measure a model's spatio-temporal reasoning ability: a higher LGM indicates better overall spatio-temporal reasoning, and a higher AM indicates a better average performance across the three metrics. The exact combination used by `eval.py` is given below.
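Concretely, with Accuracy, m\_tIoU and m\_vIoU expressed as fractions in [0, 1), `eval.py` computes for each chain:

$$
\mathrm{AM} = \frac{\text{Acc} + \text{m\_tIoU} + \text{m\_vIoU}}{3}, \qquad
\mathrm{LGM} = -\frac{1}{3}\Big(\ln(1-\text{Acc}) + \ln(1-\text{m\_tIoU}) + \ln(1-\text{m\_vIoU})\Big)
$$

mAM and mLGM are the averages of AM and LGM over the two RSTR question chains.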
70 |
71 |
72 | ## Usage
73 | ##### Download the V-STaR dataset from Hugging Face
74 |
75 | ```bash
76 | git lfs install
77 | git clone https://huggingface.co/datasets/V-STaR-Bench/V-STaR
78 | ```
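Alternatively, here is a minimal sketch of downloading the dataset with the `huggingface_hub` Python API (assuming `huggingface_hub` is installed; the local path is up to you):

```python
# Sketch: fetch the V-STaR dataset repository via the Hugging Face Hub client.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="V-STaR-Bench/V-STaR",  # dataset repo linked above
    repo_type="dataset",
)
print("V-STaR downloaded to:", local_dir)
```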
79 |
80 | ##### Use V-STaR to evaluate Video-LLMs
81 |
82 | We provide the `inference_demo.py` script to test Qwen2.5-VL-7B on V-STaR:
83 |
84 | ```bash
85 | python inference_demo.py
86 | ```
87 | You can adapt the provided script to run your own Video-LLM on V-STaR and test its spatio-temporal reasoning ability; the sketch below shows the answer fields that `eval.py` expects in the result file.
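A minimal sketch of one result item: it is the original annotation entry with the model's answers added (field names follow `inference_demo.py`; the values below are placeholders):

```python
# Sketch of one item in the result JSON consumed by eval.py (placeholder values).
result_item = {
    # ... all original annotation fields (vid, question, timestamps, bboxes, ...) ...
    "answer_vqa": "The man opens the door.",           # open-ended answer, judged by Qwen2.5-72B-Instruct
    "answer_temporal": [12.0, 18.5],                   # Chain 1 temporal answer: [start, end] in seconds
    "answer_spatial": {"13": [220, 140, 380, 420]},    # Chain 1 spatial answer: {second: [x1, y1, x2, y2]}
    "answer_temporal_2": [12.0, 18.5],                 # Chain 2 temporal answer
    "answer_spatial_2": {"13": [220, 140, 380, 420]},  # Chain 2 spatial answer
}
```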
88 |
89 | To evaluate the results, update your result file path in the eval.py script and run:
90 |
92 | ```bash
92 | python eval.py
93 | ```
94 | Note: you need at least 2 NVIDIA A100 80G GPUs to run Qwen2.5-72B-Instruct for evaluation.
95 |
96 | ## Submit to Leaderboard
97 |
98 | Please contact us via email (jian.hu@qmul.ac.uk or zixu.cheng@qmul.ac.uk) to update your results.
99 |
100 |
101 |
102 | ## :black_nib: Citation
103 |
104 | If you find our repo useful for your research, please consider citing our paper:
105 |
106 | ```bibtex
107 | @misc{cheng2025vstarbenchmarkingvideollmsvideo,
108 | title={V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning},
109 | author={Zixu Cheng and Jian Hu and Ziquan Liu and Chenyang Si and Wei Li and Shaogang Gong},
110 | year={2025},
111 | eprint={2503.11495},
112 | archivePrefix={arXiv},
113 | primaryClass={cs.CV},
114 | url={https://arxiv.org/abs/2503.11495},
115 | }
116 | ```
117 |
--------------------------------------------------------------------------------
/assets/radar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/radar.png
--------------------------------------------------------------------------------
/assets/v_star_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/v_star_logo.png
--------------------------------------------------------------------------------
/assets/vis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis.png
--------------------------------------------------------------------------------
/assets/vis_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/V-STaR-Bench/V-STaR/c7debe7826959146d463bc991f7c8f3c1ee656b8/assets/vis_v2.png
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | # **Download the Dataset**
2 |
3 | Video: Please download the video data from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
4 |
5 | Annotation: You can download the annotation (`V_STaR_test.json`) from this folder or from [HuggingFace](https://huggingface.co/datasets/V-STaR-Bench/V-STaR).
6 |
7 | ## Annotation Structure
8 |
9 | Each annotation is organized as follows:
10 |
11 | ```
12 | {
13 | "vid": ..., # Video ID
14 | "domain": ...,
15 | "fps": ...,
16 | "width": ...,
17 | "height": ...,
18 | "frame_count": ..., # total frame number of the video
19 | "question": ..., # VQA question
20 | "chain": "...", # spatio-temporal thinking chain
21 | "object": ..., # object of the boxes
22 | "answer": ...,
23 | "temporal_question": ..., # temporal grounding question
24 | "timestamps": [..., ...],
25 | "spatial_question": ..., # Chain 1 spatial grounding question
26 | "spatial_question_2": ..., # Chain 2 spatial grounding question
27 | "bboxes": [
28 | {"timestamp":...,
29 | "frame_index":...,
30 | "xmin": ...,
31 | "ymin": ...,
32 | "xmax": ...,
33 | "ymax": ...},
34 | ...
35 | ]
36 | }
37 | ```
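A minimal sketch of loading and inspecting the annotation file (assuming it is saved as `V_STaR_test.json` in this folder):

```python
import json

# The annotation file is a JSON list of items with the structure shown above.
with open("V_STaR_test.json", "r") as f:
    annotations = json.load(f)

item = annotations[0]
print(item["vid"], item["domain"], item["question"])
print("temporal window:", item["timestamps"])
print("number of annotated boxes:", len(item["bboxes"]))
```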
38 |
39 |
--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | import json
2 | import numpy as np
3 | from tqdm import tqdm
4 | import math
5 |
6 | from transformers import AutoModelForCausalLM, AutoTokenizer
7 | import ast
8 |
9 | model_name = "Qwen/Qwen2.5-72B-Instruct"
10 |
11 | model = AutoModelForCausalLM.from_pretrained(
12 | model_name,
13 | torch_dtype="auto",
14 | device_map="auto"
15 | )
16 | tokenizer = AutoTokenizer.from_pretrained(model_name)
17 |
18 |
19 | system_prompt = """
20 | As an AI assistant, your task is to evaluate a candidate answer in comparison to a given correct answer.
21 | The question itself, the correct 'groundtruth' answer, and the candidate answer will be provided to you.
22 | Your assessment should range from 0 to 3, \
23 | based solely on the semantic similarity between the groundtruth and the candidate answer, \
24 | disregarding any grammatical differences.
25 | A rating of 0 suggests no similarity, implying the candidate answer is entirely incorrect.
26 | A rating of 1 suggests low similarity, meaning the candidate answer is largely incorrect.
27 | A rating of 2 suggests high similarity, meaning the candidate answer is largely correct.
28 | Lastly, a rating of 3 indicates complete similarity, which means the candidate answer is entirely correct.
29 | Your response should be a single integer from 0, 1, 2, or 3.
30 | """
31 |
32 | # tmpl = 'Groundtruth answer: {}\nCandidate answer: {}\nYour response: '
33 | tmpl = 'Question: {}\nGroundtruth answer: {}\nCandidate answer: {}\nYour response: '
34 |
35 | def qwen2_5_evaluation(question, gt, candidate):
36 | user_prompt=tmpl.format(question, gt, candidate)
37 | messages = [
38 | {"role": "system", "content": system_prompt},
39 | {"role": "user", "content": user_prompt}
40 | ]
41 | text = tokenizer.apply_chat_template(
42 | messages,
43 | tokenize=False,
44 | add_generation_prompt=True
45 | )
46 | model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
47 |
48 | generated_ids = model.generate(
49 | **model_inputs,
50 | max_new_tokens=512
51 | )
52 | generated_ids = [
53 | output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
54 | ]
55 |
56 | response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
57 | score = response
58 | # print(score)
59 | # breakpoint()
60 | try:
61 | score = int(score)
62 | except (ValueError, TypeError):
63 | score = -1
64 | return score
65 |
66 |
67 | def calculate_temporal_iou(gt_range, pred_range):
68 | """ calculate Temporal IoU"""
69 | if not pred_range:
70 | return 0.0
71 |
72 |
73 | if isinstance(pred_range, str):
74 | try:
75 | pred_range = ast.literal_eval(pred_range)
76 | except (ValueError, SyntaxError):
77 | return 0.0
78 |
79 | if not isinstance(pred_range, (list, tuple)) or len(pred_range) != 2 or \
80 | not all(isinstance(x, (int, float)) for x in pred_range):
81 | return 0.0
82 |
83 | gt_start, gt_end = gt_range
84 | pred_start, pred_end = pred_range
85 | intersection = max(0, min(gt_end, pred_end) - max(gt_start, pred_start))
86 | union = max(gt_end, pred_end) - min(gt_start, pred_start)
87 | return intersection / union if union > 0 else 0.0
88 |
89 |
90 | def compute_iou(gt_bbox, pred_bbox):
91 | """calculate 2 bbox IoU"""
92 | if not isinstance(pred_bbox, (list, tuple)) or len(pred_bbox) != 4:
93 | return 0.0
94 |
95 | # GT bbox
96 | gt_xmin, gt_ymin, gt_xmax, gt_ymax = gt_bbox['xmin'], gt_bbox['ymin'], gt_bbox['xmax'], gt_bbox['ymax']
97 | pred_xmin, pred_ymin, pred_xmax, pred_ymax = pred_bbox
98 |
99 | # Intersection
100 | x1 = max(gt_xmin, pred_xmin)
101 | y1 = max(gt_ymin, pred_ymin)
102 | x2 = min(gt_xmax, pred_xmax)
103 | y2 = min(gt_ymax, pred_ymax)
104 | intersection = max(0, x2 - x1) * max(0, y2 - y1)
105 |
106 | # Union
107 | gt_area = (gt_xmax - gt_xmin) * (gt_ymax - gt_ymin)
108 | pred_area = (pred_xmax - pred_xmin) * (pred_ymax - pred_ymin)
109 | union = gt_area + pred_area - intersection
110 |
111 | return intersection / union if union > 0 else 0.0
112 |
113 | def calculate_bbox_iou(gt_bbox, pred_bboxes):
114 | """Calculate single BBox IoU, support multiple prediction frames to get maximum IoU"""
115 | try:
116 | if not pred_bboxes:
117 | return 0.0
118 |
119 | if isinstance(pred_bboxes[0], (int, float)) and len(pred_bboxes) == 4:
120 | pred_bboxes = [pred_bboxes]
121 |
122 | return max([compute_iou(gt_bbox, pred_bbox) for pred_bbox in pred_bboxes])
123 | except:
124 | return 0.0
125 |
126 | def calculate_spatial_metrics(gt_bboxes, pred_bboxes):
127 | """calculate Spatial IoU and mAP"""
128 | if not pred_bboxes:
129 | return [0.0] * 5, 0.0
130 |
131 | iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
132 | ious = []
133 | aps = []
134 | for box in gt_bboxes:
135 | frame_id = str(box["timestamp"])
136 | if frame_id in pred_bboxes:
137 | pred_bbox = pred_bboxes[frame_id]
138 | gt_bbox = {
139 | "xmin": box["xmin"],
140 | "ymin": box["ymin"],
141 | "xmax": box["xmax"],
142 | "ymax": box["ymax"]
143 | }
144 | iou = calculate_bbox_iou(gt_bbox, pred_bbox)
145 | ious.append(iou)
146 | else:
147 | ious.append(0.0)
148 | mIoU = np.mean(ious) if ious else 0.0
149 |
150 | for threshold in iou_thresholds:
151 | scores = [1 if iou >= threshold else 0 for iou in ious]
152 | if len(ious) > 0:
153 | aps.append(np.mean(scores))
154 | else:
155 | aps.append(0.0)
156 | return aps, mIoU
157 |
158 | def calculate_spatial_random(gt_bboxes, w, h):
159 | """calculate Spatial IoU and mAP"""
160 | pred_bbox = [0, 0, w, h]
161 | iou_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
162 | ious = []
163 | aps = []
164 | for gt_bbox_entry in gt_bboxes:
165 | for frame_id, gt_bbox in gt_bbox_entry.items():
166 | iou = calculate_bbox_iou(gt_bbox, pred_bbox)
167 | ious.append(iou)
168 | mIoU = np.mean(ious) if ious else 0.0
169 |
170 | for threshold in iou_thresholds:
171 | scores = [1 if iou >= threshold else 0 for iou in ious]
172 | if len(ious) > 0:
173 | aps.append(np.mean(scores))
174 | else:
175 | aps.append(0.0)
176 | return aps, mIoU
177 |
178 | # evaluate the json file
179 | def evaluate_json(file_path):
180 | with open(file_path, 'r') as f:
181 | data = json.load(f)
182 |
183 | model_name = file_path.split("/")[-1].split("_")[0]
184 | domains = {}
185 | durations = {}
186 | overall_stats = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
187 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
188 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
189 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
190 |
191 |
192 | for idx, item in enumerate(tqdm(data, desc=f"Evaluating {model_name} results", unit="item")):
193 | video_length = round(item['frame_count']/item['fps'], 2)
194 | w, h = item['width'], item['height']
195 | domain = item.get("domain", "unknown")
196 | if domain not in domains:
197 | domains[domain] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
198 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
199 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
200 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
201 |
202 | if video_length < 60:
203 | duration = "Short"
204 | elif 60 <= video_length < 180:
205 | duration = "Medium"
206 | else:
207 | duration = "Long"
208 | if duration not in durations:
209 | durations[duration] = {"all_rating":[], "valid_rating": [], "correct_num":0, "temporal_ious": [], "temporal_ious_2": [], "spatial_aps": [[] for _ in range(5)],
210 | "spatial_aps_2": [[] for _ in range(5)], "spatial_mious": [], "spatial_mious_2": [], "random_tious": [], "random_aps": [[] for _ in range(5)], "random_vious":[],
211 | "vqa_temporal_idx":[], "vqa_spatial_idx":[], "temporal_spatial_idx":[],"vqa_temp_spatial_idx":[],
212 | "vqa_temporal_idx_2":[], "vqa_spatial_idx_2":[], "temporal_spatial_idx_2":[],"vqa_temp_spatial_idx_2":[]}
213 |
214 | if 'answer_vqa' in item and item['answer_vqa']:
215 | score = qwen2_5_evaluation(item['question'], item['answer'], item['answer_vqa'])
216 | else:
217 | continue
218 | overall_stats["all_rating"].append(score if score != -1 else 0)
219 | domains[domain]["all_rating"].append(score if score != -1 else 0)
220 | durations[duration]["all_rating"].append(score if score != -1 else 0)
221 | if score != -1:
222 | overall_stats["valid_rating"].append(score)
223 | domains[domain]["valid_rating"].append(score)
224 | durations[duration]["valid_rating"].append(score)
225 | if score >= 2:
226 | overall_stats["correct_num"] += 1
227 | domains[domain]["correct_num"] += 1
228 | durations[duration]["correct_num"] += 1
229 | data[idx]["VQA_score"] = score
230 | # answer_temporal
231 | if 'answer_temporal' in item and item['answer_temporal']:
232 | temporal_iou = calculate_temporal_iou(item['timestamps'], item['answer_temporal'])
233 | else:
234 | temporal_iou = 0.0
235 |
236 | overall_stats["temporal_ious"].append(temporal_iou)
237 | domains[domain]["temporal_ious"].append(temporal_iou)
238 | durations[duration]["temporal_ious"].append(temporal_iou)
239 | data[idx]["temporal_IoU"] = temporal_iou
240 |
241 | # answer_temporal_2
242 | if 'answer_temporal_2' in item and item['answer_temporal_2']:
243 | temporal_iou_2 = calculate_temporal_iou(item['timestamps'], item['answer_temporal_2'])
244 | else:
245 | temporal_iou_2 = 0.0
246 |
247 | overall_stats["temporal_ious_2"].append(temporal_iou_2)
248 | domains[domain]["temporal_ious_2"].append(temporal_iou_2)
249 | durations[duration]["temporal_ious_2"].append(temporal_iou_2)
250 | data[idx]["temporal_IoU_2"] = temporal_iou_2
251 |
252 | random_iou = calculate_temporal_iou(item['timestamps'],[0, video_length])
253 | overall_stats["random_tious"].append(random_iou)
254 | domains[domain]["random_tious"].append(random_iou)
255 | durations[duration]["random_tious"].append(random_iou)
256 |
257 | # answer_spatial
258 | if 'answer_spatial' in item and item['answer_spatial']:
259 | aps, mIoU = calculate_spatial_metrics(item['bboxes'], item['answer_spatial'])
260 | else:
261 | aps, mIoU = [0.0] * 5, 0.0
262 | for i, ap in enumerate(aps):
263 | domains[domain]["spatial_aps"][i].append(ap)
264 | durations[duration]["spatial_aps"][i].append(ap)
265 | overall_stats["spatial_aps"][i].append(ap)
266 | domains[domain]["spatial_mious"].append(mIoU)
267 | durations[duration]["spatial_mious"].append(mIoU)
268 | overall_stats["spatial_mious"].append(mIoU)
269 | data[idx]["AP1@0.1:0.9"] = aps
270 | data[idx]["spatial_mIoU"] = mIoU
271 |
272 | # answer_spatial_2
273 | if 'answer_spatial_2' in item and item['answer_spatial_2']:
274 | aps_2, mIoU_2 = calculate_spatial_metrics(item['bboxes'], item['answer_spatial_2'])
275 | else:
276 | aps_2, mIoU_2 = [0.0] * 5, 0.0
277 | for i, ap in enumerate(aps_2):
278 | domains[domain]["spatial_aps_2"][i].append(ap)
279 | durations[duration]["spatial_aps_2"][i].append(ap)
280 | overall_stats["spatial_aps_2"][i].append(ap)
281 | domains[domain]["spatial_mious_2"].append(mIoU_2)
282 | durations[duration]["spatial_mious_2"].append(mIoU_2)
283 | overall_stats["spatial_mious_2"].append(mIoU_2)
284 | data[idx]["AP2@0.1:0.9"] = aps_2
285 | data[idx]["spatial_mIoU_2"] = mIoU_2
286 |
287 |
288 | random_aps, random_mIoU = calculate_spatial_random(item['bboxes'], w, h)
289 | for i, ap in enumerate(random_aps):
290 | domains[domain]["random_aps"][i].append(ap)
291 | durations[duration]["random_aps"][i].append(ap)
292 | overall_stats["random_aps"][i].append(ap)
293 | domains[domain]["random_vious"].append(random_mIoU)
294 | durations[duration]["random_vious"].append(random_mIoU)
295 | overall_stats["random_vious"].append(random_mIoU)
296 |
297 | with open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f:
298 | json.dump(data, f, indent=4)
299 |
300 | if score >= 2 and temporal_iou >= 0.3:
301 | domains[domain]["vqa_temporal_idx"].append(idx)
302 | durations[duration]["vqa_temporal_idx"].append(idx)
303 | overall_stats["vqa_temporal_idx"].append(idx)
304 | if score >= 2 and temporal_iou_2 >= 0.3:
305 | domains[domain]["vqa_temporal_idx_2"].append(idx)
306 | durations[duration]["vqa_temporal_idx_2"].append(idx)
307 | overall_stats["vqa_temporal_idx_2"].append(idx)
308 | if score >= 2 and mIoU >= 0.1:
309 | domains[domain]["vqa_spatial_idx"].append(idx)
310 | durations[duration]["vqa_spatial_idx"].append(idx)
311 | overall_stats["vqa_spatial_idx"].append(idx)
312 | if score >= 2 and mIoU_2 >= 0.1:
313 | domains[domain]["vqa_spatial_idx_2"].append(idx)
314 | durations[duration]["vqa_spatial_idx_2"].append(idx)
315 | overall_stats["vqa_spatial_idx_2"].append(idx)
316 | if temporal_iou >= 0.3 and mIoU >= 0.1:
317 | domains[domain]["temporal_spatial_idx"].append(idx)
318 | durations[duration]["temporal_spatial_idx"].append(idx)
319 | overall_stats["temporal_spatial_idx"].append(idx)
320 | if temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1:
321 | domains[domain]["temporal_spatial_idx_2"].append(idx)
322 | durations[duration]["temporal_spatial_idx_2"].append(idx)
323 | overall_stats["temporal_spatial_idx_2"].append(idx)
324 | if score >= 2 and temporal_iou >= 0.3 and mIoU >= 0.1:
325 | domains[domain]["vqa_temp_spatial_idx"].append(idx)
326 | durations[duration]["vqa_temp_spatial_idx"].append(idx)
327 | overall_stats["vqa_temp_spatial_idx"].append(idx)
328 | if score >= 2 and temporal_iou_2 >= 0.3 and mIoU_2 >= 0.1:
329 | domains[domain]["vqa_temp_spatial_idx_2"].append(idx)
330 | durations[duration]["vqa_temp_spatial_idx_2"].append(idx)
331 | overall_stats["vqa_temp_spatial_idx_2"].append(idx)
332 |
333 | with open(f'metrics/{model_name}_merged_v2_metrics.json', 'w') as f:
334 | json.dump(data, f, indent=4)
335 |
336 | def print_stats(label, stats, total_samples):
337 | avg_all_score = np.mean(stats["all_rating"])
338 | avg_valid_score = np.mean(stats["valid_rating"]) if stats["valid_rating"] else 0
339 | acc_vqa = stats["correct_num"] / total_samples
340 |
341 | r1_iou30 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious"]])
342 | r1_iou50 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious"]])
343 | r1_iou70 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious"]])
344 | mean_temporal_iou = np.mean(stats["temporal_ious"])
345 |
346 | r1_iou30_2 = np.mean([1 if iou >= 0.3 else 0 for iou in stats["temporal_ious_2"]])
347 | r1_iou50_2 = np.mean([1 if iou >= 0.5 else 0 for iou in stats["temporal_ious_2"]])
348 | r1_iou70_2 = np.mean([1 if iou >= 0.7 else 0 for iou in stats["temporal_ious_2"]])
349 | mean_temporal_iou_2 = np.mean(stats["temporal_ious_2"])
350 |
351 | mean_aps = [np.mean(ar_list) for ar_list in stats["spatial_aps"]]
352 | mean_miou = np.mean(stats["spatial_mious"])
353 |
354 | mean_aps_2 = [np.mean(ar_list) for ar_list in stats["spatial_aps_2"]]
355 | mean_miou_2 = np.mean(stats["spatial_mious_2"])
356 |
357 |
358 | vqa_temp = len(stats["vqa_temporal_idx"]) / total_samples
359 | vqa_temp_2 = len(stats["vqa_temporal_idx_2"]) / total_samples
360 | vqa_spat = len(stats["vqa_spatial_idx"]) / total_samples
361 | vqa_spat_2 = len(stats["vqa_spatial_idx_2"]) / total_samples
362 | temp_spat = len(stats["temporal_spatial_idx"]) / total_samples
363 | temp_spat_2 = len(stats["temporal_spatial_idx_2"]) / total_samples
364 | vqa_temp_spat = len(stats["vqa_temp_spatial_idx"]) / total_samples
365 | vqa_temp_spat_2 = len(stats["vqa_temp_spatial_idx_2"]) / total_samples
366 |
367 | print(f"{label}:")
368 | print(f"VQA: Avg All Score: {avg_all_score:.4f}, Avg Valid Score: {avg_valid_score:.4f}, Accuracy: {acc_vqa:.4f}")
369 | print("Chain 1:")
370 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30:.4f}, R1@IoU=0.5: {r1_iou50:.4f}, R1@IoU=0.7: {r1_iou70:.4f}, Mean IoU: {mean_temporal_iou:.4f}")
371 | print(f"Spatial Answer: mAP@0.1: {mean_aps[0]:.4f}, mAP@0.3: {mean_aps[1]:.4f}, mAP@0.5: {mean_aps[2]:.4f}, mAP@0.7: {mean_aps[3]:.4f}, mAP@0.9: {mean_aps[4]:.4f}, Mean mIoU: {mean_miou:.4f}")
372 | print("\n")
373 | print("Chain 2:")
374 | print(f"Temporal Answer: R1@IoU=0.3: {r1_iou30_2:.4f}, R1@IoU=0.5: {r1_iou50_2:.4f}, R1@IoU=0.7: {r1_iou70_2:.4f}, Mean IoU: {mean_temporal_iou_2:.4f}")
375 | print(f"Spatial Answer: mAP@0.1: {mean_aps_2[0]:.4f}, mAP@0.3: {mean_aps_2[1]:.4f}, mAP@0.5: {mean_aps_2[2]:.4f}, mAP@0.7: {mean_aps_2[3]:.4f}, mAP@0.9: {mean_aps_2[4]:.4f}, Mean mIoU: {mean_miou_2:.4f}")
376 | print("\n")
377 |
378 | AM = (acc_vqa + mean_temporal_iou + mean_miou)/3
379 | AM2 = (acc_vqa + mean_temporal_iou_2 + mean_miou_2)/3
380 | mAM = (AM + AM2) / 2
381 |
382 | LGM = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou) + math.log(1 - mean_miou)) / 3
383 | LGM2 = -(math.log(1 - acc_vqa) + math.log(1 - mean_temporal_iou_2) + math.log(1 - mean_miou_2)) / 3
384 | mLGM = (LGM + LGM2) / 2
385 |
386 | print(f"AM1:{AM:.4f}, AM2:{AM2:.4f}, mAM:{mAM:.4f}")
387 | print(f"LGM1:{LGM:.4f}, LGM2:{LGM2:.4f}, mLGM:{mLGM:.4f}\n")
388 |
389 | print("Combined results:")
390 | print(f"VQA & Temp: Chain 1: {vqa_temp:.4f}, Chain 2: {vqa_temp_2:.4f}")
391 | print(f"VQA & Spat: Chain 1: {vqa_spat:.4f} Chain 2: {vqa_spat_2:.4f}")
392 | print(f"Temp & Spat: Chain 1: {temp_spat:.4f} Chain 2: {temp_spat_2:.4f}")
393 | print(f"VQA & Temp & Spat: Chain 1:{vqa_temp_spat:.4f} Chain 2: {vqa_temp_spat_2:.4f}")
394 | print(f"VQA & Temp list: \n Chain 1:{stats['vqa_temporal_idx']} \nChain 2:{stats['vqa_temporal_idx_2']}")
395 | print(f"VQA & Spat list: \n Chain 1:{stats['vqa_spatial_idx']} \n Chain 2: {stats['vqa_spatial_idx_2']}")
396 | print(f"Temp & Spat list: \n Chain 1:{stats['temporal_spatial_idx']} \n Chain 2: {stats['temporal_spatial_idx_2']}")
397 | print(f"VQA & Temp & Spat list: \n Chain 1:{stats['vqa_temp_spatial_idx']} \n Chain 2:{stats['vqa_temp_spatial_idx_2']}\n")
398 |
399 | print_stats("Overall Statistics", overall_stats, len(data))
400 | for duration, stats in durations.items():
401 | print_stats(f"Video Length: {duration}", stats, len(stats["all_rating"]))
402 | for domain, stats in domains.items():
403 | print_stats(f"Domain: {domain}", stats, len(stats["all_rating"]))
404 |
405 |
406 | # print("\nEvaluating GPT-4o:\n")
407 | # evaluate_json('results/gpt4o/gpt4o_answer_update_merged.json')
408 |
409 | # print("\nEvaluating Gemini-2-Flash:\n")
410 | # evaluate_json('results/gemini2/gemini2_answer_merged.json')
411 |
412 | # print("\nEvaluating Video-Llama3:\n")
413 | # evaluate_json('results/videollama3/videollama3_answer_merged.json')
414 |
415 | print("\nEvaluating Qwen2.5-VL:")
416 | evaluate_json('results/qwen2_5/qwen2-5-32B_answer_update.json')
417 |
418 | # print("\nEvaluating InternVL-2.5-38B:")
419 | # evaluate_json('results/internvl2_5/internvl2-5-38B_answer.json')
420 |
421 | # print("\nEvaluating InternVL-2.5:")
422 | # evaluate_json('results/internvl2_5/internvl2_5_answer_merged.json')
423 |
424 | # print("\nEvaluating Llava-Video:")
425 | # evaluate_json('results/llava-video/llavavideo_answer_merged.json')
426 |
427 | # print("\nEvaluating Qwen2-VL:")
428 | # evaluate_json('results/qwen2/qwen2vl_answer_merged.json')
429 |
430 | # print("\nEvaluating VideoChat2:")
431 | # evaluate_json('results/videochat2/videochat2_answer_merged.json')
432 |
433 | # print("\nEvaluating Oryx-1.5:")
434 | # evaluate_json('results/oryx-1.5/oryx15_answer_merged.json')
435 |
436 | # print("\nEvaluating VideoCCAM:")
437 | # evaluate_json('results/videoccam/videoccam12_answer_merged.json')
438 |
439 | # print("\nEvaluating TimeChat:")
440 | # evaluate_json('results/timechat/timechat_answer_merged.json')
441 |
442 | # print("\nEvaluating VTimeLLM:")
443 | # evaluate_json('results/vtimellm/vtimellm_answer_merged.json')
444 |
445 | # print("\nEvaluating Trace:")
446 | # evaluate_json('results/trace/trace_answer_merged.json')
447 |
448 | # print("\nEvaluating Sa2VA:")
449 | # evaluate_json('results/sa2va/sa2va_answer_merged.json')
450 |
451 |
452 |
--------------------------------------------------------------------------------
/inference_demo.py:
--------------------------------------------------------------------------------
1 | import cv2 # We're using OpenCV to read video, to install !pip install opencv-python
2 | import base64
3 | import time
4 | import os
5 | import requests
6 | from tqdm import tqdm
7 | import json
8 | import re
9 | import torch
10 | import math
11 |
12 | from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
13 | from qwen_vl_utils import process_vision_info
14 |
15 | model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
16 |
17 | model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
18 | model_path,
19 | torch_dtype=torch.bfloat16,
20 | attn_implementation="flash_attention_2",
21 | device_map="auto"
22 | )
23 | processor = AutoProcessor.from_pretrained(model_path)
24 |
25 | def inference(video_path, prompt, max_new_tokens=2048, total_pixels=20480 * 28 * 28, min_pixels=16 * 28 * 28):
26 | messages = [
27 | {"role": "system", "content": "You are a helpful assistant."},
28 | {"role": "user", "content": [
29 | {"type": "text", "text": prompt},
30 | {"type": "video", "video": video_path, "total_pixels": total_pixels, "min_pixels": min_pixels, "fps": 1.0},
31 | ]
32 | },
33 | ]
34 | text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
35 | image_inputs, video_inputs, video_kwargs = process_vision_info([messages], return_video_kwargs=True)
36 | fps_inputs = video_kwargs['fps']
37 | print("video input:", video_inputs[0].shape)
38 | num_frames, _, resized_height, resized_width = video_inputs[0].shape
39 | # print("num of video tokens:", int(num_frames / 2 * resized_height / 28 * resized_width / 28))
40 | inputs = processor(text=[text], images=image_inputs, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt")
41 | inputs = inputs.to('cuda')
42 |
43 | output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
44 | generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
45 | output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
46 | return output_text[0], resized_height, resized_width
47 |
48 | def read_anno(anno_file):
49 | with open(anno_file, 'r') as f:
50 | data = json.load(f)
51 | return data
52 |
53 | def find_video(video_folder, vid):
54 | """
55 | Finds the vid.mp4 file in the video_folder and its subfolders.
56 |
57 | Args:
58 | video_folder (str): path of the folder to search.
59 | vid (str): the filename of the video (without extension).
60 |
61 | Returns:
62 | str: absolute path to the vid.mp4 file, or None if not found.
63 | """
64 | target_filename = f"{vid}.mp4"
65 | for root, _, files in os.walk(video_folder):
66 | if target_filename in files:
67 | return os.path.join(root, target_filename)
68 | return None
69 |
70 | def get_answer_vqa(data, video_path):
71 | prompt = f"Answer the question about the video: {data['question']} \n (If the answer is a person, you don't need to identify the person.)"
72 | answer_vqa, _, _ = inference(video_path, prompt)
73 | return answer_vqa
74 |
75 | def get_answer_temporal(data, video_path):
76 | video_length = round(data['frame_count']/data['fps'], 2)
77 | temporal_question = data['temporal_question']
78 | prompt = f"This video is {video_length} seconds long. Answer the question about the video: {temporal_question} \n Output the start and end moment timestamps."
79 | answer_temporal, _, _ = inference(video_path, prompt)
80 | return answer_temporal
81 |
82 | def get_answer_temporal_2(data, video_path, bboxes):
83 | video_length = round(data['frame_count']/data['fps'], 2)
84 | temporal_question = data['temporal_question']
85 | w, h = data['width'], data['height']
86 | prompt = f"This video is {video_length} seconds long with a resolution of {w}x{h} (width x height). Answer the question about the video: {temporal_question} \n There are {len(bboxes)} bounding boxes of the key object related to the question in the video without knowing the time, which are:{bboxes}. Output the start and end moment timestamps."
87 | answer_temporal, _, _ = inference(video_path, prompt)
88 | return answer_temporal
89 |
90 | def get_answer_spatial(data, video_path):
91 | video_length = round(data['frame_count']/data['fps'], 2)
92 | st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
93 | time_range = list(range(st, et + 1))
94 | w, h = data['width'], data['height']
95 | spatial_question = data['spatial_question']
96 | prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
97 | For each whole second within the time range {time_range} provided (inclusive of the boundaries), output a series of bounding boxes of the object in JSON format. The keys should be the whole seconds (as strings), and the values should be the box in [x1, y1, x2, y2] format.
98 | Example output: {{"{time_range[0]}": [x1, y1, x2, y2],...}}
99 | """
100 | answer_spatial, input_height, input_width = inference(video_path, prompt)
101 | return answer_spatial, input_height, input_width
102 |
103 | def get_answer_spatial_2(data, video_path, bboxes):
104 | video_length = round(data['frame_count']/data['fps'], 2)
105 | st, et = math.ceil(data['timestamps'][0]), math.floor(data['timestamps'][1])
106 | time_range = list(range(st, et + 1))
107 | w, h = data['width'], data['height']
108 | spatial_question = data['spatial_question_2']
109 | prompt = f"""Please answer the question about the video: {spatial_question} with a series of bounding boxes in [x1, y1, x2, y2] format. \n
110 | For each whole second that may be related to the question, output a series of bounding boxes of the object in JSON format. You only need to output {len(bboxes)} bbox(es). You need to determine which frame is related to the question, and you don't need to output the bbox for the frames not related to the question.
111 | The keys should be the whole seconds (as strings), and the values should be the bounding box in [x0,y0,x1,y1] format.
112 | \n Example output:
113 | {{"0": [x0,y0,x1,y1], "1":..., ..., "{len(bboxes)}":...}} (if the frames at 0~{len(bboxes)} second are related to the questions)
114 | """
115 | answer_spatial, input_height, input_width = inference(video_path, prompt)
116 | return answer_spatial, input_height, input_width
117 |
118 | def extract_timestamps(result):
119 | """extract timestamps from the answer"""
120 | match = re.findall(r"\b\d+(?:\.\d+)?\b", result)
121 | return [float(match[0]), float(match[1])] if len(match) == 2 else []
122 |
123 |
124 | def fix_incomplete_json(json_str):
125 | """
126 | fix the incomplete brackets of the json
127 | """
128 | # Counting left and right brackets
129 | open_square = json_str.count('[')
130 | close_square = json_str.count(']')
131 | open_curly = json_str.count('{')
132 | close_curly = json_str.count('}')
133 |
134 | # Complete the square brackets
135 | if open_square > close_square:
136 | json_str += ']' * (open_square - close_square)
137 | elif close_square > open_square:
138 | json_str = '[' * (close_square - open_square) + json_str
139 |
140 | # Complete the curly brackets
141 | if open_curly > close_curly:
142 | json_str += '}' * (open_curly - close_curly)
143 | elif close_curly > open_curly:
144 | json_str = '{' * (close_curly - open_curly) + json_str
145 |
146 | return json_str
147 |
148 |
149 | def extract_bounding_boxes(answer_spatial, data, input_height, input_width):
150 | """
151 | Extract bounding boxes from the input answer_spatial and denormalize the coordinates using the width and height from the data.
152 | """
153 | w, h = data['width'], data['height']
154 |
155 | def denormalize_bbox(bbox):
156 | """
157 | denormalize the coordinates of bbox
158 | """
159 | try:
160 | if len(bbox) == 1:
161 | bbox = bbox[0]
162 | if len(bbox) == 2:
163 | bbox = bbox[1]
164 | x_min = int(bbox[0] / input_width * w)
165 | y_min = int(bbox[1] / input_height * h)
166 | x_max = int(bbox[2] / input_width * w)
167 | y_max = int(bbox[3] / input_height * h)
168 | return [x_min, y_min, x_max, y_max]
169 | except Exception as e:
170 | print(f"Error {e} while processing bbox {bbox}")
171 | return bbox
172 |
173 | # match markdown json
174 | markdown_pattern = r'```json\s*\n(\[.*?\]|\{.*?\})\s*\n```'
175 | match = re.search(markdown_pattern, answer_spatial, re.DOTALL)
176 | if not match:
177 | # If there is no Markdown wrapper, then try to match the JSON format directly
178 | json_pattern = r'(\[[\s\S]*\]|\{[\s\S]*\})'
179 | match = re.search(json_pattern, answer_spatial, re.DOTALL)
180 | if match:
181 | # match bbox in JSON
182 | bounding_boxes_str = match.group(1).strip()
183 | # Replace single quotes with double quotes to conform to the JSON specification
184 | bounding_boxes_str = bounding_boxes_str.replace("'", '"')
185 | try:
186 | # Convert strings to dictionary or list format
187 | bounding_boxes = json.loads(bounding_boxes_str)
188 | # If it's a list and contains a dictionary inside, expand it to a single dictionary
189 | if isinstance(bounding_boxes, list) and all(isinstance(item, dict) for item in bounding_boxes):
190 | combined_dict = {}
191 | for item in bounding_boxes:
192 | combined_dict.update(item)
193 | bounding_boxes = combined_dict
194 | # Determine if the extracted JSON is a dictionary or a list.
195 | if isinstance(bounding_boxes, list):
196 | # bounding boxes in list
197 | return {str(box[0]): denormalize_bbox(box[1]) for box in bounding_boxes}
198 | elif isinstance(bounding_boxes, dict):
199 | # bounding boxes in dictionary
200 | return {key: denormalize_bbox(value) for key, value in bounding_boxes.items()}
201 | except Exception as e:
202 | # if failed, try to fix it.
203 | fixed_bounding_boxes_str = fix_incomplete_json(bounding_boxes_str)
204 | try:
205 | bounding_boxes = json.loads(fixed_bounding_boxes_str)
206 | if isinstance(bounding_boxes, list):
207 | return [denormalize_bbox(box) for box in bounding_boxes]
208 | elif isinstance(bounding_boxes, dict):
209 | return {key: denormalize_bbox(value) for key, value in bounding_boxes.items()}
210 | except Exception as e:
211 | print(f"Failed after fixing: {e}\nExtracted JSON: {fixed_bounding_boxes_str}")
212 | return None
213 | else:
214 | print("No match found for the bounding box JSON.")
215 | return None
216 |
217 | def test_qwen2_5vl(video_folder, anno_file, result_file):
218 | anno = read_anno(anno_file)
219 |
220 | for idx, data in enumerate(tqdm(anno, desc="Processing videos", unit="video")):
221 | try:
222 | vid = data['vid']
223 | timestamps = data['timestamps']
224 | video_length = round(data['frame_count']/data['fps'], 1)
225 | boxes = [[box_data["xmin"], box_data["ymin"], box_data["xmax"], box_data["ymax"]] \
226 | for box in data["bboxes"] for box_data in box.values()]
227 | video_path = find_video(video_folder, vid)
228 | answer_vqa = get_answer_vqa(data, video_path)
229 | # chain one
230 | answer_temporal = get_answer_temporal(data, video_path)
231 | answer_temporal_post = extract_timestamps(answer_temporal)
232 |
233 | answer_spatial, input_height, input_width = get_answer_spatial(data, video_path)
234 | answer_spatial_post = extract_bounding_boxes(answer_spatial, data, input_height, input_width)
235 |
236 | # chain two
237 | answer_spatial_2, input_height, input_width = get_answer_spatial_2(data, video_path, boxes)
238 | answer_spatial_post_2 = extract_bounding_boxes(answer_spatial_2, data, input_height, input_width)
239 |
240 | answer_temporal_2 = get_answer_temporal_2(data, video_path, boxes)
241 | answer_temporal_post_2 = extract_timestamps(answer_temporal_2)
242 |
243 | # update data
244 | data['answer_vqa'] = answer_vqa
245 | data['answer_temporal_pre'] = answer_temporal
246 | data['answer_temporal'] = answer_temporal_post
247 | data['answer_spatial_pre'] = answer_spatial
248 | data['answer_spatial'] = answer_spatial_post
249 |
250 | data['answer_spatial_pre_2'] = answer_spatial_2
251 | data['answer_spatial_2'] = answer_spatial_post_2
252 | data['answer_temporal_pre_2'] = answer_temporal_2
253 | data['answer_temporal_2'] = answer_temporal_post_2
254 | # update result_file
255 | with open(result_file, 'w') as f:
256 | json.dump(anno, f, indent=4)
257 | except Exception as e:
258 | print("")
259 | print(f"ERROR in data {idx}: {e}.")
260 | continue
261 |
262 |
263 | if __name__ == "__main__":
264 | video_folder = "/Path/to/video/folder"
265 | anno_file = "/path/to/anno/file.json"
266 | result_file = "/path/to/result/file.json"
267 | test_qwen2_5vl(video_folder, anno_file, result_file)
268 |
--------------------------------------------------------------------------------