├── README.md
├── all_task.json
├── asset
│   ├── domain_123.jpeg
│   ├── domain_45.jpeg
│   ├── domain_45.png
│   ├── key_domains.png
│   ├── results_tra.png
│   ├── results_vlms.png
│   └── sun.jpeg
├── evaluate_utils.py
├── evaluate_vlm.py
└── qwen2vl_all.py
/README.md:
--------------------------------------------------------------------------------
1 | # ✨✨VLADBench✨✨: Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
2 |
3 |
4 |
5 | 
6 | 
7 | 
8 | 
9 | 
10 | 
11 |
12 |
13 |
17 |
18 |
19 |
20 | # Overview
21 |
22 | Current benchmarks designed for VLM-based AD face several notable limitations:
23 |
24 | - **Coarse-grained Categories**: The underlying datasets of
25 | VLM-based models are often simplistic, typically dividing tasks into perception, prediction, and planning
26 | with reasoning. Such coarse categories are insufficient for evaluating the nuanced cognitive and reasoning abilities required for safe and
27 | reliable AD.
28 |
29 | - **Lack of Dynamic Elements**: Both static and dynamic scenes are crucial for evaluating AD systems. A robust analysis of dynamic elements is particularly important for validating temporal reasoning capabilities, especially for understanding the intentions of traffic participants within a scene and for executing the nuanced spatio-temporal reasoning required for safe navigation.
30 |
31 | - **Homogeneous Data**:
32 | Existing VLM-based AD datasets often suffer from a lack
33 | of diversity, which limits the ability to test models across
34 | a wide range of real-world scenarios. This narrow coverage
35 | restricts the evaluation of zero-shot generalization and of
36 | performance on challenging corner cases.
37 |
38 | We introduce **VLADBench**, specifically designed to rigorously evaluate the capabilities of VLMs in AD. VLADBench employs a hierarchical structure that reflects the complex skill set required for reliable driving, progressing from fundamental comprehension of scene and traffic elements to advanced reasoning and decision-making.
39 |
40 | - With 2000 static scenes and 3000 dynamic scenarios, VLADBench spans 5 primary domains: Traffic Knowledge Understanding (TKU), General Element Recognition (GER), Traffic Graph Generation (TGG), Target Attribute Comprehension (TAC), and Ego Decision-making and Planning (EDP). For a more detailed assessment, 11 secondary aspects and 29 tertiary tasks are defined, resulting in a total of 12K questions. The full hierarchy is listed in `all_task.json` (a minimal sketch of enumerating it follows this list).
41 | - VLADBench is built from existing publicly available datasets, meticulously curated through manual selection across 12 sources, and is designed to challenge VLM capabilities in diverse and demanding driving situations.
42 | - To further investigate the intersections among the 5 key domains, we collect and construct approximately 1.4M AD-specific QAs from public resources. We then categorize these QAs
43 | using GPT-4 and train models on individual domain-specific (DS) datasets. Finally, we validate the trained models on VLADBench to assess their performance across the different domains.
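
Below is a minimal sketch (not part of the released scripts) of how the domain / aspect / task hierarchy in `all_task.json` can be enumerated; it only assumes the script is run from the repository root.

```python
import json

# Load the 3-level task hierarchy shipped with the benchmark.
with open("all_task.json", "r", encoding="utf-8") as f:
    hierarchy = json.load(f)

n_tasks = 0
for domain, aspects in hierarchy.items():        # 5 primary domains
    print(domain)
    for aspect, tasks in aspects.items():        # 11 secondary aspects
        print(f"  {aspect}")
        for task in tasks:                       # 29 tertiary tasks
            print(f"    {task}")
            n_tasks += 1
print(f"Total tertiary tasks: {n_tasks}")
```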
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 | # Examples
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 | # Evaluation Results
60 |
61 |
62 |
63 |
64 | For the detailed results of each task and the results of the large-scale models, please see the paper.
65 |
66 |
67 | # Evaluation Pipeline
68 |
69 | 1. We provide a test example based on Qwen2-VL: [qwen2vl_all.py].
70 | 2. Run `evaluate_vlm.py` to compute the scores; a hedged sketch of the per-task result format that the scoring utilities expect is shown below.
71 |
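As a rough illustration of what the scoring step consumes, the sketch below builds one fake per-task record with the fields that the `evaluate_utils.py` scorers read (`questions`, `reference`, `prediction`, plus `image_path` and `dimension` for grounding tasks) and scores it with `Geneal_criterion_QA`; all field values here are made up for illustration.

```python
from evaluate_utils import Geneal_criterion_QA

# One hypothetical record: each question string carries an image path before ';'
# and lists its answer options inside [...], matching how the utilities parse it.
third_task_data = [{
    "image_path": "asset/sun.jpeg",
    "dimension": [1920, 1080],
    "questions": ["asset/sun.jpeg; What is the weather in the scene? Options: ['sunny', 'rainy']"],
    "reference": ["sunny"],
    "prediction": ["The correct answer is: sunny"],
}]

total, accuracy, follow_rate, _ = Geneal_criterion_QA(third_task_data)
print(total, accuracy, follow_rate)  # expected: 1 1.0 1.0 (one question, answered correctly)
```
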
72 | **Note**
73 | **1. The bounding boxes in VLADBench are NOT resized. You should modify them in the [prompt](https://github.com/Depth2World/VLADBench/blob/8156375d6b0e88ec5ca4f5a8119a9a0ee6c1ed18/qwen2vl_all.py#L79) and the [evaluation](https://github.com/Depth2World/VLADBench/blob/8156375d6b0e88ec5ca4f5a8119a9a0ee6c1ed18/evaluate_utils.py#L170) for different VLMs (a hedged sketch of such a rescaling is given at the end of this README).**
74 | **2. The final scores do NOT include the trajectory evaluation.**
75 |
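For Note 1, the sketch below shows the kind of rescaling that is needed; `rescale_box` is a hypothetical helper, not part of this repository. It maps a pixel-space box onto the 0-1000 grid that `evaluate_utils.py` applies to the ground truth for models such as Qwen2-VL and InternVL; other VLMs may expect a different convention.

```python
def rescale_box(box, width, height, grid=1000):
    """Map a pixel-space [x1, y1, x2, y2] box onto a grid x grid coordinate frame."""
    x1, y1, x2, y2 = box
    return [int(grid * x1 / width), int(grid * y1 / height),
            int(grid * x2 / width), int(grid * y2 / height)]

# Example: a ground-truth box on a 1920x1080 image.
print(rescale_box([480, 270, 960, 540], 1920, 1080))  # -> [250, 250, 500, 500]
```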
--------------------------------------------------------------------------------
/all_task.json:
--------------------------------------------------------------------------------
1 | {
2 | "Traffic_Knowledge_Understanding": {
3 | "Road_Traffic_Signals": [
4 | "Traffic_Light",
5 | "Pavement_Marking",
6 | "Traffic_Sign"
7 | ],
8 | "Road_Passage_Provisions": [
9 | "Right_Of_Way"
10 | ]
11 | },
12 | "General_Element_Recognition": {
13 | "Foreground": [
14 | "VRU_Recognition",
15 | "Vehicle_Recognition",
16 | "Vehicle_Status",
17 | "Lane_Recognition",
18 | "Obstruction_Recognition"
19 | ],
20 | "Background": [
21 | "Light",
22 | "Weather"
23 | ]
24 | },
25 | "Traffic_Graph_Generation": {
26 | "Signal_Element_Relation": [
27 | "Sign_Sign_Relation",
28 | "Sign_Lane_Relation",
29 | "Light_Lane_Relation"
30 | ],
31 | "Lane_Element_Relation": [
32 | "Lane_Speed_Relation",
33 | "Lane_Change_Relation"
34 | ]
35 | },
36 | "Target_Attribute_Comprehension": {
37 | "Intention_Judgment": [
38 | "VRU_Cutin",
39 | "Vehicle_Cutin",
40 | "VRU_Cross",
41 | "Long_Short_Parking"
42 | ],
43 | "Behavior_Understanding": [
44 | "Vehicle_Bahavior",
45 | "VRU_Bahavior"
46 | ]
47 | },
48 | "Ego_Decision_Planning": {
49 | "Ego_Action_Reasoning": [
50 | "Key_Obsturction_Detection",
51 | "Spatial_Temporal_Reasoning",
52 | "Risk_Prediction",
53 | "Drive_Efficiency"
54 | ],
55 | "Meta_Action_Decision": [
56 | "Longitudinal",
57 | "Lateral"
58 | ],
59 | "Ego_trajectory_Planning":[
60 | "Trajectory"
61 | ]
62 | }
63 | }
64 |
--------------------------------------------------------------------------------
/asset/domain_123.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_123.jpeg
--------------------------------------------------------------------------------
/asset/domain_45.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_45.jpeg
--------------------------------------------------------------------------------
/asset/domain_45.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_45.png
--------------------------------------------------------------------------------
/asset/key_domains.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/key_domains.png
--------------------------------------------------------------------------------
/asset/results_tra.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/results_tra.png
--------------------------------------------------------------------------------
/asset/results_vlms.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/results_vlms.png
--------------------------------------------------------------------------------
/asset/sun.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/sun.jpeg
--------------------------------------------------------------------------------
/evaluate_utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import re
4 | import numpy as np
5 |
6 |
7 |
8 | def weighted_row_sum(data, third_rows, weight_col=1, start_col=2):
9 |
10 | data = np.array(data)
11 | m,n = data.shape
12 | rows = slice(m-third_rows, m)
13 | cols = slice(start_col, None)
14 | weighted_sum = np.sum(data[rows, cols].astype(float) * data[rows, weight_col].astype(float)[:, np.newaxis], axis=0) / np.sum(data[rows, weight_col].astype(float))
15 | weighted_sum = ['Mean',np.sum(data[rows, weight_col].astype(float))] + weighted_sum.tolist()
16 | temp = data.tolist()
17 | temp.append(weighted_sum)
18 | return temp
19 |
20 |
21 |
22 | def weighted_total(data, weight_col=1, start_col=2):
23 | data = np.array(data)
24 | m,n = data.shape
25 | rows = slice(0, m)
26 | cols = slice(start_col, None)
27 | weighted_sum = np.sum(data[rows, cols].astype(float) * data[rows, weight_col].astype(float)[:, np.newaxis], axis=0) / np.sum(data[rows, weight_col].astype(float))
28 | weighted_sum = ['Total',np.sum(data[rows, weight_col].astype(float))] + weighted_sum.tolist()
29 | return weighted_sum
30 |
31 |
32 | def box_iou(boxA, boxB):
33 | boxA = [int(x) for x in boxA]
34 | boxB = [int(x) for x in boxB]
35 |
36 | xA = max(boxA[0], boxB[0])
37 | xB = min(boxA[2], boxB[2])
38 | yA = max(boxA[1], boxB[1])
39 | yB = min(boxA[3], boxB[3])
40 |
41 | interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
42 |
43 | boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
44 | boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)
45 |
46 | iou = interArea / float(boxAArea + boxBArea - interArea)
47 |
48 | return iou
49 | def clean_string(s):
50 | while s and (s[0] in ":[]()' ."):
51 | s = s[1:]
52 | while s and (s[-1] in ":[]()' ."):
53 | s = s[:-1]
54 | return s
55 |
56 | def convert_if_number(answer):
57 | if isinstance(answer, (int, float)):
58 | return str(answer)
59 | return answer
60 |
61 | def remove_symbols(input_string):
62 | if 'correct answer is:' in input_string:
63 | input_string = input_string.split('correct answer is:')[-1]
64 | cleaned_string = re.sub(r'[\*\n\""]', '', input_string)
65 | return cleaned_string
66 |
67 | def extract_options(text):
68 |
69 | pattern = re.compile(r"\[([^\]]+)\]")
70 | matches = pattern.findall(text)
71 |
72 | if matches:
73 | option_string = matches[-1]
74 | if "'" not in option_string:
75 | option_list = option_string.split(", ")
76 | else:
77 | option_list = [item.strip().strip("'") for item in option_string.split("', '")]
78 | return option_list
79 | return []
80 |
81 |
82 | def compare_and_count(array_a, array_b):
83 | count = 0
84 | for a, b in zip(array_a, array_b):
85 | if a == 1 and b == 1: count+=1
86 | if a > b:count+=1
87 | return count
88 |
89 | def isfile(path):
90 | return os.path.isfile(path)
91 |
92 |
93 | def load_json_data(path):
94 | with open(path,'r',encoding='utf-8') as json_f:
95 | task_data = json.load(json_f)
96 | return task_data
97 |
98 | def save_json_data(path,data):
99 | with open(path,'w',encoding='utf-8') as json_f:
100 | json.dump(data,json_f,ensure_ascii=False,indent=4)
101 |
102 | def Geneal_criterion_QA(third_task_data,MODEL=None):
103 | ques_total_num = 0
104 | right_num = 0
105 | obey_insytruction = 0
106 | for d_ind, sample in enumerate(third_task_data):
107 | reference = sample['reference']
108 | prediction = sample['prediction']
109 | for q_ind, pred in enumerate(prediction):
110 | # print(sample['image_path'])
111 | ques_nopath = ''.join(sample['questions'][q_ind].lower().split(';')[1:])
112 | tips = extract_options(ques_nopath)
113 | # print(tips)
114 | if len(tips)==0: pass
115 | # if len(tips)!=0: print('No tips',sample['image_path'])
116 | # print('No tips',sample['image_path'])
117 | # print(ques_nopath)
118 | pred = remove_symbols(pred)
119 | ques_total_num += 1
120 | clean_pred = clean_string(pred).lower()
121 | options_nums = clean_pred.split("', '")
122 | reference_q_ind = convert_if_number(reference[q_ind]).lower()
123 | if len(options_nums)==1:
124 | if clean_pred in ques_nopath:
125 | obey_insytruction+=1
126 | if clean_pred==reference_q_ind:
127 | right_num+=1
128 | elif reference_q_ind in clean_pred:
129 | ### filter
130 | if reference_q_ind in tips:
131 | tips.remove(reference_q_ind)
132 | if not any(tip in clean_pred for tip in tips):
133 | right_num+=1
134 | return ques_total_num,right_num/ques_total_num,obey_insytruction/ques_total_num,0
135 |
136 |
137 | def Grounding_criterion_QA(third_task_data,MODEL=None):
138 | if MODEL ==None:
139 | print('MODEL Input Lacked')
140 | return -1
141 | resize_model_lists = ["qwen", "internvl", "gemini","DriveMM",'ivl']
142 | ques_total_num = 0
143 | right_num = 0
144 | loc_union = []
145 | obey_insytruction = 0
146 | PATTERN = re.compile(r'\[\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*\]')
147 | box_num = 0
148 | for d_ind, sample in enumerate(third_task_data):
149 | reference = sample['reference']
150 | prediction = sample['prediction']
151 | for q_ind, pred in enumerate(prediction):
152 | # print(sample['image_path'])
153 | ques_total_num += 1
154 | ques_nopath = ''.join(sample['questions'][q_ind].lower().split(';')[1:])
155 | if 'located in the image?' in ques_nopath:
156 | matches = PATTERN.findall(pred)
157 | cleaned_matches = [[float(re.sub(r'[^0-9.]', '', part)) for part in match] for match in matches]
158 | if len(matches)==1:
159 | box_num+=1
160 | obey_insytruction+=1
161 | predict_bbox = cleaned_matches[0]
162 | else:
163 | predict_bbox = [0.0, 0.0, 0.0, 0.0]
164 |
165 | if sum(predict_bbox) <4:
166 | predict_bbox = [x * 1000 for x in predict_bbox]
167 | if any(mn.lower() in MODEL.lower() for mn in resize_model_lists):
168 | bbox_gt = sample['reference'][q_ind]
169 | width,height = sample['dimension']
170 | bbox_gt = [int(1000*bbox_gt[0]/width), int(1000*bbox_gt[1]/height), int(1000*bbox_gt[2]/width), int(1000*bbox_gt[3]/height)]
171 | elif MODEL =="gemini":
172 | bbox_gt = sample['reference'][q_ind]; bbox_gt = [bbox_gt[1], bbox_gt[0], bbox_gt[3], bbox_gt[2]]  # convert the GT box to [y_min, x_min, y_max, x_max] order
173 | else:
174 | bbox_gt = sample['reference'][q_ind]
175 |
176 | iou = box_iou(predict_bbox, bbox_gt)
177 | if iou > 0.5: right_num+=1
178 | loc_union.append(iou)
179 | else:
180 | tips = extract_options(ques_nopath)
181 | # if len(tips)==0:
182 | # print('No tips',sample['image_path'])
183 | # print(sample['questions'][q_ind])
184 | pred = remove_symbols(pred)
185 | clean_pred = clean_string(pred).lower()
186 | options_nums = clean_pred.split("', '")
187 | reference_q_ind = convert_if_number(reference[q_ind]).lower()
188 | if len(options_nums)==1:
189 | if clean_pred in ques_nopath:
190 | obey_insytruction+=1
191 | if clean_pred==reference_q_ind:
192 | right_num+=1
193 |
194 | elif reference_q_ind in clean_pred:
195 | ### filter
196 | if reference_q_ind in tips:
197 | tips.remove(reference_q_ind)
198 | if not any(tip in clean_pred for tip in tips):
199 | right_num+=1
200 |
201 | mean_iou = sum(loc_union)/len(loc_union)
202 | return ques_total_num, right_num/ques_total_num, obey_insytruction/ques_total_num, mean_iou
203 |
204 |
205 |
206 | def Relation_criterion_QA(third_task_data,MODEL=None):
207 | ques_total_num = 0
208 | total_score = 0
209 | obey_insytruction = 0
210 | totol_improve_score = 0
211 | for d_ind, sample in enumerate(third_task_data):
212 | reference = sample['reference']
213 | prediction = sample['prediction']
214 | scores_list = []
215 | for q_ind, pred in enumerate(prediction):
216 | ques_total_num+=1
217 | if 'corresponds to' in pred:
218 | # pattern = r'(?