├── README.md
├── all_task.json
├── asset
│   ├── domain_123.jpeg
│   ├── domain_45.jpeg
│   ├── domain_45.png
│   ├── key_domains.png
│   ├── results_tra.png
│   ├── results_vlms.png
│   └── sun.jpeg
├── evaluate_utils.py
├── evaluate_vlm.py
└── qwen2vl_all.py


/README.md:
--------------------------------------------------------------------------------
1 | # ✨✨VLADBench✨✨: Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
2 |
3 |
4 |
5 | ![VLADBench](https://img.shields.io/badge/Dataset-VLADBench-blue)
6 | ![VQA](https://img.shields.io/badge/Task-Autonomous--Driving--QA-red)
7 | ![Multi-Modal](https://img.shields.io/badge/Modal-Image/Video-red)
8 | ![Models](https://img.shields.io/badge/Models-Open--Source-green)
9 | ![Models](https://img.shields.io/badge/Models-Closed--Source-green)
10 | ![Models](https://img.shields.io/badge/Models-Domain--Specified-green)
11 |
12 |
13 |
14 | 📖 arXiv Paper      15 | 🤗 Dataset 16 |
17 |
18 |
19 |
20 | # Overview
21 |
22 | Current benchmarks designed for VLM-based AD face several notable limitations:
23 |
24 | - **Coarse-grained Categories**: The underlying datasets of
25 | the VLM-based models are often simplistic, typically categorizing tasks into perception, prediction, and planning
26 | with reasoning, which is insufficient for evaluating the nuanced cognitive and reasoning abilities required for safe and
27 | reliable AD.
28 |
29 | - **Lack of Dynamic Elements**: Both static and dynamic scenes are crucial for evaluating AD systems, and a robust analysis of dynamic elements is particularly important for validating temporal reasoning capabilities, especially for understanding the intentions of traffic participants within the scene and for executing the nuanced spatio-temporal reasoning required for safe navigation.
30 |
31 | - **Homogeneous Data**:
32 | Existing VLM-based AD datasets often suffer from a lack
33 | of diversity, which limits the ability to test models across
34 | a wide range of real-world scenarios. This narrow coverage
35 | restricts the evaluation of zero-shot generalization and of
36 | performance on challenging corner cases.
37 |
38 | We introduce **VLADBench**, specifically designed to rigorously evaluate the capabilities of VLMs in AD. VLADBench employs a hierarchical structure that reflects the complex skill set required for reliable driving, progressing from fundamental comprehension of scene and traffic elements to advanced reasoning and decision-making.
39 |
40 | - With 2000 static scenes and 3000 dynamic scenarios, VLADBench spans 5 primary domains: Traffic Knowledge Understanding (TKU), General Element Recognition (GER), Traffic Graph Generation (TGG), Target Attribute Comprehension (TAC), and Ego Decision-making and Planning (EDP). For a more detailed assessment, 11 secondary aspects and 29 tertiary tasks are defined, resulting in a total of 12K questions.
41 | - VLADBench is built from existing publicly available datasets, meticulously curated through manual selection across 12 sources and aimed at challenging VLM capabilities in diverse and demanding driving situations.
42 | - To further investigate the intersections among the 5 key domains, we collect and construct approximately 1.4M AD-specific QAs from public resources. We then categorize these QAs
43 | using GPT-4 and train models on the individual domain-specific (DS) datasets. Finally, we validate the trained models on VLADBench to assess their performance across the different domains.
44 |
45 |
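For reference, the full three-level taxonomy is mirrored in `all_task.json` in this repository. The minimal sketch below (assuming it is run from the repository root) loads that file and enumerates the domains, secondary aspects, and tertiary tasks described above:

```python
import json

# Load the three-level task taxonomy shipped with the benchmark.
with open("all_task.json", "r", encoding="utf-8") as f:
    hierarchy = json.load(f)

# Level 1: primary domains; level 2: secondary aspects; level 3: tertiary tasks.
n_domains = len(hierarchy)
n_aspects = sum(len(aspects) for aspects in hierarchy.values())
n_tasks = sum(len(tasks) for aspects in hierarchy.values() for tasks in aspects.values())
print(f"{n_domains} domains, {n_aspects} secondary aspects, {n_tasks} tertiary tasks")
# Expected: 5 domains, 11 secondary aspects, 29 tertiary tasks

for domain, aspects in hierarchy.items():
    for aspect, tasks in aspects.items():
        print(f"{domain} / {aspect}: {', '.join(tasks)}")
```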

46 | 47 |

48 | 49 | 50 | 51 | # Examples 52 |

53 | 54 |

55 |

56 | 57 |

58 | 59 | # Evaluation Results 60 |

61 | 62 |

63 |
64 | For the detailed results of each task and the results of the large-scale models, please see the paper.
65 |
66 |
67 | # Evaluation Pipeline
68 |
69 | 1. We provide a test example based on Qwen2-VL: [qwen2vl_all.py].
70 | 2. Run evaluate_vlm.py to compute the scores.
71 |
72 | **Note**
73 | **1. The bounding boxes in VLADBench are NOT resized. You should modify them in the [prompt](https://github.com/Depth2World/VLADBench/blob/8156375d6b0e88ec5ca4f5a8119a9a0ee6c1ed18/qwen2vl_all.py#L79) and the [evaluation](https://github.com/Depth2World/VLADBench/blob/8156375d6b0e88ec5ca4f5a8119a9a0ee6c1ed18/evaluate_utils.py#L170) for different VLMs.**
74 | **2. The final scores do NOT include the trajectory evaluation.**
75 |
--------------------------------------------------------------------------------
/all_task.json:
--------------------------------------------------------------------------------
1 | {
2 |     "Traffic_Knowledge_Understanding": {
3 |         "Road_Traffic_Signals": [
4 |             "Traffic_Light",
5 |             "Pavement_Marking",
6 |             "Traffic_Sign"
7 |         ],
8 |         "Road_Passage_Provisions": [
9 |             "Right_Of_Way"
10 |         ]
11 |     },
12 |     "General_Element_Recognition": {
13 |         "Foreground": [
14 |             "VRU_Recognition",
15 |             "Vehicle_Recognition",
16 |             "Vehicle_Status",
17 |             "Lane_Recognition",
18 |             "Obstruction_Recognition"
19 |         ],
20 |         "Background": [
21 |             "Light",
22 |             "Weather"
23 |         ]
24 |     },
25 |     "Traffic_Graph_Generation": {
26 |         "Signal_Element_Relation": [
27 |             "Sign_Sign_Relation",
28 |             "Sign_Lane_Relation",
29 |             "Light_Lane_Relation"
30 |         ],
31 |         "Lane_Element_Relation": [
32 |             "Lane_Speed_Relation",
33 |             "Lane_Change_Relation"
34 |         ]
35 |     },
36 |     "Target_Attribute_Comprehension": {
37 |         "Intention_Judgment": [
38 |             "VRU_Cutin",
39 |             "Vehicle_Cutin",
40 |             "VRU_Cross",
41 |             "Long_Short_Parking"
42 |         ],
43 |         "Behavior_Understanding": [
44 |             "Vehicle_Bahavior",
45 |             "VRU_Bahavior"
46 |         ]
47 |     },
48 |     "Ego_Decision_Planning": {
49 |         "Ego_Action_Reasoning": [
50 |             "Key_Obsturction_Detection",
51 |             "Spatial_Temporal_Reasoning",
52 |             "Risk_Prediction",
53 |             "Drive_Efficiency"
54 |         ],
55 |         "Meta_Action_Decision": [
56 |             "Longitudinal",
57 |             "Lateral"
58 |         ],
59 |         "Ego_trajectory_Planning":[
60 |             "Trajectory"
61 |         ]
62 |     }
63 | }
64 |
--------------------------------------------------------------------------------
/asset/domain_123.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_123.jpeg
--------------------------------------------------------------------------------
/asset/domain_45.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_45.jpeg
--------------------------------------------------------------------------------
/asset/domain_45.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/domain_45.png
--------------------------------------------------------------------------------
/asset/key_domains.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/key_domains.png
--------------------------------------------------------------------------------
/asset/results_tra.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/results_tra.png -------------------------------------------------------------------------------- /asset/results_vlms.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/results_vlms.png -------------------------------------------------------------------------------- /asset/sun.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Depth2World/VLADBench/fc9e407b9c859126320e9527d287135b5bb0f368/asset/sun.jpeg -------------------------------------------------------------------------------- /evaluate_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import re 4 | import numpy as np 5 | 6 | 7 | 8 | def weighted_row_sum(data, third_rows, weight_col=1, start_col=2): 9 | 10 | data = np.array(data) 11 | m,n = data.shape 12 | rows = slice(m-third_rows, m) 13 | cols = slice(start_col, None) 14 | weighted_sum = np.sum(data[rows, cols].astype(float) * data[rows, weight_col].astype(float)[:, np.newaxis], axis=0) / np.sum(data[rows, weight_col].astype(float)) 15 | weighted_sum = ['Mean',np.sum(data[rows, weight_col].astype(float))] + weighted_sum.tolist() 16 | temp = data.tolist() 17 | temp.append(weighted_sum) 18 | return temp 19 | 20 | 21 | 22 | def weighted_total(data, weight_col=1, start_col=2): 23 | data = np.array(data) 24 | m,n = data.shape 25 | rows = slice(0, m) 26 | cols = slice(start_col, None) 27 | weighted_sum = np.sum(data[rows, cols].astype(float) * data[rows, weight_col].astype(float)[:, np.newaxis], axis=0) / np.sum(data[rows, weight_col].astype(float)) 28 | weighted_sum = ['Total',np.sum(data[rows, weight_col].astype(float))] + weighted_sum.tolist() 29 | return weighted_sum 30 | 31 | 32 | def box_iou(boxA, boxB): 33 | boxA = [int(x) for x in boxA] 34 | boxB = [int(x) for x in boxB] 35 | 36 | xA = max(boxA[0], boxB[0]) 37 | xB = min(boxA[2], boxB[2]) 38 | yA = max(boxA[1], boxB[1]) 39 | yB = min(boxA[3], boxB[3]) 40 | 41 | interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1) 42 | 43 | boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1) 44 | boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1) 45 | 46 | iou = interArea / float(boxAArea + boxBArea - interArea) 47 | 48 | return iou 49 | def clean_string(s): 50 | while s and (s[0] in ":[]()' ."): 51 | s = s[1:] 52 | while s and (s[-1] in ":[]()' ."): 53 | s = s[:-1] 54 | return s 55 | 56 | def convert_if_number(answer): 57 | if isinstance(answer, (int, float)): 58 | return str(answer) 59 | return answer 60 | 61 | def remove_symbols(input_string): 62 | if 'correct answer is:' in input_string: 63 | input_string = input_string.split('correct answer is:')[-1] 64 | cleaned_string = re.sub(r'[\*\n\""]', '', input_string) 65 | return cleaned_string 66 | 67 | def extract_options(text): 68 | 69 | pattern = re.compile(r"\[([^\]]+)\]") 70 | matches = pattern.findall(text) 71 | 72 | if matches: 73 | option_string = matches[-1] 74 | if "'" not in option_string: 75 | option_list = option_string.split(", ") 76 | else: 77 | option_list = [item.strip().strip("'") for item in option_string.split("', '")] 78 | return option_list 79 | return [] 80 | 81 | 82 | def compare_and_count(array_a, 
array_b): 83 | count = 0 84 | for a, b in zip(array_a, array_b): 85 | if a == 1 and b == 1: count+=1 86 | if a > b:count+=1 87 | return count 88 | 89 | def isfile(path): 90 | return os.path.isfile(path) 91 | 92 | 93 | def load_json_data(path): 94 | with open(path,'r',encoding='utf-8') as json_f: 95 | task_data = json.load(json_f) 96 | return task_data 97 | 98 | def save_json_data(path,data): 99 | with open(path,'w',encoding='utf-8') as json_f: 100 | json.dump(data,json_f,ensure_ascii=False,indent=4) 101 | 102 | def Geneal_criterion_QA(third_task_data,MODEL=None): 103 | ques_total_num = 0 104 | right_num = 0 105 | obey_insytruction = 0 106 | for d_ind, sample in enumerate(third_task_data): 107 | reference = sample['reference'] 108 | prediction = sample['prediction'] 109 | for q_ind, pred in enumerate(prediction): 110 | # print(sample['image_path']) 111 | ques_nopath = ''.join(sample['questions'][q_ind].lower().split(';')[1:]) 112 | tips = extract_options(ques_nopath) 113 | # print(tips) 114 | if len(tips)==0: pass 115 | # if len(tips)!=0: print('No tips',sample['image_path']) 116 | # print('No tips',sample['image_path']) 117 | # print(ques_nopath) 118 | pred = remove_symbols(pred) 119 | ques_total_num += 1 120 | clean_pred = clean_string(pred).lower() 121 | options_nums = clean_pred.split("', '") 122 | reference_q_ind = convert_if_number(reference[q_ind]).lower() 123 | if len(options_nums)==1: 124 | if clean_pred in ques_nopath: 125 | obey_insytruction+=1 126 | if clean_pred==reference_q_ind: 127 | right_num+=1 128 | elif reference_q_ind in clean_pred: 129 | ### filter 130 | if reference_q_ind in tips: 131 | tips.remove(reference_q_ind) 132 | if not any(tip in clean_pred for tip in tips): 133 | right_num+=1 134 | return ques_total_num,right_num/ques_total_num,obey_insytruction/ques_total_num,0 135 | 136 | 137 | def Grounding_criterion_QA(third_task_data,MODEL=None): 138 | if MODEL ==None: 139 | print('MODEL Input Lacked') 140 | return -1 141 | resize_model_lists = ["qwen", "internvl", "gemini","DriveMM",'ivl'] 142 | ques_total_num = 0 143 | right_num = 0 144 | loc_union = [] 145 | obey_insytruction = 0 146 | PATTERN = re.compile(r'\[\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*,\s*([^\],]*\d+[^\],]*)\s*\]') 147 | box_num = 0 148 | for d_ind, sample in enumerate(third_task_data): 149 | reference = sample['reference'] 150 | prediction = sample['prediction'] 151 | for q_ind, pred in enumerate(prediction): 152 | # print(sample['image_path']) 153 | ques_total_num += 1 154 | ques_nopath = ''.join(sample['questions'][q_ind].lower().split(';')[1:]) 155 | if 'located in the image?' 
in ques_nopath: 156 | matches = PATTERN.findall(pred) 157 | cleaned_matches = [[float(re.sub(r'[^0-9.]', '', part)) for part in match] for match in matches] 158 | if len(matches)==1: 159 | box_num+=1 160 | obey_insytruction+=1 161 | predict_bbox = cleaned_matches[0] 162 | else: 163 | predict_bbox = [0.0, 0.0, 0.0, 0.0] 164 | 165 | if sum(predict_bbox) <4: 166 | predict_bbox = [x * 1000 for x in predict_bbox] 167 | if any(mn.lower() in MODEL.lower() for mn in resize_model_lists): 168 | bbox_gt = sample['reference'][q_ind] 169 | width,height = sample['dimension'] 170 | bbox_gt = [int(1000*bbox_gt[0]/width), int(1000*bbox_gt[1]/height), int(1000*bbox_gt[2]/width), int(1000*bbox_gt[3]/height)] 171 | elif MODEL =="gemini": 172 | bbox_gt = [bbox_gt[1], bbox_gt[0], bbox_gt[3], bbox_gt[2]] 173 | else: 174 | bbox_gt = sample['reference'][q_ind] 175 | 176 | iou = box_iou(predict_bbox, bbox_gt) 177 | if iou > 0.5: right_num+=1 178 | loc_union.append(iou) 179 | else: 180 | tips = extract_options(ques_nopath) 181 | # if len(tips)==0: 182 | # print('No tips',sample['image_path']) 183 | # print(sample['questions'][q_ind]) 184 | pred = remove_symbols(pred) 185 | clean_pred = clean_string(pred).lower() 186 | options_nums = clean_pred.split("', '") 187 | reference_q_ind = convert_if_number(reference[q_ind]).lower() 188 | if len(options_nums)==1: 189 | if clean_pred in ques_nopath: 190 | obey_insytruction+=1 191 | if clean_pred==reference_q_ind: 192 | right_num+=1 193 | 194 | elif reference_q_ind in clean_pred: 195 | ### filter 196 | if reference_q_ind in tips: 197 | tips.remove(reference_q_ind) 198 | if not any(tip in clean_pred for tip in tips): 199 | right_num+=1 200 | 201 | mean_iou = sum(loc_union)/len(loc_union) 202 | return ques_total_num, right_num/ques_total_num, obey_insytruction/ques_total_num, mean_iou 203 | 204 | 205 | 206 | def Relation_criterion_QA(third_task_data,MODEL=None): 207 | ques_total_num = 0 208 | total_score = 0 209 | obey_insytruction = 0 210 | totol_improve_score = 0 211 | for d_ind, sample in enumerate(third_task_data): 212 | reference = sample['reference'] 213 | prediction = sample['prediction'] 214 | scores_list = [] 215 | for q_ind, pred in enumerate(prediction): 216 | ques_total_num+=1 217 | if 'corresponds to' in pred: 218 | # pattern = r'(?