├── LICENSE ├── README.md └── assets ├── llm4adpipeline.png └── whyllmenhance.png /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome-LLM-for-Autonomous-Driving-Resources 2 | [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)![GitHub stars](https://img.shields.io/github/stars/Thinklab-SJTU/Awesome-LLM4AD?color=yellow) ![GitHub forks](https://img.shields.io/github/forks/Thinklab-SJTU/Awesome-LLM4AD?color=9cf) [![GitHub license](https://img.shields.io/github/license/Thinklab-SJTU/Awesome-LLM4AD)](https://github.com/Thinklab-SJTU/Awesome-LLM4AD/blob/main/LICENSE) 3 | 4 | This is a collection of research papers about **LLM-for-Autonomous-Driving (LLM4AD)**. 5 | The repository will be continuously updated to track the frontier of LLM4AD. *Maintained by SJTU-ReThinklab.* 6 | 7 | Welcome to follow and star! If you find any related materials that could be helpful, feel free to contact us (yangzhenjie@sjtu.edu.cn or jiaxiaosong@sjtu.edu.cn) or make a PR. 8 | 9 | ## Citation 10 | Our survey paper is at https://arxiv.org/abs/2311.01043 which includes more detailed discussions and will be continuously updated. 11 | **The latest version was updated on August 12, 2024.** 12 | 13 | If you find our repo helpful, please consider citing it. 14 | ```BibTeX 15 | @misc{yang2023survey, 16 | title={LLM4Drive: A Survey of Large Language Models for Autonomous Driving}, 17 | author={Zhenjie Yang and Xiaosong Jia and Hongyang Li and Junchi Yan}, 18 | year={2023}, 19 | eprint={2311.01043}, 20 | archivePrefix={arXiv}, 21 | primaryClass={cs.AI} 22 | } 23 | ``` 24 | 25 | 26 | ## Table of Contents 27 | - [Awesome LLM-for-Autonomous-Driving (LLM4AD)](#awesome-llm-for-autonomous-driving-resources) 28 | - [Table of Contents](#table-of-contents) 29 | - [Overview of LLM4AD](#overview-of-llm4ad) 30 | - [ICLR 2024 Open Review](#iclr-2024-open-review) 31 | - [Papers](#papers) 32 | - [Datasets](#datasets) 33 | - [Citation](#citation) 34 | - [License](#license) 35 | 36 | ## Overview of LLM4AD 37 | LLM-for-Autonomous-Driving (LLM4AD) refers to the application of Large Language Models (LLMs) in autonomous driving.
We divide existing works based on the perspective of applying LLMs: planning, perception, question answering, and generation. 38 | 39 | ![image info](./assets/llm4adpipeline.png) 40 | 41 | ## Motivation of LLM4AD 42 | The orange circle represents the ideal level of driving competence, akin to that possessed by an experienced human driver. There are two main methods to acquire such proficiency: one, through learning-based techniques within simulated environments; and two, by learning from offline data through similar methodologies. It is important to note that, due to discrepancies between simulation and the real world, these two domains are not fully the same, i.e., the sim2real gap. Concurrently, offline data serves as a subset of real-world data since it is collected directly from actual surroundings. However, offline data still struggles to cover the full distribution, due to the notorious long-tail nature of autonomous driving tasks. The final goal of autonomous driving is to elevate driving abilities from a basic green stage to a more advanced blue level through extensive data collection and deep learning. 43 | 44 | ![image info](./assets/whyllmenhance.png) 45 | 46 | ## ICLR 2024 Open Review 47 | <details>
48 | <summary>Toggle</summary> 49 | 50 | ``` 51 | format: 52 | - [title](paper link) [links] 53 | - task 54 | - keyword 55 | - code or project page 56 | - datasets or environment or simulator 57 | - summary 58 | ``` 59 | - [Large Language Models as Decision Makers for Autonomous Driving](https://openreview.net/forum?id=NkYCuGM7E2) 60 | - Keywords: Large language model, Autonomous driving 61 | - [Previous summary](#LanguageMPC) 62 | 63 | - [DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model](https://openreview.net/forum?id=DUkYDXqxKp) 64 | - Keywords: Interpretable autonomous driving, large language model, robotics, computer vision 65 | - [Previous summary](#DriveGPT4) 66 | 67 | - [BEV-CLIP: Multi-modal BEV Retrieval Methodology for Complex Scene in Autonomous Driving](https://openreview.net/forum?id=wlqkRFRkYc) 68 | - Keywords: Autonomous Driving, BEV, Retrieval, Multi-modal, LLM, prompt learning 69 | - Task: Contrastive learning, Retrieval tasks 70 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 71 | - Summary: 72 | - Propose a multimodal retrieval method powered by LLM and knowledge graph to achieve contrastive learning between text description and BEV feature retrieval for autonomous driving. 73 | 74 | - [GPT-Driver: Learning to Drive with GPT](https://openreview.net/forum?id=SXMTK2eltf) 75 | - Keywords: Motion Planning, Autonomous Driving, Large Language Models (LLMs), GPT 76 | - [Previous summary](#GPT-Driver) 77 | 78 | - [Radar Spectra-language Model for Automotive Scene Parsing](https://openreview.net/forum?id=bdJaYLiOxi) 79 | - Keywords: radar spectra, radar perception, radar object detection, free space segmentation, autonomous driving, radar classification 80 | - Task: Detection 81 | - Datasets: 82 | - [RADIal](https://github.com/valeoai/RADIal), [CRUW](https://www.cruwdataset.org/), [nuScenes](https://www.nuscenes.org/nuscenes) 83 | - For RADIal and CRUW, both images and ground truth labels are used. From nuScenes, only images are taken. 84 | - Random captions are generated for CRUW frames based on ground-truth object positions and pseudo-ground-truth classes (not publicly released). 85 | - Summary: 86 | - Conduct a benchmark comparison of off-the-shelf vision-language models (VLMs) for classification in automotive scenes. 87 | - Propose to fine-tune a large VLM specifically for automated driving scenes. 88 | 89 | - [GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation](https://openreview.net/forum?id=xBfQZWeDRH) 90 | - Keywords: diffusion model, controllable generation, object detection, autonomous driving 91 | - Task: Detection, Data generation 92 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes), which consists of 60K training samples and 15K validation samples with high-quality bounding box annotations from 10 semantic classes. 93 | - Summary: 94 | - Propose GeoDiffusion, an embarrassingly simple framework to integrate geometric controls into pre-trained diffusion models for detection data generation via text prompts. 95 | 96 | - [SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving](https://openreview.net/forum?id=9zEBK3E9bX) 97 | - Keywords: 3D pre-training, object detection, autonomous driving 98 | - Task: Detection 99 | - Summary: 100 | - Propose SPOT, a scalable 3D pre-training paradigm for LiDAR pre-training.
101 | 102 | - [3D DENSE CAPTIONING BEYOND NOUNS: A MIDDLEWARE FOR AUTONOMOUS DRIVING](https://openreview.net/forum?id=8T7m27VC3S) 103 | - Keywords: Autonomous Driving, Dense Captioning, Foundation model 104 | - Task: Caption, Dataset Construction 105 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 106 | - Summary: 107 | - Design a scalable rule-based auto-labelling methodology to generate 3D dense captions. 108 | - Construct nuDesign, a large-scale dataset built upon nuScenes, which consists of an unprecedented 2300k sentences. 109 | </details>
110 | 111 | 112 | ## Papers 113 | <details>
114 | <summary>Toggle</summary> 115 | 116 | ``` 117 | format: 118 | - [title](paper link) [links] 119 | - author1, author2, and author3... 120 | - publisher 121 | - task 122 | - keyword 123 | - code or project page 124 | - datasets or environment or simulator 125 | - publish date 126 | - summary 127 | - metrics 128 | ``` 129 | - [LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models](https://arxiv.org/abs/2409.10066) 130 | - Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue **ASE 2024** 131 | - Publisher: University of Science and Technology of China, Kyushu University, Zhejiang Sci-Tech University 132 | - Task: Generation 133 | - Publish Date: 2024.09.16 134 | - Code: [LeGEND](https://github.com/MayDGT/LeGEND) 135 | - Summary: 136 | - LeGEND, a top-down scenario generation approach that can achieve both criticality and diversity of scenarios. 137 | - Devise a two-stage transformation from accident reports to logical scenarios by using an intermediate language; accordingly, LeGEND involves two LLMs, each in charge of a different stage. 138 | - Implement LeGEND and demonstrate its effectiveness on Apollo, detecting 11 types of critical concrete scenarios that reflect different aspects of system defects. 139 | 140 | - [MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving](https://arxiv.org/abs/2409.07267) 141 | - Enming Zhang, Xingyuan Dai, Yisheng Lv, Qinghai Miao 142 | - Publisher: University of Chinese Academy of Sciences, CASIA 143 | - Task: QA 144 | - Publish Date: 2024.09.14 145 | - Code: [MiniDrive](https://github.com/EMZucas/minidrive) 146 | - Summary: 147 | - MiniDrive addresses the challenges of efficient deployment and real-time response in VLMs for autonomous driving systems. It can be fully trained on a single 148 | RTX 4090 GPU with 24GB of memory. 149 | - Feature Engineering Mixture of Experts (FE-MoE) addresses the challenge of efficiently encoding 2D features from multiple perspectives into text token embeddings, effectively reducing the number of visual feature tokens and minimizing feature redundancy. 150 | - A Dynamic Instruction Adapter, built on a residual structure, addresses the problem of fixed visual tokens for the same image before they are input into the language model. 151 | 152 | - [OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving](https://arxiv.org/abs/2409.03272) 153 | - Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding 154 | - Publisher: Fudan University, Tsinghua University 155 | - Task: Perception (Occ) + Reasoning 156 | - Publish Date: 2024.09.05 157 | - Summary: 158 | - OccLLaMA, a unified 3D occupancy-language-action generative world model, which unifies VLA-related tasks including but not limited to scene understanding, planning, and 4D occupancy forecasting. 159 | - A novel scene tokenizer (VQVAE-like architecture) that efficiently discretizes and reconstructs Occ scenes while accounting for sparsity and class imbalance.
160 | 161 | - [ContextVLM: Zero-Shot and Few-Shot Context Understanding for Autonomous Driving using Vision Language Models](https://arxiv.org/abs/2409.00301) 162 | - Shounak Sural, Naren, Ragunathan Rajkumar **ITSC 2024** 163 | - Publisher: Carnegie Mellon University 164 | - Task: Context Recognition 165 | - Code: [ContextVLM](https://github.com/ssuralcmu/ContextVLM) 166 | - Publish Date: 2024.08.30 167 | - Summary: 168 | - DrivingContexts, a large publicly available dataset with a combination of hand-annotated and machine-annotated labels to improve VLMs for better context recognition. 169 | - ContextVLM uses vision-language models to detect contexts using zero- and few-shot approaches. 170 | 171 | - [DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving](https://arxiv.org/abs/2408.16647) 172 | - Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo **IAVVC 2024** 173 | - Publisher: Columbia University 174 | - Task: Generation 175 | - Dataset: [Waymo open dataset](https://waymo.com/open/) 176 | - Publish Date: 2024.08.29 177 | - Summary: 178 | - DriveGenVLM employs a video generation framework based on Denoising Diffusion Probabilistic Models to create realistic video sequences that mimic real-world dynamics. 179 | - The videos generated are then evaluated for their suitability in Visual Language Models (VLMs) using a pre-trained model called Efficient In-context Learning on Egocentric Videos (EILEV). 180 | 181 | - [Edge-Cloud Collaborative Motion Planning for Autonomous Driving with Large Language Models](https://arxiv.org/abs/2408.09972) 182 | - Jiao Chen, Suyan Dai, Fangfang Chen, Zuohong Lv, Jianhua Tang 183 | - Publisher: South China University of Technology, Pazhou Lab 184 | - Task: Planning + QA 185 | - Project Page: [EC-Drive](https://sites.google.com/view/ec-drive) 186 | - Publish Date: 2024.08.19 187 | - Summary: 188 | - EC-Drive, a novel edge-cloud collaborative autonomous driving system. 189 | 190 | - [V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models](https://arxiv.org/abs/2408.09251) 191 | - Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran 192 | - Publisher: University of Wisconsin-Madison, Nanyang Technological University, Texas A&M University, Cornell University 193 | - Task: Planning 194 | - Project Page: [V2X-VLM](https://zilin-huang.github.io/V2X-VLM-website/) 195 | - Code: [V2X-VLM](https://github.com/zilin-huang/V2X-VLM) 196 | - Dataset: [DAIR-V2X](https://github.com/AIR-THU/DAIR-V2X) 197 | - Publish Date: 2024.08.09 198 | - Summary: 199 | - V2X-VLM, a large vision-language model empowered E2E VICAD framework, which improves the ability of autonomous vehicles to navigate complex traffic scenarios through advanced multimodal understanding and decision-making. 200 | - A contrastive learning technique is employed to refine the model’s ability to distinguish between relevant and irrelevant features, which ensures that the model learns robust and discriminative representations of specific driving environments, leading to improved accuracy in trajectory planning in V2X cooperation scenarios (a generic sketch of such a contrastive objective is given below).
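The contrastive objective mentioned in the V2X-VLM summary is not reproduced here from the authors' code; the following is a minimal, hypothetical sketch of a CLIP-style InfoNCE loss between paired scene and text embeddings, intended only to illustrate the general technique (all names and the batch-wise negative scheme are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of a CLIP-style contrastive objective; NOT the V2X-VLM code.
import torch
import torch.nn.functional as F

def contrastive_loss(scene_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (scene, text) embeddings."""
    scene_emb = F.normalize(scene_emb, dim=-1)   # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)     # (B, D)
    logits = scene_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(scene_emb.size(0), device=scene_emb.device)
    # Matching pairs sit on the diagonal; all other batch entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```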
201 | 202 | - [VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving](https://arxiv.org/abs/2408.04821) 203 | - Keke Long, Haotian Shi, Jiaxi Liu, Xiaopeng Li 204 | - Publisher: University of Wisconsin-Madison 205 | - Task: Planning 206 | - Publish Date: 2024.08.04 207 | - Summary: 208 | - It proposes a closed-loop autonomous driving controller that applies VLMs for high-level vehicle control. 209 | - The upper-level VLM uses the vehicle's front camera images, textual scenario description, and experience memory as inputs to generate control parameters needed by the lower-level MPC. 210 | - The lower-level MPC utilizes these parameters, considering vehicle dynamics with engine lag, to achieve realistic vehicle behavior and provide state feedback to the upper level. 211 | - This asynchronous two-layer structure addresses the current issue of slow VLM response speeds. 212 | 213 | - [SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving](https://arxiv.org/abs/2407.21293) 214 | - Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu **IEIT Systems** 215 | - Publisher: IEIT Systems 216 | - Task: QA 217 | - Publish Date: 2024.07.31 218 | - Summary: 219 | - SimpleLLM4AD reimagines the traditional autonomous driving pipeline by structuring the task into four interconnected stages: perception, prediction, planning, and behavior. 220 | - Each stage is framed as a series of visual question answering (VQA) pairs, which are interlinked to form a Graph VQA (GVQA). This graph-based structure allows the system to reason about each VQA pair systematically, ensuring a coherent flow of information and decision-making from perception to action. 221 | 222 | - [Testing Large Language Models on Driving Theory Knowledge and Skills for Connected Autonomous Vehicles](https://arxiv.org/abs/2407.17211) 223 | - Zuoyin Tang, Jianhua He, Dashuai Pei, Kezhong Liu, Tao Gao 224 | - Publisher: Aston University, Essex University, Wuhan University of Technology, Chang’an University 225 | - Task: Evaluation 226 | - Publish Date: 2024.07.24 227 | - Data: [UK Driving Theory Test Practice Questions and Answers](https://www.drivinginstructorwebsites.co.uk/uk-driving-theory-test-practice-questions-and-answers) 228 | - Summary: 229 | - Design and run driving theory tests for several proprietary LLMs (OpenAI GPT models, Baidu Ernie, and Ali QWen) and open-source LLMs (Tsinghua MiniCPM-2B and MiniCPM-Llama3-V2.5) with more than 500 multiple-choice theory test questions. 230 | 231 | - [KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models](https://arxiv.org/abs/2407.14239) 232 | - Kemou Jiang, Xuan Cai, Zhiyong Cui, Aoyong Li, Yilong Ren, Haiyang Yu, Hao Yang, Daocheng Fu, Licheng Wen, Pinlong Cai 233 | - Publisher: Beihang University, Johns Hopkins University, Shanghai Artificial Intelligence Laboratory 234 | - Task: Multi-Agent Planning 235 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 236 | - Project Page: [KoMA](https://jkmhhh.github.io/KoMA/) 237 | - Publish Date: 2024.07.19 238 | - Summary: 239 | - Introduce a knowledge-driven autonomous driving framework KoMA that incorporates multiple agents empowered by LLMs, comprising five integral modules: Environment, Multi-agent Interaction, Multi-step Planning, Shared Memory, and Ranking-based Reflection.
240 | 241 | - [WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning](https://arxiv.org/abs/2407.04281) 242 | - Yiheng Li, Chongjian Ge, Chenran Li, Chenfeng Xu, Masayoshi Tomizuka, Chen Tang, Mingyu Ding, Wei Zhan 243 | - Publisher: UC Berkeley, UT Austin 244 | - Task: Dataset + Reasoning 245 | - Publish Date: 2024.07.05 246 | - Datasets: [WOMD-Reasoning](https://waymo.com/open/download) 247 | - Summary: 248 | - WOMD-Reasoning, a language dataset centered on interaction descriptions and reasoning. It provides extensive insights into critical but previously overlooked interactions induced by traffic rules and human intentions. 249 | - Develop an automatic language labeling pipeline, leveraging a rule-based translator to interpret motion data into language descriptions, and a set of manual prompts for ChatGPT to generate Q&A pairs. 250 | 251 | - [Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction](https://ieeexplore.ieee.org/document/10568360) 252 | - Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani **IEEE TIV 2024** 253 | - Publisher: Tohoku University, RIKEN Center for AIP, DENSO CORPORATION 254 | - Task: Prediction 255 | - Code: [DHPR](https://github.com/DHPR-dataset/DHPR-dataset) 256 | - Publish Date: 2024.06.21 257 | - Summary: 258 | - The DHPR (Driving Hazard Prediction and Reasoning) dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene. 259 | - Present several baseline methods and evaluate their performance. 260 | 261 | - [Asynchronous Large Language Model Enhanced Planner for Autonomous Driving](https://arxiv.org/abs/2406.14556) 262 | - Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu **ECCV 2024** 263 | - Publisher: Beihang University, Tsinghua University 264 | - Task: Planning 265 | - Publish Date: 2024.06.20 266 | - Code: [AsyncDriver](https://github.com/memberRE/AsyncDriver) 267 | - Datasets: [nuPlan Closed-Loop Reactive Hard20](https://www.nuscenes.org/nuplan) 268 | - Summary: 269 | - AsyncDriver, a novel asynchronous LLM-enhanced framework, in which the inference frequency of the LLM is controllable and can be decoupled from that of the real-time planner. 270 | - Adaptive Injection Block, which is model-agnostic and can easily integrate scene-associated instruction features into any transformer-based 271 | real-time planner, enhancing its ability to comprehend and follow a series of language-based routing instructions. 272 | - Compared with existing methods, AsyncDriver demonstrates superior closed-loop evaluation performance in nuPlan’s challenging scenarios. 273 | 274 | - [A Superalignment Framework in Autonomous Driving with Large Language Models](https://arxiv.org/abs/2406.05651) 275 | - Xiangrui Kong, Thomas Braunl, Marco Fahmi, Yue Wang 276 | - Publisher: University of Western Australia, Queensland Government, Brisbane, Queensland University of Technology 277 | - Task: QA 278 | - Publish Date: 2024.06.09 279 | - Summary: 280 | - Propose a secure interaction framework for LLMs that effectively audits the data exchanged with cloud-based LLMs. 281 | - Analyze 11 autonomous driving methods based on large language models, covering driving safety, token usage, privacy, and consistency with human values.
282 | - Evaluate the effectiveness of driving prompts on the nuScenesQA dataset and compare results between the gpt-35-turbo and llama2-70b LLM backbones. 283 | 284 | - [PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning](https://arxiv.org/abs/2406.01587) 285 | - Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen, Dongbin Zhao 286 | - Publisher: Chinese Academy of Sciences, Beijing University of Posts and Telecommunications, Beihang University, Tsinghua University, Li Auto 287 | - Task: Planning 288 | - Publish Date: 2024.06.04 289 | - Summary: 290 | - PlanAgent is the first closed-loop mid-to-mid (BEV input, no raw sensor data) autonomous driving planning agent system based on a Multi-modal Large Language Model. 291 | - Propose an efficient Environment Transformation module that extracts multi-modal information inputs with a lane-graph representation. 292 | - Design a Reasoning Engine module that introduces a hierarchical chain-of-thought (CoT) to instruct the MLLM to generate planner code, and a Reflection module that combines simulation and scoring to filter out unreasonable proposals generated by the MLLM. 293 | 294 | - [ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles](https://arxiv.org/abs/2405.14062) 295 | - Jiawei Zhang, Chejian Xu, Bo Li **CVPR 2024** 296 | - Publisher: UIUC, UChicago 297 | - Task: Scenario Generation 298 | - Env: [Carla](https://github.com/carla-simulator) 299 | - Code: [ChatScene](https://github.com/javyduck/ChatScene) 300 | - Publish Date: 2024.05.22 301 | - Summary: 302 | - ChatScene, a novel LLM-based agent capable of generating safety-critical scenarios by first providing textual descriptions and then carefully transforming them into executable simulations in CARLA via the Scenic programming language. 303 | - An expansive retrieval database of Scenic code snippets has been developed. It catalogs diverse adversarial behaviors and traffic configurations, utilizing the rich knowledge stored in LLMs, which significantly augments the variety and critical nature of the driving scenarios generated. 304 | 305 | - [Probing Multimodal LLMs as World Models for Driving](https://arxiv.org/abs/2405.05956) 306 | - Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus 307 | - Publisher: MIT CSAIL, TRI, MIT LID 308 | - Task: Benchmark & Evaluation 309 | - Code: [DriveSim](https://github.com/sreeramsa/DriveSim) 310 | - Publish Date: 2024.05.09 311 | - Summary: 312 | - A comprehensive experimental study to evaluate the capability of different MLLMs to reason about and understand scenarios involving closed-loop driving and decision-making. 313 | - DriveSim, a specialized simulator designed to generate a diverse array of driving scenarios, thereby providing a platform to test and evaluate/benchmark the capabilities of MLLMs in understanding and reasoning about real-world driving scenes from a fixed in-car camera perspective, the same as the driver's viewpoint. 314 | 315 | - [OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception Reasoning and Planning](https://arxiv.org/abs/2405.01533) 316 | - Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M.
Alvarez 317 | - Publisher: Beijing Inst of Tech, NVIDIA, Huazhong Univ of Sci and Tech 318 | - Task: Benchmark & Planning 319 | - Publish Date: 2024.05.02 320 | - Code: [OmniDrive](https://github.com/NVlabs/OmniDrive) 321 | - Summary: 322 | - OmniDrive, a holistic framework for strong alignment between agent models and 3D driving tasks. 323 | - Propose a new benchmark with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. 324 | 325 | - [REvolve: Reward Evolution with Large Language Models for Autonomous Driving](https://arxiv.org/abs/2406.01309) 326 | - Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, Pedro Zuidberg Dos Martires 327 | - Publisher: Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden 328 | - Task: Reward Generation 329 | - Env: [AirSim](https://github.com/microsoft/AirSim?tab=readme-ov-file) 330 | - Project Page: [REvolve](https://rishihazra.github.io/REvolve/) 331 | - Publish Date: 2024.04.09 332 | - Summary: 333 | - REvolve (Reward Evolution), a novel evolutionary framework using LLMs, specifically GPT-4, to output reward functions (as executable Python code) for AD and evolve them based on human feedback. 334 | 335 | - [AGENTSCODRIVER: Large Language Model Empowered Collaborative Driving with Lifelong Learning](https://arxiv.org/pdf/2404.06345.pdf) 336 | - Senkang Hu, Zhengru Fang, Zihan Fang, Xianhao Chen, Yuguang Fang 337 | - Publisher: City University of Hong Kong, The University of Hong Kong 338 | - Task: Planning (multi-vehicle collaborative) 339 | - Publish Date: 2024.04.09 340 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 341 | - Summary: 342 | - AGENTSCODRIVER, an LLM-powered multi-vehicle collaborative driving framework with lifelong learning, which allows different driving agents to communicate with each other and collaboratively drive in complex traffic scenarios. 343 | - It features a reasoning engine, cognitive memory, reinforcement reflection, and a communication module. 344 | 345 | - [Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving](https://arxiv.org/abs/2403.19838) 346 | - Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi 347 | - Publisher: UCSD 348 | - Task: QA 349 | - Publish Date: 2024.03.28 350 | - Code: [official](https://github.com/akshaygopalkr/EM-VLM4AD) 351 | - Datasets: [DriveLM](https://github.com/OpenDriveLab/DriveLM) 352 | - Summary: 353 | - EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving. 354 | - EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations, while also achieving higher BLEU-4, METEOR, CIDEr, and ROUGE scores than the existing baseline on the DriveLM dataset. 355 | 356 | - [LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models](https://arxiv.org/abs/2403.18344) 357 | - Mingxing Peng, Xusen Guo, Xianda Chen, Meixin Zhu, Kehua Chen, Hao (Frank) Yang, Xuesong Wang, Yinhai Wang 358 | - Publisher: The Hong Kong University of Science and Technology, Johns Hopkins University, Tongji University, STAR Lab 359 | - Task: Trajectory Prediction 360 | - Publish Date: 2024.03.27 361 | - Datasets: [highD](https://levelxdata.com/highd-dataset/) 362 | - Summary: 363 | - LC-LLM, the first Large Language Model for lane change prediction.
It leverages the powerful capabilities of LLMs to understand complex interactive scenarios, enhancing the performance of lane change prediction. 364 | - LC-LLM achieves explainable predictions. It not only predicts lane change intentions and trajectories but also generates explanations for the prediction results. 365 | 366 | - [AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving](https://arxiv.org/abs/2403.17373) 367 | - Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker 368 | - Publisher: Northwestern University, NEC Laboratories America, Rutgers University, UC San Diego 369 | - Publish Date: 2024.03.26 370 | - Task: Object Detection 371 | - Datasets: [Mapillary](https://www.mapillary.com/dataset/vistas), [Cityscapes](https://www.cityscapes-dataset.com/), [nuImages](https://www.nuscenes.org/nuimages), [BDD100k](https://www.vis.xyz/bdd100k/), [Waymo](https://waymo.com/open/), [KITTI](https://www.cvlibs.net/datasets/kitti/) 372 | - Summary: 373 | - An Automatic Data Engine (AIDE) that can automatically identify issues, efficiently curate data, improve the model using auto-labeling, and verify the model through diverse generated scenarios. 374 | 375 | - [Engineering Safety Requirements for Autonomous Driving with Large Language Models](https://arxiv.org/abs/2403.16289) 376 | - Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Hȧkan Sivencrona, Christian Berger 377 | - Publisher: Chalmers University of Technology, University of Gothenburg, Volvo Cars, Zenseact 378 | - Task: QA 379 | - Publish Date: 2024.03.24 380 | - Summary: 381 | - Propose a prototype of a pipeline of prompts and LLMs that receives an item definition and outputs solutions in the form of safety requirements. 382 | 383 | - [LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving](https://arxiv.org/abs/2403.20116) 384 | - Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, K. Madhava Krishna 385 | - Publisher: The International Institute of Information Technology, Hyderabad, University of Tartu, Estonia 386 | - Project Page: [LeGo-Drive](https://reachpranjal.github.io/lego-drive/) 387 | - Code: [LeGo-Drive](https://github.com/reachpranjal/lego-drive) 388 | - Env: [Carla](https://github.com/carla-simulator) 389 | - Task: Trajectory Prediction 390 | - Publish Date: 2024.03.20 391 | - Summary: 392 | - A novel planning-guided end-to-end LLM-based goal point navigation solution that predicts and improves the desired state by dynamically interacting with the 393 | environment and generating a collision-free trajectory. 394 | 395 | - [Hybrid Reasoning Based on Large Language Models for Autonomous Car Driving](https://arxiv.org/abs/2402.13602v3) 396 | - Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, Achim Rettberg 397 | - Publisher: Univ. of Applied Science Hamm-Lippstadt, University of Stuttgart 398 | - Publish Date: 2024.03.18 399 | - Task: Reasoning 400 | - Env: [Carla](https://github.com/carla-simulator) 401 | - Summary: 402 | - Combines arithmetic and commonsense reasoning, utilizing the objects detected by YOLOv8. 403 | - The "location of the object," "speed of our car," "distance to the object," and "our car’s direction" are fed into the large language model for mathematical calculations within CARLA. 404 | - Based on these calculations, which also account for weather conditions, precise control values for brake and speed are generated (a generic sketch of such a prompting step is shown below).
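As a rough illustration of the kind of pipeline the Hybrid Reasoning summary describes (detections and ego state serialized into a prompt, numeric control values parsed from the LLM reply), here is a hedged, hypothetical sketch; `query_llm` is a placeholder for whatever chat-completion API is used, and the prompt and JSON schema are invented for illustration, not taken from the paper:

```python
# Hypothetical detection-to-LLM control prompt; not the paper's implementation.
import json
import re
from typing import Callable

def llm_control_step(detections: list[dict], ego_speed_mps: float, heading_deg: float,
                     query_llm: Callable[[str], str]) -> tuple[float, float]:
    """Serialize scene facts into a prompt and parse brake/target-speed values from the reply."""
    prompt = (
        "You are a driving assistant. Given the scene, return JSON "
        '{"brake": 0..1, "target_speed_mps": float}.\n'
        f"Ego speed: {ego_speed_mps:.1f} m/s, heading: {heading_deg:.0f} deg.\n"
        f"Detected objects: {json.dumps(detections)}"
    )
    reply = query_llm(prompt)
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate extra prose around the JSON
    data = json.loads(match.group(0)) if match else {"brake": 1.0, "target_speed_mps": 0.0}
    return float(data["brake"]), float(data["target_speed_mps"])
```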
405 | 406 | - [Large Language Models Powered Context-aware Motion Prediction](https://arxiv.org/pdf/2403.11057.pdf) 407 | - Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, Jiangtao Gong 408 | - Publisher: Tsinghua University 409 | - Task: Motion Prediction 410 | - Publish Date: 2024.03.17 411 | - Dataset: [WOMD](https://github.com/waymo-research/waymo-open-dataset) 412 | - Summary: 413 | - Design and conduct prompt engineering to enable a GPT-4V model, without fine-tuning, to comprehend complex traffic scenarios. 414 | - Introduce a novel approach that combines the context information output by GPT-4V with [MTR](https://arxiv.org/abs/2209.13508). 415 | 416 | - [Generalized Predictive Model for Autonomous Driving](https://arxiv.org/abs/2403.09630) 417 | - Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li **ECCV 2024** 418 | - Publisher: OpenDriveLab and Shanghai AI Lab, Hong Kong University of Science and Technology, University of Hong Kong, University of Tubingen, Tubingen AI Center 419 | - Task: Datasets + Generation 420 | - Code: [DriveAGI](https://github.com/OpenDriveLab/DriveAGI) 421 | - Publish Date: 2024.03.14 422 | - Summary: 423 | - Introduce the first large-scale video prediction model in the autonomous driving discipline. 424 | - The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. 425 | - GenAD, inheriting the merits from recent latent diffusion models, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. 426 | 427 | - [LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban Environments](https://arxiv.org/abs/2403.08337) 428 | - Maonan Wang, Aoyu Pang, Yuheng Kan, Man-On Pun, Chung Shue Chen, Bo Huang 429 | - Publisher: The Chinese University of Hong Kong, Shanghai AI Laboratory, SenseTime Group Limited, Nokia Bell Labs 430 | - Publish Date: 2024.03.13 431 | - Task: Traffic Signal Control 432 | - Code: [LLM-Assisted-Light](https://github.com/Traffic-Alpha/LLM-Assisted-Light) 433 | - Summary: 434 | - LA-Light, a hybrid TSC framework that integrates the human-mimetic reasoning capabilities of LLMs, enabling the signal control algorithm to interpret and respond to complex traffic scenarios with the nuanced judgment typical of human cognition. 435 | - A closed-loop traffic signal control system has been developed, integrating LLMs with a comprehensive suite of interoperable tools. 436 | 437 | - [DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation](https://arxiv.org/abs/2403.06845) 438 | - Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang 439 | - Publisher: Institute of Automation, Chinese Academy of Sciences, GigaAI 440 | - Publish Date: 2024.03.11 441 | - Task: Generation 442 | - Project: [DriveDreamer-2](https://drivedreamer2.github.io) 443 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 444 | - Summary: 445 | - DriveDreamer-2, which builds upon the framework of [DriveDreamer](#DriveDreamer) and incorporates a Large Language Model (LLM) to generate user-defined driving videos. 446 | - UniMVM (Unified Multi-View Model) enhances temporal and spatial coherence in the generated driving videos.
447 | - The HDMap generator ensures that the background elements do not conflict with the foreground trajectories. 448 | - Utilize the constructed text-to-script dataset to fine-tune GPT-3.5 into an LLM with specialized trajectory generation knowledge. 449 | 450 | - [Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents](https://arxiv.org/abs/2402.05746) 451 | - Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, Yanfeng Wang 452 | - Publisher: Shanghai Jiao Tong University, Shanghai AI Laboratory, Carnegie Mellon University, Tsinghua University 453 | - Publish Date: 2024.03.11 454 | - Task: Generation 455 | - Code: [ChatSim](https://github.com/yifanlu0227/ChatSim) 456 | - Datasets: [Waymo](https://waymo.com/open/) 457 | - Summary: 458 | - ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. 459 | - McNeRF, a novel neural radiance field method that incorporates multi-camera inputs, offering a broader scene rendering. It helps generate photo-realistic outcomes. 460 | - McLight, a novel multi-camera lighting estimation that blends skydome and surrounding lighting, enabling external digital assets to be rendered with realistic textures and materials. 461 | 462 | - [Embodied Understanding of Driving Scenarios](https://arxiv.org/abs/2403.04593) 463 | - Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li **ECCV 2024** 464 | - Publisher: Shanghai AI Lab, Shanghai Jiao Tong University, University of California, Riverside 465 | - Publish Date: 2024.03.07 466 | - Task: Benchmark & Scene Understanding 467 | - Code: [ELM](https://github.com/OpenDriveLab/ELM) 468 | - Summary: 469 | - ELM is an embodied language model for understanding long-horizon driving scenarios in space and time. 470 | 471 | - [DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models](https://arxiv.org/abs/2402.12289) 472 | - Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao 473 | - Publisher: IIIS, Tsinghua University, Li Auto 474 | - Publish Date: 2024.02.25 475 | - Task: Scene Understanding + Planning 476 | - Project: [DriveVLM](https://tsinghua-mars-lab.github.io/DriveVLM/) 477 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes), SUP-AD 478 | - Summary: 479 | - DriveVLM, a novel autonomous driving system that leverages VLMs for effective scene understanding and planning. 480 | - DriveVLM-Dual, a hybrid system that incorporates DriveVLM and a traditional autonomous pipeline. 481 | 482 | - [GenAD: Generative End-to-End Autonomous Driving](https://arxiv.org/abs/2402.11502) 483 | - Wenzhao Zheng, Ruiqi Song, Xianda Guo, Long Chen **ECCV 2024** 484 | - Publisher: University of California, Berkeley, Waytous, Institute of Automation, Chinese Academy of Sciences 485 | - Publish Date: 2024.02.20 486 | - Task: Generation 487 | - Code: [GenAD](https://github.com/wzzheng/GenAD) 488 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 489 | - Summary: 490 | - GenAD models autonomous driving as a trajectory generation problem to unleash the full potential of end-to-end methods. 491 | - Propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens.
492 | - Employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling and adopt a temporal model to capture the agent and ego movements in the latent space to generate more effective future trajectories. 493 | 494 | - [RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model](https://arxiv.org/abs/2402.10828) 495 | - Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd 496 | - Publisher: University of Oxford, Beijing Academy of Artificial Intelligence 497 | - Publish Date: 2024.02.16 498 | - Task: Explainable Driving 499 | - Project: [RAG-Driver](https://yuanjianhao508.github.io/RAG-Driver/) 500 | - Summary: 501 | - RAG-Driver is a Multi-Modal Large Language Model with Retrieval-augmented In-context Learning capacity designed for generalisable and explainable end-to-end driving with strong zero-shot generalisation capacity. 502 | - Achieve state-of-the-art action explanation and justification performance on both the BDD-X (in-distribution) and SAX (out-of-distribution) benchmarks. 503 | 504 | - [Driving Everywhere with Large Language Model Policy Adaptation](https://arxiv.org/abs/2402.05932) 505 | - Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, Marco Pavone **CVPR 2024** 506 | - Publisher: NVIDIA, University of Southern California, University of Washington, Stanford University 507 | - Publish Date: 2024.02.08 508 | - Task: Planning 509 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 510 | - Project: [LLaDA](https://boyiliee.github.io/llada/) 511 | - Summary: 512 | - LLaDA is a training-free mechanism to assist human drivers and adapt autonomous driving policies to new environments. 513 | - Traffic Rule Extractor (TRE), which aims to organize and filter the inputs (initial plan + unique traffic code) and feed the output into the frozen LLM to obtain the final new plan. 514 | - LLaDA sets GPT-4 as the default LLM. 515 | 516 | - [LimSim++](https://arxiv.org/abs/2402.01246) 517 | - Daocheng Fu, Wenjie Lei, Licheng Wen, Pinlong Cai, Song Mao, Min Dou, Botian Shi, Yu Qiao 518 | - Publisher: Shanghai Artificial Intelligence Laboratory, Zhejiang University 519 | - Publish Date: 2024.02.02 520 | - Project: [LimSim++](https://pjlab-adg.github.io/limsim_plus/) 521 | - Summary: 522 | - LimSim++, an extended version of LimSim designed for the application of (M)LLMs in autonomous driving. 523 | - Introduce a baseline (M)LLM-driven framework, systematically validated through quantitative experiments across diverse scenarios. 524 | 525 | - [LangProp: A code optimization framework using Language Models applied to driving](https://openreview.net/forum?id=UgTrngiN16) 526 | - Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F. Henriques, Anthony Hu 527 | - Publisher: Wayve Technologies, Visual Geometry Group, University of Oxford 528 | - Publish Date: 2024.01.18 529 | - Task: Code generation, Planning 530 | - Code: [LangProp](https://github.com/shuishida/LangProp) 531 | - Env: [CARLA](https://github.com/carla-simulator) 532 | - Summary: 533 | - LangProp is a framework for iteratively optimizing code generated by large language models (LLMs) in a supervised/reinforcement learning setting. 534 | - Use LangProp in CARLA to generate driving code based on the state of the scene (a generic sketch of such an LLM-driven refinement loop follows below).
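LangProp's actual interface is documented in its repository; purely as a conceptual sketch of the generate-evaluate-refine loop such frameworks implement, one could imagine something like the following, where the `llm` callable and the scoring scheme are placeholders rather than LangProp's API:

```python
# Conceptual sketch of an LLM-driven code-refinement loop; not LangProp's actual API.
from typing import Callable

def refine_policy_code(llm: Callable[[str], str],
                       score: Callable[[str], float],
                       task_description: str,
                       iterations: int = 5) -> str:
    """Iteratively ask an LLM for policy code and keep the highest-scoring candidate."""
    best_code, best_score = "", float("-inf")
    prompt = f"Write a Python function `drive(scene)` for: {task_description}"
    for _ in range(iterations):
        candidate = llm(prompt)
        s = score(candidate)  # e.g., driving score measured on recorded training cases
        if s > best_score:
            best_code, best_score = candidate, s
        # Feed the evaluation result back so the next attempt can improve on failures.
        prompt = (f"{task_description}\nPrevious attempt scored {s:.3f}:\n{candidate}\n"
                  "Improve the function and return only code.")
    return best_code
```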
535 | 536 | - [VLP: Vision Language Planning for Autonomous Driving](https://arxiv.org/abs/2401.05577) 537 | - Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren **CVPR 2024** 538 | - Publisher: Syracuse University, Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI) 539 | - Publish Date: 2024.01.14 540 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 541 | - Summary: 542 | - Propose VLP, a Vision Language Planning model, which is composed of novel components ALP and SLP, aiming to improve the ADS from self-driving BEV reasoning and self-driving decision-making aspects, respectively. 543 | - ALP (agent-wise learning paradigm) aligns the produced BEV with a true bird’s-eye-view map. 544 | - SLP (self-driving-car-centric learning paradigm) aligns the ego-vehicle query feature with the ego-vehicle textual planning feature. 545 | 546 | - [DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving](https://arxiv.org/abs/2401.03641) 547 | - Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, and Jianbing Shen 548 | - Publisher: SKL-IOTSC, CIS, University of Macau 549 | - Publish Date: 2024.01.08 550 | - Summary: 551 | - DME-Driver = Decision-Maker + Executor + CL 552 | - The Executor network, which is based on UniAD, incorporates textual information for the OccFormer and the Planning module. 553 | - The Decision-Maker, which is based on LLaVA, processes inputs from three different modalities: visual inputs from the current and previous scenes, textual inputs in the form of prompts, and current status information detailing the vehicle’s operating state. 554 | - CL is a consistency loss mechanism, slightly reducing performance metrics but significantly enhancing decision alignment between Executor and Decision-Maker. 555 | 556 | - [AccidentGPT: Accident Analysis and Prevention from V2X Environmental Perception with Multi-modal Large Model](https://arxiv.org/abs/2312.13156) 557 | - Lening Wang, Yilong Ren, Han Jiang, Pinlong Cai, Daocheng Fu, Tianqi Wang, Zhiyong Cui, Haiyang Yu, Xuesong Wang, Hanchu Zhou, Helai Huang, Yinhai Wang 558 | - Publisher: Beihang University, Shanghai Artificial Intelligence Laboratory, The University of Hong Kong, Zhongguancun Laboratory, Tongji University, Central South University, University of Washington, Seattle 559 | - Publish Date: 2023.12.29 560 | - Project page: [AccidentGPT](https://accidentgpt.github.io) 561 | - Summary: 562 | - AccidentGPT, a comprehensive accident analysis and prevention multi-modal large model. 563 | - Integrates multi-vehicle collaborative perception for enhanced environmental understanding and collision avoidance. 564 | - Offers advanced safety features such as proactive remote safety warnings and blind spot alerts. 565 | - Serves traffic police and management agencies by providing real-time intelligent analysis of traffic safety factors.
566 | 567 | - [Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models](https://arxiv.org/abs/2401.00988) 568 | - Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li **CVPR 2024** 569 | - Publisher: Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab, Sun Yat-Sen University 570 | - Publish Date: 2023.12.21 571 | - Task: Datasets + VQA 572 | - Code: [official](https://github.com/xmed-lab/NuInstruct) 573 | - Summary: 574 | - Introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, which is based on [nuScenes](https://www.nuscenes.org/nuscenes). 575 | - Propose BEV-InMLMM to integrate instruction-aware BEV features with existing MLLMs, enhancing them with a full suite of information, including temporal, multi-view, and spatial details. 576 | 577 | - [LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning](https://arxiv.org/abs/2401.00125) 578 | - S P Sharan, Francesco Pittaluga, Vijay Kumar B G, Manmohan Chandraker 579 | - Publisher: UT Austin, NEC Labs America, UC San Diego 580 | - Publish Date: 2023.12.30 581 | - Task: Planning 582 | - Env/Datasets: nuPlan Closed-Loop Non-Reactive Challenge 583 | - Project: [LLM-ASSIST](https://llmassist.github.io/) 584 | - Summary: 585 | - The LLM planner takes over scenarios that PDM-Closed cannot handle. 586 | - Propose two LLM-based planners. 587 | - LLM-ASSIST(unc) considers the most unconstrained version of the planning problem, in which the LLM must directly return a safe future trajectory for the ego car. 588 | - LLM-ASSIST(par) considers a parameterized version of the planning problem, in which the LLM must only return a set of parameters for a rule-based planner, PDM-Closed. 589 | 590 | - [DriveLM: Driving with Graph Visual Question Answering](https://arxiv.org/pdf/2312.14150.pdf) 591 | - Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li **ECCV 2024** 592 | - Publisher: OpenDriveLab, University of Tübingen, Tübingen AI Center, University of Hong Kong 593 | - Code: [official](https://github.com/OpenDriveLab/DriveLM) 594 | - Publish Date: 2023.12.21 595 | - Summary: 596 | - DriveLM-Task 597 | - Graph VQA involves formulating P1-3 (Perception, Prediction, Planning) reasoning as a series of question-answer pairs (QAs) in a directed graph (a toy illustration of such a QA graph is sketched below). 598 | - DriveLM-Data 599 | - DriveLM-Carla 600 | - Collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework [17] with a privileged rule-based expert. 601 | - Drive-nuScenes 602 | - Selecting key frames from video clips, choosing key objects within these key frames, and subsequently annotating the frame-level P1−3 QAs for these key objects. A portion of the Perception QAs are generated from the nuScenes and [OpenLane-V2](https://github.com/OpenDriveLab/OpenLane-V2) ground truth, while the remaining QAs are manually annotated. 603 | - DriveLM-Agent 604 | - DriveLM-Agent is built upon a general vision-language model and can therefore exploit underlying knowledge gained during pre-training.
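To make the Graph VQA idea concrete, a minimal, hypothetical data structure for a directed graph of QA nodes spanning perception, prediction, and planning might look like the following; the field names and example questions are illustrative assumptions, not DriveLM's actual schema:

```python
# Illustrative Graph VQA structure; field names are NOT DriveLM's actual schema.
from dataclasses import dataclass, field

@dataclass
class QANode:
    stage: str                    # "perception" | "prediction" | "planning" | "behavior"
    question: str
    answer: str
    children: list["QANode"] = field(default_factory=list)  # edges to downstream reasoning steps

perception = QANode("perception", "What objects are ahead of the ego vehicle?",
                    "A pedestrian is crossing at the intersection.")
prediction = QANode("prediction", "What will the pedestrian do next?",
                    "They will keep crossing from right to left.")
planning = QANode("planning", "What should the ego vehicle do?",
                  "Slow down and yield until the crosswalk is clear.")
perception.children.append(prediction)   # perception grounds prediction
prediction.children.append(planning)     # prediction informs planning
```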
605 | 606 | - [LingoQA: Video Question Answering for Autonomous Driving](https://arxiv.org/abs/2312.14115) 607 | - Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Oleg Sinavski 608 | - Publisher: Wayve 609 | - Task: VQA + Evaluation/Datasets 610 | - Code: [official](https://github.com/wayveai/LingoQA) 611 | - Publish Date: 2023.12.21 612 | - Summary: 613 | - Introduce a novel benchmark for autonomous driving video QA using a learned text classifier for evaluation. 614 | - Introduce a video QA dataset of central London consisting of 419k samples with free-form questions and answers. 615 | - Establish a new baseline for this field, based on Vicuna-1.5-7B with an identified model combination. 616 | 617 | - [DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving](https://arxiv.org/abs/2312.09245) 618 | - Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai 619 | - Publisher: OpenGVLab, Shanghai AI Laboratory, The Chinese University of Hong Kong, SenseTime Research, Stanford University, Nanjing University, Tsinghua University 620 | - Task: Planning + Explanation 621 | - Code: [official](https://github.com/OpenGVLab/DriveMLM) 622 | - Env: [Carla](https://carla.org/) 623 | - Publish Date: 2023.12.14 624 | - Summary: 625 | - DriveMLM, the first LLM-based AD framework that can perform closed-loop 626 | autonomous driving in realistic simulators. 627 | - Design an MLLM planner for decision prediction, and develop a data engine that can effectively generate decision states and corresponding explanation annotations for model training and evaluation. 628 | - Achieve 76.1 DS and 0.955 MPI on CARLA Town05 Long. 629 | 630 | - [Large Language Models for Autonomous Driving: Real-World Experiments](https://arxiv.org/abs/2312.09397) 631 | - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, Tianren Gao, Erlong Li, Kun Tang, Zhipeng Cao, Tong Zhou, Ao Liu, Xinrui Yan, Shuqi Mei, Jianguo Cao, Ziran Wang, Chao Zheng 632 | - Publisher: Purdue University 633 | - Publish Date: 2023.12.14 634 | - Project: [official](https://www.youtube.com/playlist?list=PLgcRcf9w8BmJfZigDhk1SAfXV0FY65cO7) 635 | - Summary: 636 | - Introduce a Large Language Model (LLM)-based framework, Talk-to-Drive (Talk2Drive), to process verbal commands from humans and make autonomous driving decisions with contextual information, satisfying their personalized preferences for safety, efficiency, and comfort. 637 | 638 | - [LMDrive: Closed-Loop End-to-End Driving with Large Language Models](https://arxiv.org/abs/2312.07488) 639 | - Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, Hongsheng Li **CVPR 2024** 640 | - Publisher: CUHK MMLab, SenseTime Research, CPII under InnoHK, University of Toronto, Shanghai Artificial Intelligence Laboratory 641 | - Task: Planning + Datasets 642 | - Code: [official](https://github.com/opendilab/LMDrive) 643 | - Env: [Carla](https://carla.org/) 644 | - Publish Date: 2023.12.12 645 | - Summary: 646 | - LMDrive, a novel end-to-end, closed-loop, language-based autonomous driving framework. 647 | - Release a 64K-clip dataset, including navigation instructions, notice instructions, multi-modal multi-view sensor data, and control signals.
648 | - Present the benchmark LangAuto for evaluating autonomous agents. 649 | 650 | - [Evaluation of Large Language Models for Decision Making in Autonomous Driving](https://arxiv.org/pdf/2312.06351.pdf) 651 | - Kotaro Tanahashi, Yuichi Inoue, Yu Yamaguchi, Hidetatsu Yaginuma, Daiki Shiotsuka, Hiroyuki Shimatani, Kohei Iwamasa, Yoshiaki Inoue, Takafumi Yamaguchi, Koki Igari, Tsukasa Horinouchi, Kento Tokuhiro, Yugo Tokuchi, Shunsuke Aoki 652 | - Publisher: Turing Inc., Japan 653 | - Task: Evaluation 654 | - Publish Date: 2023.12.11 655 | - Summary: 656 | - Evaluate two core capabilities: 657 | - spatial-awareness decision-making, i.e., whether LLMs can accurately identify the spatial layout based on coordinate information; 658 | - the ability to follow traffic rules, i.e., whether LLMs strictly abide by traffic laws while driving. 659 | 660 | - [LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs](https://arxiv.org/abs/2312.04372) 661 | - Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang 662 | - Publisher: Purdue University, University of Illinois Urbana-Champaign, University of Virginia, InfoTech Labs, Toyota Motor North America 663 | - Task: Benchmark 664 | - Publish Date: 2023.12.07 665 | - Summary: 666 | - LaMPilot is the first interactive environment and dataset designed for evaluating LLM-based agents in a driving context. 667 | - It contains 4.9K scenes and is specifically designed to evaluate command-tracking tasks in autonomous driving. 668 | 669 | - [Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving](https://arxiv.org/abs/2312.03661) 670 | - Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang 671 | - Publisher: Fudan University, Huawei Noah’s Ark Lab 672 | - Task: VQA + Datasets 673 | - Code: [official](https://github.com/fudan-zvg/Reason2Drive) 674 | - Datasets: 675 | - [nuScenes](https://www.nuscenes.org/nuscenes) 676 | - [Waymo](https://waymo.com/open/) 677 | - [ONCE](https://once-for-auto-driving.github.io/index.html) 678 | - Publish Date: 2023.12.06 679 | - Summary: 680 | - Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. 681 | - Introduce a novel evaluation metric to assess chain-based reasoning performance in autonomous driving environments, and address the semantic ambiguities of existing metrics such as BLEU and CIDEr. 682 | - Introduce a straightforward yet effective framework that enhances existing VLMs with two new components: a prior tokenizer and an instructed vision decoder. 683 | 684 | - [GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models](https://arxiv.org/abs/2312.03543) 685 | - Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu 686 | - Publisher: University of Macau, UESTC, Chongqing University, Jilin University 687 | - Task: Detection/Prediction 688 | - Code: [official](https://github.com/Petrichor625/Talk2car_CAVG) 689 | - Datasets: 690 | - [Talk2car](https://github.com/talk2car/Talk2Car) 691 | - Publish Date: 2023.12.06 692 | - Summary: 693 | - Utilize five encoders, including Text, Image, Context, and Cross-Modal, together with a multimodal decoder to predict the object bounding box.
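The grounding model in the entry above fuses a command embedding with image features through cross-modal attention before regressing a bounding box for the referred object. The toy module below only illustrates that general pattern; the encoders, dimensions, and box parameterization are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyCommandGrounder(nn.Module):
    """Cross-attends a command embedding over image region features and regresses one box."""
    def __init__(self, text_dim=768, vis_dim=512, d_model=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4))  # (cx, cy, w, h)

    def forward(self, text_emb, region_feats):
        q = self.text_proj(text_emb).unsqueeze(1)   # (B, 1, d) command as the query
        kv = self.vis_proj(region_feats)            # (B, R, d) image region features
        fused, _ = self.cross_attn(q, kv, kv)       # command attends over the regions
        return self.box_head(fused.squeeze(1))      # (B, 4) predicted box

# Toy usage with random stand-ins for a sentence embedding and R=16 region features.
model = ToyCommandGrounder()
box = model(torch.randn(2, 768), torch.randn(2, 16, 512))
```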
694 | 695 | - [Dolphins: Multimodal Language Model for Driving](https://arxiv.org/abs/2312.00438) 696 | - Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao **ECCV 2024** 697 | - Publisher: University of Wisconsin-Madison, NVIDIA, University of Michigan, Stanford University 698 | - Task: VQA 699 | - Project: [Dolphins](https://vlm-driver.github.io/) 700 | - Code: [Dolphins](https://github.com/vlm-driver/Dolphins) 701 | - Datasets: 702 | - Image instruction-following dataset 703 | - [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html) 704 | - [MSCOCO](https://cocodataset.org/#home): [VQAv2](https://visualqa.org/), [OK-VQA](https://okvqa.allenai.org/), [TDIUC](https://kushalkafle.com/projects/tdiuc.html), [Visual Genome dataset](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html) 705 | - Video instruction-following dataset 706 | - [BDD-X](https://github.com/JinkyuKimUCB/BDD-X-dataset) 707 | - Publish Date: 2023.12.01 708 | - Summary: 709 | - Dolphins, which is based on the OpenFlamingo architecture, is a VLM-based conversational driving assistant. 710 | - Devise grounded CoT (GCoT) instruction tuning and develop the corresponding datasets. 711 | 712 | - [Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving](https://arxiv.org/abs/2311.17918) 713 | - Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang 714 | - Publisher: CASIA, CAIR, HKISI, CAS 715 | - Task: Generation 716 | - Project: [Drive-WM](https://drive-wm.github.io/) 717 | - Code: [Drive-WM](https://github.com/BraveGroup/Drive-WM) 718 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes), [Waymo Open Dataset](https://waymo.com/open/) 719 | - Publish Date: 2023.11.29 720 | - Summary: 721 | - Drive-WM, a multiview world model capable of generating high-quality, controllable, and consistent multiview videos in autonomous driving scenes. 722 | - The first to explore the potential application of the world model in end-to-end planning for autonomous driving. 723 | 724 | - [Empowering Autonomous Driving with Large Language Models: A Safety Perspective](https://arxiv.org/abs/2312.00812) 725 | - Yixuan Wang, Ruochen Jiao, Chengtian Lang, Sinong Simon Zhan, Chao Huang, Zhaoran Wang, Zhuoran Yang, Qi Zhu 726 | - Publisher: Northwestern University, University of Liverpool, Yale University 727 | - Task: Planning 728 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 729 | - Code: [official](https://github.com/wangyixu14/llm_conditioned_mpc_ad) 730 | - Publish Date: 2023.11.28 731 | - Summary: 732 | - Deploys the LLM as an intelligent decision-maker in planning, incorporating safety verifiers for contextual safety learning to enhance overall AD performance and safety. 733 | 734 | - [GPT-4V Takes the Wheel: Evaluating Promise and Challenges for Pedestrian Behavior Prediction](https://arxiv.org/abs/2311.14786) 735 | - Jia Huang, Peng Jiang, Alvika Gautam, Srikanth Saripalli 736 | - Publisher: Texas A&M University, College Station, USA 737 | - Task: Evaluation (Pedestrian Behavior Prediction) 738 | - Datasets: 739 | - [JAAD](https://data.nvision2.eecs.yorku.ca/JAAD_dataset/) 740 | - [PIE](https://data.nvision2.eecs.yorku.ca/PIE_dataset/) 741 | - [WiDEVIEW](https://github.com/unmannedlab/UWB_Dataset) 742 | - Summary: 743 | - Provides a comprehensive evaluation of the potential of GPT-4V for pedestrian behavior prediction in autonomous driving using publicly available datasets.
744 | - It still falls short of state-of-the-art traditional domain-specific models. 745 | - While GPT-4V represents a considerable advancement in AI capabilities for pedestrian behavior prediction, ongoing development and refinement are necessary to fully harness its capabilities in practical applications. 746 | 747 | - [ADriver-I: A General World Model for Autonomous Driving](https://arxiv.org/abs/2311.13549) 748 | - Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, Tiancai Wang 749 | - Publisher: MEGVII Technology, Waseda University, University of Science and Technology of China, Mach Drive 750 | - Task: Generation + Planning 751 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes), large-scale private datasets 752 | - Publish Date: 2023.11.22 753 | - Summary: 754 | - ADriver-I takes vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals, together with the historical vision-action pairs, are then used as conditions to predict the future frames. 755 | - MLLM (multimodal large language model) = [LLaVA-7B-1.5](https://github.com/haotian-liu/LLaVA), VDM (video diffusion model) = [latent-diffusion](https://github.com/CompVis/latent-diffusion) 756 | - Metrics: 757 | - L1 error of the speed and steering angle of the current frame. 758 | - Quality of generation: Frechet Inception Distance (FID), Frechet Video Distance (FVD). 759 | 760 | - [A Language Agent for Autonomous Driving](https://arxiv.org/abs/2311.10813) 761 | - Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang 762 | - Publisher: University of Southern California, Stanford University, NVIDIA 763 | - Task: Generation + Planning 764 | - Project: [Agent-Driver](https://usc-gvl.github.io/Agent-Driver/) 765 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 766 | - Publish Date: 2023.11.17 767 | - Summary: 768 | - Agent-Driver integrates a tool library for dynamic perception and prediction, a cognitive memory for human knowledge, and a reasoning engine that emulates human decision-making. 769 | - For motion planning, follow GPT-Driver (see the GPT-Driver entry below) and fine-tune the LLM with human driving trajectories in the nuScenes training set for one epoch. 770 | - For neural modules, adopt the modules in [UniAD](https://arxiv.org/abs/2212.10156). 771 | - Metrics: 772 | - L2 error (in meters) and collision rate (in percentage). 773 | 774 | - [Human-Centric Autonomous Systems With LLMs for User Command Reasoning](https://arxiv.org/abs/2311.08206) 775 | - Yi Yang, Qingwen Zhang, Ci Li, Daniel Simões Marta, Nazre Batool, John Folkesson 776 | - Publisher: KTH Royal Institute of Technology, Scania AB 777 | - Task: QA 778 | - Code: [DriveCmd](https://github.com/KTH-RPL/DriveCmd_LLM) 779 | - Datasets: [UCU Dataset](https://github.com/LLVM-AD/ucu-dataset) 780 | - Publish Date: 2023.11.14 781 | - Summary: 782 | - Propose to leverage the reasoning capabilities of Large Language Models (LLMs) to infer system requirements from in-cabin users’ commands. 783 | - LLVM-AD Workshop @ WACV 2024 784 | - Metrics: 785 | - Accuracy at the question level (accuracy for each individual question). 786 | - Accuracy at the command level (a command counts as correct only if all of its questions are answered correctly).
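The two accuracy levels in the entry above (also used by the UCU challenge in the Workshop section below) differ only in how per-question correctness is aggregated. Below is a small illustrative implementation; the result-record format is a hypothetical assumption, not the official evaluation script.

```python
from collections import defaultdict

def question_and_command_accuracy(results):
    """results: one dict per question, e.g. {"command_id": "c1", "correct": True}."""
    question_acc = sum(r["correct"] for r in results) / len(results)

    by_command = defaultdict(list)
    for r in results:
        by_command[r["command_id"]].append(r["correct"])
    # A command counts only if every one of its questions was answered correctly.
    command_acc = sum(all(v) for v in by_command.values()) / len(by_command)
    return question_acc, command_acc

# Toy example: command "c1" has one wrong answer, command "c2" is fully correct.
toy = [{"command_id": "c1", "correct": True}, {"command_id": "c1", "correct": False},
       {"command_id": "c2", "correct": True}, {"command_id": "c2", "correct": True}]
print(question_and_command_accuracy(toy))  # (0.75, 0.5)
```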
787 | 788 | - [On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving](https://arxiv.org/abs/2311.05332) 789 | - Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi 790 | - Publisher: Shanghai Artificial Intelligence Laboratory, GigaAI, East China Normal University, The Chinese University of Hong Kong, WeRide.ai 791 | - Project: [official](https://github.com/PJLab-ADG/GPT4V-AD-Exploration) 792 | - Datasets: 793 | - Scenario Understanding: [nuScenes](https://www.nuscenes.org/nuscenes), [BDD-X](https://github.com/JinkyuKimUCB/BDD-X-dataset), [Carla](https://github.com/carla-simulator), [TSDD](http://www.nlpr.ia.ac.cn/pal/trafficdata/detection.html), [Waymo](https://arxiv.org/abs/1912.04838), [DAIR-V2X](https://thudair.baai.ac.cn/index), [CitySim](https://github.com/ozheng1993/UCF-SST-CitySim-Dataset). 794 | - Reasoning Capability: [nuScenes](https://www.nuscenes.org/nuscenes), [D2-city](https://arxiv.org/abs/1904.01975), [Carla](https://github.com/carla-simulator), [CODA](https://arxiv.org/abs/2203.07724) and the internet 795 | - Act as a driver: Real-world driving scenarios. 796 | - Publish Date: 2023.11.09 797 | - Summary: 798 | - Conducted a comprehensive and multi-faceted evaluation of GPT-4V in various autonomous driving scenarios. 799 | - Test the capabilities of GPT-4V in scenario understanding, reasoning, and acting as a driver. 800 | 801 | - [ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt](https://ieeexplore.ieee.org/document/10286969) 802 | - Shiyi Wang, Yuxuan Zhu, Zhiheng Li, Yutong Wang, Li Li, Zhengbing He 803 | - Publisher: Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Massachusetts Institute of Technology 804 | - Task: Planning 805 | - Publish Date: 2023.10.17 806 | - Summary: 807 | - Design a universal framework that embeds an LLM as a vehicle "Co-Pilot" that can accomplish specific driving tasks while satisfying human intentions, based on the information provided. 808 | 809 | - [MagicDrive: Street View Generation with Diverse 3D Geometry Control](https://arxiv.org/abs/2310.02601) 810 | - Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu 811 | - Publisher: The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab 812 | - Task: Generation 813 | - Project: [MagicDrive](https://gaoruiyuan.com/magicdrive/) 814 | - Code: [MagicDrive](https://github.com/cure-lab/MagicDrive) 815 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 816 | - Publish Date: 2023.10.13 817 | - Summary: 818 | - MagicDrive generates highly realistic images, exploiting geometric information from 3D annotations by independently encoding road maps, object boxes, and camera parameters for precise, geometry-guided synthesis. This approach effectively solves the challenge of multi-camera view consistency. 819 | - It still faces large challenges in some complex scenes, such as night views and unseen weather conditions. 820 | 821 | - [Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles](https://arxiv.org/abs/2310.08034) 822 | - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang 823 | - Publisher: Purdue University, University of Illinois Urbana-Champaign, University of Virginia, PediaMed.AI
824 | - Task: Planning 825 | - Project: [video](https://www.youtube.com/playlist?list=PLgcRcf9w8BmLJi_fqTGq-7KCZsbpEIE4a) 826 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 827 | - Publish Date: 2023.10.12 828 | - Summary: 829 | - Utilize LLMs’ linguistic and contextual understanding abilities, together with specialized tools, to integrate the language and reasoning capabilities of LLMs into autonomous vehicles. 830 | 831 | - [DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model](https://arxiv.org/abs/2310.07771) 832 | - Xiaofan Li, Yifu Zhang, Xiaoqing Ye 833 | - Publisher: Baidu Inc. 834 | - Task: Generation 835 | - Project: [official](https://drivingdiffusion.github.io/) 836 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 837 | - Summary: 838 | - Address the new problem of multi-view video data generation from 3D layout in complex urban scenes. 839 | - Propose a generative model DrivingDiffusion to ensure the cross-view, cross-frame consistency and the instance quality of the generated videos. 840 | - Achieve state-of-the-art video synthesis performance on the nuScenes dataset. 841 | - Metrics: 842 | - Quality of generation: Frechet Inception Distance (FID), Frechet Video Distance (FVD) 843 | - Segmentation metrics: mIoU 844 | 845 | - [LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving](https://arxiv.org/pdf/2310.03026) 846 | - Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding 847 | - Publisher: Tsinghua University, The University of Hong Kong, University of California, Berkeley 848 | - Task: Planning/Control 849 | - Code: [official](https://sites.google.com/view/llm-mpc) 850 | - Env: 851 | - [ComplexUrbanScenarios](https://github.com/liuyuqi123/ComplexUrbanScenarios) 852 | - [Carla](https://github.com/carla-simulator) 853 | - Publish Date: 2023.10.04 854 | - Summary: 855 | - Leverage LLMs to provide high-level decisions through chain-of-thought reasoning. 856 | - Convert high-level decisions into mathematical representations to guide the bottom-level controller (MPC). 857 | - Metrics: Number of failure/collision cases, inefficiency, time, penalty 858 | 859 | - [Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving](https://browse.arxiv.org/abs/2310.01957) 860 | - Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton 861 | - Publisher: Wayve 862 | - Task: Planning + VQA 863 | - Code: [official](https://github.com/wayveai/Driving-with-LLMs) 864 | - Simulator: a custom-built realistic 2D simulator (not open source). 865 | - Datasets: [Driving QA](https://github.com/wayveai/Driving-with-LLMs/tree/main/data), data collected using RL experts in the simulator. 866 | - Publish Date: 2023.10.03 867 | - Summary: 868 | - Propose a unique object-level multimodal LLM architecture (Llama2 + LoRA), using only vectorized representations as input.
869 | - Develop a new dataset of 160k QA pairs derived from 10k driving scenarios (control commands collected by RL (PPO) experts, QA pairs generated by GPT-3.5). 870 | - Metrics: 871 | - Accuracy of traffic light detection 872 | - MAE for traffic light distance prediction 873 | - MAE for acceleration 874 | - MAE for brake pressure 875 | - MAE for steering wheel angle 876 | 877 | - [Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving](https://arxiv.org/abs/2310.02251) 878 | - Vikrant Dewangan, Tushar Choudhary, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna 879 | - Publisher: IIIT Hyderabad, University of British Columbia, University of Tartu, TensorTour Inc, MIT 880 | - Project Page: [official](https://llmbev.github.io/talk2bev/) 881 | - Code: [Talk2BEV](https://github.com/llmbev/talk2bev) 882 | - Publish Date: 2023.10.03 883 | - Summary: 884 | - Introduces Talk2BEV, a large vision-language model (LVLM) interface for bird’s-eye view (BEV) maps in autonomous driving contexts. 885 | - Does not require any training or finetuning, relying instead on pre-trained image-language models. 886 | - Develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the nuScenes dataset. 887 | 888 | - [DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model](https://arxiv.org/abs/2310.01412) 889 | - Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth K. Y. Wong, Zhenguo Li, Hengshuang Zhao 890 | - Publisher: The University of Hong Kong, Zhejiang University, Huawei Noah’s Ark Lab, University of Sydney 891 | - Project Page: [official](https://tonyxuqaq.github.io/projects/DriveGPT4/) 892 | - Task: Planning/Control + VQA 893 | - Datasets: 894 | - [BDD-X dataset](https://github.com/JinkyuKimUCB/BDD-X-dataset). 895 | - Publish Date: 2023.10.02 896 | - Summary: 897 | - Develop a new visual instruction tuning dataset (based on BDD-X) for interpretable AD, assisted by ChatGPT/GPT-4. 898 | - Present a novel multimodal LLM called DriveGPT4 (Valley + LLaVA). 899 | - Metrics: 900 | - BLEU4, CIDEr, METEOR, and ChatGPT score. 901 | - RMSE for control signal prediction. 902 | 903 | - [GPT-DRIVER: LEARNING TO DRIVE WITH GPT](https://browse.arxiv.org/abs/2310.01415v1) 904 | - Jiageng Mao, Yuxi Qian, Hang Zhao, Yue Wang 905 | - Publisher: University of Southern California, Tsinghua University 906 | - Task: Planning (fine-tuning a pre-trained model) 907 | - Project: [official](https://pointscoder.github.io/projects/gpt_driver/index.html) 908 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 909 | - Code: [GPT-Driver](https://github.com/PointsCoder/GPT-Driver) 910 | - Publish Date: 2023.10.02 911 | - Summary: 912 | - Formulate motion planning as a language modeling problem. 913 | - Align the output of the LLM with human driving behavior through fine-tuning strategies using the OpenAI fine-tuning API. 914 | - Leverage the LLM to generate driving trajectories.
915 | - Metrics: 916 | - L2 metric and collision rate 917 | 918 | - [GAIA-1: A Generative World Model for Autonomous Driving](https://arxiv.org/abs/2309.17080) 919 | - Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado 920 | - Publisher: Wayve 921 | - Task: Generation 922 | - Datasets: 923 | - The training dataset consists of 4,700 hours of proprietary driving data at 25 Hz, collected in London, 924 | UK, between 2019 and 2023. It corresponds to approximately 420M unique images. 925 | - The validation dataset contains 400 hours of driving data from runs not included in the training set. 926 | - Text comes from either online narration or offline metadata sources. 927 | - Publish Date: 2023.09.29 928 | - Summary: 929 | - Introduce GAIA-1, a generative world model that leverages video (pre-trained DINO), text (T5-large), and action inputs to generate realistic driving scenarios. 930 | - Serve as a valuable neural simulator, allowing the generation of unlimited data. 931 | 932 | - [DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models](https://arxiv.org/abs/2309.16292) 933 | - Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao **ICLR 2024** 934 | - Publisher: Shanghai AI Laboratory, East China Normal University, The Chinese University of Hong Kong 935 | - Publish Date: 2023.09.28 936 | - Task: Planning 937 | - Env: 938 | - [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 939 | - [CitySim](https://github.com/ozheng1993/UCF-SST-CitySim-Dataset), a drone-based vehicle trajectory dataset. 940 | - Summary: 941 | - Propose the DiLu framework, which combines a Reasoning module and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and to evolve continuously (a toy loop sketch follows this block). 942 | 943 | - [SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model](https://arxiv.org/abs/2309.13193) 944 | - Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong 945 | - Keywords: human-AI interaction, driver model, agent, generative AI, large language model, simulation framework 946 | - Env: [CARLA](https://github.com/carla-simulator) 947 | - Publisher: Tsinghua University 948 | - Summary: Propose a generative driver agent simulation framework based on large language models (LLMs), capable of perceiving complex traffic scenarios and providing realistic driving maneuvers. 949 | 950 | - [Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles](https://arxiv.org/abs/2309.10228) 951 | - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang 952 | - Publisher: Purdue University, PediaMed.AI Lab, University of Virginia 953 | - Task: Planning 954 | - Publish Date: 2023.09.18 955 | - Summary: 956 | - Provide a comprehensive framework for integrating Large Language Models (LLMs) into AD.
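DiLu's reasoning module, reflection module, and growing memory of past experiences (described in the DiLu entry above) suggest a simple closed loop: retrieve similar experiences, decide, act, reflect, and store the lesson. The skeleton below is only loosely in that spirit; the memory structure, the placeholder LLM calls, and the toy environment are all assumptions, not DiLu's actual prompts or implementation.

```python
import random

class ExperienceMemory:
    """Toy memory: stores (situation, decision, lesson) tuples and returns the latest few."""
    def __init__(self):
        self.items = []
    def retrieve(self, situation, k=3):
        return self.items[-k:]
    def store(self, situation, decision, lesson):
        self.items.append((situation, decision, lesson))

def llm_reason(situation, examples):
    # Placeholder for an LLM call that picks a high-level maneuver given few-shot examples.
    return random.choice(["keep_lane", "slow_down", "change_lane_left"])

def llm_reflect(decision, outcome):
    # Placeholder for an LLM call that turns the outcome into a reusable lesson.
    return f"'{decision}' led to outcome '{outcome}'"

def run_episode(memory, steps=5):
    situation = "two-lane highway, slow truck ahead"
    for _ in range(steps):
        examples = memory.retrieve(situation)        # few-shot experiences
        decision = llm_reason(situation, examples)   # reasoning module
        outcome = "safe" if decision != "change_lane_left" else "near_miss"   # toy environment
        memory.store(situation, decision, llm_reflect(decision, outcome))     # reflection module

memory = ExperienceMemory()
run_episode(memory)
print(len(memory.items), "experiences stored")
```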
957 | 958 | - [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) 959 | - Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu **ECCV 2024** 960 | - Publisher: GigaAI, Tsinghua University 961 | - Task: Generation 962 | - Project Page: [official](https://drivedreamer.github.io/) 963 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 964 | - Publish Date: 2023.09.18 965 | - Summary: 966 | - Harness the powerful diffusion model to construct a comprehensive representation of the complex environment. 967 | - Generate future driving videos and driving policies with a multimodal (text, image, HD map, action, 3D box) world model. 968 | 969 | - [Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving](https://arxiv.org/abs/2309.05282) 970 | - Ali Keysan, Andreas Look, Eitan Kosman, Gonca Gürsun, Jörg Wagner, Yu Yao, Barbara Rakitsch 971 | - Publisher: Bosch Center for Artificial Intelligence, University of Tübingen 972 | - Task: Prediction 973 | - Datasets: [nuScenes](https://www.nuscenes.org/nuscenes) 974 | - Publish Date: 2023.09.13 975 | - Summary: 976 | - Integrate pre-trained language models as text-based input encoders for the AD trajectory prediction task. 977 | - Metrics (see the metric sketch after this block): 978 | - Minimum Average Displacement Error (minADEk) 979 | - Minimum Final Displacement Error (minFDEk) 980 | - Miss rate over 2 meters 981 | 982 | - [TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models](https://arxiv.org/abs/2309.06719) 983 | - Siyao Zhang, Daocheng Fu, Zhao Zhang, Bin Yu, Pinlong Cai 984 | - Publisher: Beihang University, Key Laboratory of Intelligent Transportation Technology and System, Shanghai Artificial Intelligence Laboratory 985 | - Task: Planning 986 | - Code: [official](https://github.com/lijlansg/TrafficGPT.git) 987 | - Publish Date: 2023.09.13 988 | - Summary: 989 | - Present TrafficGPT, a fusion of ChatGPT and traffic foundation models. 990 | - Bridges the critical gap between large language models and traffic foundation models by defining a series of prompts. 991 | 992 | - [HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving](https://arxiv.org/abs/2309.05186) 993 | - Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li 994 | - Publisher: The Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab 995 | - Task: Detection + VQA 996 | - Datasets: [DRAMA](https://usa.honda-ri.com/drama) 997 | - Publish Date: 2023.09.11 998 | - Summary: 999 | - Propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. 1000 | - ROLISP aims to identify, explain, and localize the risk object for the ego vehicle while predicting its intention and giving suggestions. 1001 | - Metrics: 1002 | - LLM metrics: BLEU4, CIDEr, METEOR, and SPICE. 1003 | - Detection metrics: mIoU, IoU, and so on.
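The trajectory-prediction metrics listed in the "Can you text what is happening?" entry above (minADEk, minFDEk, miss rate over 2 m) have compact standard definitions over K candidate trajectories. Below is an illustrative NumPy implementation; the array shapes are assumptions for the sketch, not the nuScenes evaluation code.

```python
import numpy as np

def prediction_metrics(pred, gt, miss_threshold=2.0):
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground-truth trajectory.

    Returns (minADE_k, minFDE_k, missed), where `missed` is True if even the best
    candidate's final-point error exceeds the threshold (2 meters by convention).
    """
    dists = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T) pointwise L2 errors
    min_ade = dists.mean(axis=1).min()                 # best average displacement
    min_fde = dists[:, -1].min()                       # best final displacement
    return min_ade, min_fde, bool(min_fde > miss_threshold)

# Toy example: 3 candidate futures over a 6-step horizon around a straight-line ground truth.
gt = np.stack([np.arange(6), np.arange(6)], axis=-1).astype(float)
pred = gt[None] + np.random.randn(3, 6, 2) * 0.5
print(prediction_metrics(pred, gt))
```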
1004 | 1005 | - [Language Prompt for Autonomous Driving](https://arxiv.org/abs/2309.04379) 1006 | - Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen 1007 | - Publisher: Beijing Institute of Technology, University of Macau, MEGVII Technology, Beijing Academy of Artificial Intelligence 1008 | - Task: Tracking 1009 | - Code: [official](https://github.com/wudongming97/Prompt4Driving) 1010 | - Datasets: NuPrompt (not open), based on [nuScenes](https://www.nuscenes.org/nuscenes). 1011 | - Publish Date: 2023.09.08 1012 | - Summary: 1013 | - Propose a new large-scale language prompt set (based on nuScenes) for driving scenes, named NuPrompt (3D object-text pairs). 1014 | - Propose an efficient prompt-based tracking model, called PromptTrack, which adds a prompt reasoning modification to PFTrack. 1015 | 1016 | - [MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections](https://arxiv.org/abs/2307.16118) 1017 | - Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, Jian Sun. *ITSC 2023* 1018 | - Publisher: Tongji University, Tsinghua University 1019 | - Task: Prediction 1020 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 1021 | - Publish Date: 2023.07.30 1022 | - Summary: 1023 | - Design a pipeline that leverages RL algorithms to train single-task decision-making experts and utilize the resulting expert data. 1024 | - Propose the MTD-GPT model for multi-task (left-turn, straight-through, right-turn) decision-making of AVs at unsignalized intersections. 1025 | 1026 | - [Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain](https://arxiv.org/abs/2307.11769) 1027 | - Yun Tang, Antonio A. Bruto da Costa, Xizhe Zhang, Irvine Patrick, Siddartha Khastgir, Paul Jennings. *ITSC 2023* 1028 | - Publisher: University of Warwick 1029 | - Task: QA 1030 | - Publish Date: 2023.07.17 1031 | - Summary: 1032 | - Develop a web-based distillation assistant that enables supervision and flexible intervention at runtime, built on prompt engineering and the LLM ChatGPT. 1033 | 1034 | - [Drive Like a Human: Rethinking Autonomous Driving with Large Language Models](https://browse.arxiv.org/abs/2307.07162) 1035 | - Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao 1036 | - Publisher: Shanghai AI Lab, East China Normal University 1037 | - Task: Planning 1038 | - Code: [official](https://github.com/PJLab-ADG/DriveLikeAHuman) 1039 | - Env: [HighwayEnv](https://github.com/Farama-Foundation/HighwayEnv) 1040 | - Publish Date: 2023.07.14 1041 | - Summary: 1042 | - Identify three key abilities: Reasoning, Interpretation, and Memorization (accumulating experience and self-reflection). 1043 | - Utilize the LLM as the decision-maker in AD to solve long-tail corner cases and increase interpretability. 1044 | - Verify interpretability in closed-loop offline data. 1045 | 1046 | - [Language-Guided Traffic Simulation via Scene-Level Diffusion](https://arxiv.org/abs/2306.06344) 1047 | - Ziyuan Zhong, Davis Rempe, Yuxiao Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco Pavone, Baishakhi Ray 1048 | - Publisher: Columbia University, NVIDIA Research, Stanford University, Georgia Tech 1049 | - Task: Diffusion 1050 | - Publish Date: 2023.07.10 1051 | - Summary: 1052 | - Present CTG++, a language-guided scene-level conditional diffusion model for realistic, query-compliant traffic simulation.
1053 | - Leverage an LLM to translate a user query into a differentiable loss function, and propose a scene-level conditional diffusion model (with a spatial-temporal transformer architecture) to translate the loss function into realistic, query-compliant trajectories (a toy loss example follows this block). 1054 | 1055 | - [ADAPT: Action-aware Driving Caption Transformer](https://arxiv.org/abs/2302.00673) 1056 | - Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, Jingjing Liu **ICRA 2023** 1057 | - Publisher: Chinese Academy of Sciences, Tsinghua University, Peking University, Xidian University, Southern University of Science and Technology, Beihang University 1058 | - Code: [ADAPT](https://github.com/jxbbb/ADAPT) 1059 | - Datasets: [BDD-X dataset](https://github.com/JinkyuKimUCB/BDD-X-dataset) 1060 | - Summary: 1061 | - Propose ADAPT, a new end-to-end transformer-based action narration and reasoning framework for 1062 | self-driving vehicles. 1063 | - Propose a multi-task joint training framework that aligns the driving action captioning task and the control signal prediction task. 1064 |
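The CTG++ entry above describes translating a natural-language query into a differentiable loss that guides the diffusion sampler toward compliant trajectories. Below is a toy example of the kind of loss an LLM might emit for a query like "keep the ego vehicle under 10 m/s"; the function signature, trajectory layout, and guidance step are illustrative assumptions, not the paper's code.

```python
import torch

def speed_limit_loss(traj, v_max=10.0, dt=0.5):
    """Differentiable penalty for exceeding v_max.

    traj: (B, T, 2) x/y positions; speeds are finite differences between consecutive steps.
    """
    vel = (traj[:, 1:] - traj[:, :-1]) / dt          # (B, T-1, 2) per-step velocities
    speed = vel.norm(dim=-1)                         # (B, T-1) speeds in m/s
    return torch.relu(speed - v_max).pow(2).mean()   # zero when the query is satisfied

# Guidance-style usage: the gradient w.r.t. the trajectory can nudge diffusion samples.
traj = (torch.randn(4, 20, 2) * 5.0).requires_grad_()
loss = speed_limit_loss(traj)
loss.backward()   # stepping against traj.grad moves samples toward query compliance
```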
1065 | 1066 | ## Workshop 1067 |
1068 | 1069 | 1070 | - [Large Language and Vision Models for Autonomous Driving (LLVM-AD) Workshop @ WACV 2024](https://llvm-ad.github.io/) 1071 | - Publisher: Tencent Maps HD Map T.Lab, University of Illinois Urbana-Champaign, Purdue University, University of Virginia 1072 | - Challenge 1: MAPLM: A Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding 1073 | - Datasets: [Download](https://drive.google.com/drive/folders/1cqFjBH8MLeP6nKFM0l7oV-Srfke-Mx1R?usp=sharing) 1074 | - Task: QA 1075 | - Code: [official](https://github.com/LLVM-AD/MAPLM) 1076 | - Description: MAPLM combines point cloud BEV (Bird's Eye View) and panoramic images to provide a rich collection of road scenario images. It includes multi-level scene description data, which helps models navigate through complex and diverse traffic environments. 1077 | - Metric: 1078 | - Frame-overall-accuracy (FRM): A frame is considered correct if all closed-choice questions about it are answered correctly. 1079 | - Question-overall-accuracy (QNS): A question is considered correct if its answer is correct. 1080 | - LAN: How many lanes in current road? 1081 | - INT: Is there any road cross, intersection or lane change zone in the main road? 1082 | - QLT: What is the point cloud data quality in current road area of this image? 1083 | - SCN: What kind of road scene is it in the images? 1084 | - Challenge 2: In-Cabin User Command Understanding (UCU) 1085 | - Datasets: [Download](https://github.com/LLVM-AD/ucu-dataset/blob/main/ucu.csv) 1086 | - Task: QA 1087 | - Code: [official](https://github.com/LLVM-AD/ucu-dataset) 1088 | - Description: 1089 | - This dataset focuses on understanding user commands in the context of autonomous vehicles. It contains 1,099 labeled commands. Each command is a sentence that describes a user’s request to the vehicle. 1090 | - Metric: 1091 | - Command-level accuracy: A command is considered correctly understood if all eight answers are correct. 1092 | - Question-level accuracy: Evaluation at the individual question level. 1093 |
1094 | 1095 | ## Datasets 1096 |
1097 | 1098 | 1099 | ``` 1100 | format: 1101 | - [title](dataset link) [links] 1102 | - author1, author2, and author3... 1103 | - keyword 1104 | - experiment environments or tasks 1105 | ``` 1106 | - [Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning](https://arxiv.org/abs/2309.06597) 1107 | - Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, Mykel Kochenderfer 1108 | - Publisher: Honda Research Institute, Stanford University 1109 | - Publish Date: 2023.09.10 1110 | - Summary: 1111 | - A multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. 1112 | - Introduce a joint model for importance-level ranking and natural-language caption generation to benchmark the dataset. 1113 | 1114 | - [DriveLM: Drive on Language](https://github.com/OpenDriveLab/DriveLM) 1115 | - Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li **ECCV 2024** 1116 | - Dataset: [DriveLM](https://github.com/OpenDriveLab/DriveLM/blob/main/docs/getting_started.md#download-data) 1117 | - Publish Date: 2023.08 1118 | - Summary: 1119 | - Construct a dataset based on the nuScenes dataset. 1120 | - Perception questions require the model to recognize objects in the scene. 1121 | - Prediction questions ask the model to predict the future status of important objects in the scene. 1122 | - Planning questions prompt the model to give reasonable planning actions and avoid dangerous ones. 1123 | 1124 | - [WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models](https://browse.arxiv.org/abs/2305.07528) 1125 | - Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha **CVPR 2023** 1126 | - Publisher: Carnegie Mellon University, Symbiosis International University 1127 | - Dataset: [WEDGE](https://github.com/Infernolia/WEDGE) 1128 | - Publish Date: 2023.05.12 1129 | - Summary: 1130 | - A multi-weather autonomous driving dataset built from generative vision-language models. 1131 | 1132 | - [NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario](https://arxiv.org/abs/2305.14836) 1133 | - Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang 1134 | - Publisher: Fudan University 1135 | - Dataset: [NuScenes-QA](https://github.com/qiantianwen/NuScenes-QA) 1136 | - Summary: 1137 | - NuScenes-QA provides 459,941 question-answer pairs based on 34,149 visual scenes: 376,604 questions from 28,130 scenes are used for training, and 83,337 questions from 6,019 scenes are used for testing. 1138 | - The multi-view images and point clouds are first processed by the feature extraction backbone 1139 | to obtain BEV features. 1140 | 1141 | - [DRAMA: Joint Risk Localization and Captioning in Driving](https://arxiv.org/abs/2209.10767) 1142 | - Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, Jiachen Li 1143 | - Publisher: Honda Research Institute USA 1144 | - Datasets: [DRAMA](https://usa.honda-ri.com/drama#Introduction) 1145 | - Summary: 1146 | - Introduce a novel dataset, DRAMA, that provides linguistic descriptions (with a focus on reasons) of driving risks associated with important objects and that can be used to evaluate a range of visual captioning capabilities in driving scenarios.
1147 | 1148 | - [Language Prompt for Autonomous Driving](https://arxiv.org/abs/2309.04379) 1149 | - Datasets: NuPrompt (not open) 1150 | - [Previous summary](#LanguagePrompt) 1151 | 1152 | - [Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving](https://browse.arxiv.org/abs/2310.01957) 1153 | - Datasets: [official](https://github.com/wayveai/Driving-with-LLMs/tree/main/data), data collected using RL experts in the simulator. 1154 | - [Previous summary](#DrivingwithLLMs) 1155 | 1156 | - [Textual Explanations for Self-Driving Vehicles](https://arxiv.org/abs/1807.11546) 1157 | - Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata **ECCV 2018** 1158 | - Publisher: University of California, Berkeley, Saarland Informatics Campus, University of Amsterdam 1159 | - [BDD-X dataset](https://github.com/JinkyuKimUCB/BDD-X-dataset) 1160 | 1161 | - [Grounding Human-To-Vehicle Advice for Self-Driving Vehicles](https://arxiv.org/abs/1911.06978) 1162 | - Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, John Canny **CVPR 2019** 1163 | - Publisher: UC Berkeley, Honda Research Institute USA, Inc. 1164 | - [HAD dataset](https://usa.honda-ri.com/had) 1165 |
1166 | 1167 | 1168 | ## License 1169 | 1170 | Awesome LLM for Autonomous Driving Resources is released under the Apache 2.0 license. 1171 | -------------------------------------------------------------------------------- /assets/llm4adpipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Thinklab-SJTU/Awesome-LLM4AD/9b30f29334a738f2ef20fe498967a278f08a31b6/assets/llm4adpipeline.png -------------------------------------------------------------------------------- /assets/whyllmenhance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Thinklab-SJTU/Awesome-LLM4AD/9b30f29334a738f2ef20fe498967a278f08a31b6/assets/whyllmenhance.png --------------------------------------------------------------------------------