World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving
DOI:
https://doi.org/10.1609/aaai.v39i9.33067
Abstract
Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly for reasoning tasks within perceivable regions. However, when faced with perception-limited areas (regions of dynamic or static occlusion), MLLMs struggle to integrate their perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially concerning vulnerable road users. In this paper, we propose a framework that improves autonomous driving performance under perception-limited conditions by strengthening the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural-language QA pairs and 1.7 million grounding task samples. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, whose questions require multi-step reasoning over world knowledge to resolve. Extensive experiments validate the effectiveness of our proposed method.
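As a rough illustration of how a plug-and-play instruction-guided interaction module of this kind might compress multi-view video tokens, the sketch below uses a small set of learnable queries that are first conditioned on the text instruction and then cross-attend to the flattened visual tokens, so the language model receives a short fixed-length sequence. This is our own assumption of one plausible design, not the authors' released code; the class name, dimensions, and query count are hypothetical.

```python
# Hypothetical sketch (not the paper's implementation): instruction-conditioned
# queries compress long multi-view video token sequences via cross-attention.
import torch
import torch.nn as nn


class InstructionGuidedInteractor(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, hidden_dim=768,
                 num_queries=32, num_heads=8):
        super().__init__()
        # Learnable queries that will carry the compressed visual summary.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        # Project visual and instruction tokens into a shared hidden space.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        # Queries first read the instruction, then the visual tokens.
        self.txt_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, vis_tokens, txt_tokens):
        """
        vis_tokens: (B, N_vis, vis_dim)  flattened multi-view video features
        txt_tokens: (B, N_txt, txt_dim)  instruction embeddings
        returns:    (B, num_queries, hidden_dim) compressed visual sequence
        """
        batch = vis_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        txt = self.txt_proj(txt_tokens)
        vis = self.vis_proj(vis_tokens)
        # Condition the queries on the instruction (what to look for).
        q = self.norm1(q + self.txt_attn(q, txt, txt)[0])
        # Gather instruction-relevant evidence from all views and frames.
        q = self.norm2(q + self.vis_attn(q, vis, vis)[0])
        return q


if __name__ == "__main__":
    # e.g. 6 views x 8 frames x 196 patch tokens (~9.4k tokens) are reduced
    # to a fixed 32-token sequence regardless of the input length.
    module = InstructionGuidedInteractor()
    vis = torch.randn(2, 6 * 8 * 196, 1024)
    txt = torch.randn(2, 24, 768)
    print(module(vis, txt).shape)  # torch.Size([2, 32, 768])
```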
Published
2025-04-11
How to Cite
Zhai, M., Li, C., Guo, Z., Yang, N., Qin, X., Zhao, S., … Jia, Y. (2025). World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9842–9850. https://doi.org/10.1609/aaai.v39i9.33067
Issue
Vol. 39 No. 9 (2025)
Section
AAAI Technical Track on Computer Vision VIII