RobusTor3D: Robust Multimodal 3D Object Detector for Autonomous Driving by Vision-Language Knowledge Blending

Authors

  • Ying Yang — State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University; School of Computer Science & Technology, Beijing Jiaotong University
  • Hui Yin — State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University; Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University
  • Aixin Chong — School of Computer Science & Technology, Shandong University of Technology
  • Hui Wang — School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast
  • Zhengyin Liang — School of Computer Science & Technology, Beijing Jiaotong University; Key Laboratory of Beijing for Railway Engineering, Beijing Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v40i14.38163

Abstract

Multimodal 3D object detection for autonomous driving is a task destined for real-world deployment, where maintaining robust performance under diverse perturbations and complex environmental conditions remains a substantial challenge. Most existing approaches, however, focus on performance optimization in relatively ideal scenarios, or address only one or a few adverse conditions, lacking a systematic exploration of robustness to real-world factors such as severe class imbalance, adverse weather, sensor jitter and failure, and significant scene variation. To address this issue, we propose a robust multimodal 3D detector, termed RobusTor3D, which builds in robustness at both the structural and supervisory levels by blending knowledge from Vision-Language Models (VLMs). Structurally, textual descriptions are incorporated to enrich the semantic richness and diversity of rare classes; this semantic injection compensates for the inherent class imbalance and the modality weaknesses of conventional visual features. In addition, the semantic alignment capability and robust representations obtained through Vision-Language Knowledge Extraction (V-LKE) serve as semantic priors that complement modality-specific representations, significantly improving model adaptability. At the supervisory level, we propose a Scene-level Multimodal Consistency Learning (SMCL) strategy that jointly enforces global semantic constraints across modalities, encouraging the model to learn stable and rich semantic representations. This design reduces sensitivity to spatial misalignment and, notably, enables semantic compensation when a modality is lost. Extensive robustness experiments on the KITTI, KITTI-C, and CADC benchmarks cover five robustness aspects: the long-tail problem, adverse weather (rain, snow, fog, strong sunlight), sensor spatial misalignment and motion blur, modality loss, and cross-domain scenarios.
The results show that RobusTor3D achieves superior robustness across all five evaluated aspects, consistently outperforming state-of-the-art methods under various challenging conditions.
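The abstract describes SMCL only at a high level: a scene-level consistency constraint across modalities that avoids dependence on fine-grained spatial alignment. As a minimal sketch of how such a constraint could look (an illustrative assumption, not the authors' implementation), one can pool each modality's features into a single scene embedding and penalize pairwise cosine distance between the pooled vectors; all function and variable names here are hypothetical.

```python
import numpy as np

def _pool_and_normalize(feats: np.ndarray) -> np.ndarray:
    """Global-average-pool an (N, D) feature map into a unit-norm scene vector."""
    v = feats.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def scene_consistency_loss(image_feats: np.ndarray,
                           lidar_feats: np.ndarray,
                           text_feats: np.ndarray) -> float:
    """Mean cosine distance between every pair of pooled modality embeddings.

    Because each modality is pooled to one global scene vector before
    comparison, the loss requires no point-to-pixel correspondences, which is
    one plausible way a scene-level constraint could reduce sensitivity to
    spatial misalignment between sensors.
    """
    vecs = [_pool_and_normalize(f) for f in (image_feats, lidar_feats, text_feats)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    return float(np.mean([1.0 - vecs[i] @ vecs[j] for i, j in pairs]))
```

The loss is 0 when the three pooled embeddings coincide and grows as they diverge, so minimizing it pushes all modalities toward a shared scene-level semantic representation.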

Published

2026-03-14

How to Cite

Yang, Y., Yin, H., Chong, A., Wang, H., & Liang, Z. (2026). RobusTor3D: Robust Multimodal 3D Object Detector for Autonomous Driving by Vision-Language Knowledge Blending. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11775-11783. https://doi.org/10.1609/aaai.v40i14.38163

Section

AAAI Technical Track on Computer Vision XI