BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

Authors

  • Tao Tang Shenzhen Campus of Sun Yat-sen University
  • Dafeng Wei Li Auto Inc.
  • Zhengyu Jia Li Auto Inc.
  • Tian Gao Li Auto Inc.
  • Changwei Cai Li Auto Inc.
  • Chengkai Hou Li Auto Inc.
  • Peng Jia Li Auto Inc.
  • Kun Zhan Li Auto Inc.
  • Haiyang Sun Li Auto Inc.
  • Fan JingChen Li Auto Inc.
  • Yixing Zhao Li Auto Inc.
  • Xiaodan Liang Shenzhen Campus of Sun Yat-sen University
  • Xianpeng Lang Li Auto Inc.
  • Yang Wang Li Auto Inc.

DOI:

https://doi.org/10.1609/aaai.v39i7.32782

Abstract

The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there is a growing demand for retrieving data to support specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of a global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, we first propose the BEV-TSR framework, which takes descriptive text as input to retrieve corresponding scenes in the Bird’s Eye View (BEV) space. Then, to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To align the BEV features with the language embedding, we propose Shared Cross-modal Embedding, which uses a set of shared learnable embeddings to bridge the gap between the two modalities, and we employ a caption generation task to further enhance the alignment. Furthermore, there is a lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene retrieval, respectively.
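
The top-1 accuracy figures quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: it assumes that text and scene embeddings already live in one shared space, that pairs are matched one-to-one by index, and that cosine similarity is used for ranking. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_accuracy(queries, gallery):
    """Fraction of queries whose most similar gallery item shares their index.

    Assumption: queries[i] and gallery[i] form the ground-truth pair.
    """
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        if best == i:
            hits += 1
    return hits / len(queries)

# Toy embeddings (hypothetical values, not taken from the paper).
text_embs = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]
scene_embs = [[0.9, 0.0], [0.0, 0.8], [0.6, 0.8]]

# Text-to-scene: each text queries the scene gallery.
text_to_scene = top1_accuracy(text_embs, scene_embs)
# Scene-to-text: each scene queries the text gallery.
scene_to_text = top1_accuracy(scene_embs, text_embs)
```

Swapping the roles of query and gallery gives the two retrieval directions, which is why the two reported accuracies can differ on real data.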

Published

2025-04-11

How to Cite

Tang, T., Wei, D., Jia, Z., Gao, T., Cai, C., Hou, C., … Wang, Y. (2025). BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7275–7283. https://doi.org/10.1609/aaai.v39i7.32782

Issue

Section

AAAI Technical Track on Computer Vision VI