Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception
DOI:
https://doi.org/10.1609/aaai.v38i6.28373
Keywords:
CV: Applications, CV: Language and Vision, CV: Large Vision Models, CV: Interpretability, Explainability, and Transparency
Abstract
Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks, which can lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. Specifically, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. More comprehensive knowledge, distilled from the large CLIP model based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset, termed Autobot1M, is built to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.
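The abstract describes a pre-training objective that combines MAE-style masked reconstruction with sketch-line (spatial structure) supervision and CLIP-based knowledge distillation. The toy sketch below illustrates one plausible way such a combined loss could be assembled; all shapes, names, and loss weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed shapes, not the paper's actual model):
# 16 patches of 8-dim "pixel" features per vehicle image.
num_patches, dim = 16, 8
target_pixels = rng.normal(size=(num_patches, dim))   # ground-truth patches
target_sketch = rng.normal(size=(num_patches, dim))   # sketch-line targets
pred_pixels = target_pixels + 0.1 * rng.normal(size=(num_patches, dim))
pred_sketch = target_sketch + 0.1 * rng.normal(size=(num_patches, dim))

# Random high-ratio mask: MAE-style reconstruction losses are computed
# on masked patches only.
mask = rng.random(num_patches) < 0.75

def masked_mse(pred, target, mask):
    """Mean squared error restricted to the masked patches."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

l_pixel = masked_mse(pred_pixels, target_pixels, mask)
l_sketch = masked_mse(pred_sketch, target_sketch, mask)

# CLIP-style distillation term: pull the student's image embedding toward
# the teacher's embedding for the paired text (1 - cosine similarity).
student_emb = rng.normal(size=dim)
teacher_emb = student_emb + 0.05 * rng.normal(size=dim)
cos = np.dot(student_emb, teacher_emb) / (
    np.linalg.norm(student_emb) * np.linalg.norm(teacher_emb))
l_distill = 1.0 - float(cos)

# The loss weights here are arbitrary illustrative choices.
total_loss = l_pixel + 0.5 * l_sketch + 0.1 * l_distill
print(f"pixel={l_pixel:.4f} sketch={l_sketch:.4f} "
      f"distill={l_distill:.4f} total={total_loss:.4f}")
```

In practice the reconstruction targets, the sketch extractor, and the distillation objective would all be defined by the VehicleMAE architecture itself; the point here is only the structure of a multi-term pre-training loss.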
Published
2024-03-24
How to Cite
Wang, X., Wu, W., Li, C., Zhao, Z., Chen, Z., Shi, Y., & Tang, J. (2024). Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5624–5632. https://doi.org/10.1609/aaai.v38i6.28373
Section
AAAI Technical Track on Computer Vision V