SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition

Authors

  • Zhaoxin Fan Key Laboratory of Data Engineering and Knowledge Engineering of MOE, School of Information, Renmin University of China, 100872, Beijing, China
  • Zhenbo Song School of Computer Science and Engineering, Nanjing University of Science and Technology, 210094, Nanjing, China
  • Hongyan Liu Department of Management Science and Engineering, Tsinghua University, 100084, Beijing, China
  • Zhiwu Lu Gaoling School of Artificial Intelligence, Renmin University of China, 100872, Beijing, China
  • Jun He Key Laboratory of Data Engineering and Knowledge Engineering of MOE, School of Information, Renmin University of China, 100872, Beijing, China
  • Xiaoyong Du Key Laboratory of Data Engineering and Knowledge Engineering of MOE, School of Information, Renmin University of China, 100872, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v36i1.19934

Keywords:

Computer Vision (CV)

Abstract

Simultaneous Localization and Mapping (SLAM) and autonomous driving have become increasingly important in recent years, and point cloud-based large-scale place recognition is their backbone. While many models achieve acceptable performance by learning short-range local features, they often neglect long-range contextual properties. Moreover, model size has become a serious obstacle to wide deployment. To overcome these challenges, we propose a super-lightweight network model termed SVT-Net. On top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn short-range local features and long-range contextual features, respectively. Composed of ASVT and CSVT, SVT-Net achieves state-of-the-art performance in both recognition accuracy and running speed with a super-light model size (0.9M parameters). To further boost efficiency, we also introduce two simplified versions, which likewise achieve state-of-the-art performance while reducing the model size to 0.8M and 0.4M parameters, respectively.
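The two attention schemes named in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy in plain NumPy, not the paper's actual implementation (which operates on sparse voxel tensors with learned query/key/value projections): `atom_attention` lets every occupied voxel ("atom") attend to all others, while `cluster_attention` pools voxels into a few latent clusters, runs attention among the clusters, and scatters the contextual result back. All function names and shapes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def atom_attention(feats):
    # ASVT-style idea (sketch): each of the N occupied voxels attends to
    # all others via scaled dot-product attention. feats: (N, d).
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    return softmax(scores, axis=-1) @ feats          # (N, d)

def cluster_attention(feats, assign):
    # CSVT-style idea (sketch): pool N voxels into K latent clusters,
    # attend among clusters, then scatter context back to the voxels.
    # assign: (N, K) soft assignment of voxels to clusters (rows sum to 1).
    clusters = assign.T @ feats                      # (K, d) pooled features
    scores = clusters @ clusters.T / np.sqrt(feats.shape[1])
    context = softmax(scores, axis=-1) @ clusters    # (K, d) cluster context
    return assign @ context                          # (N, d) back to voxels
```

The cluster variant trades exact pairwise interaction for cost: attention is computed over K clusters instead of N voxels, which is the kind of saving that keeps the model size and runtime small.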

Published

2022-06-28

How to Cite

Fan, Z., Song, Z., Liu, H., Lu, Z., He, J., & Du, X. (2022). SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(1), 551-560. https://doi.org/10.1609/aaai.v36i1.19934

Section

AAAI Technical Track on Computer Vision I