STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract)

Authors

  • Zhuo Chen Shenzhen International Graduate School, Tsinghua University
  • Haimei Zhao University of Sydney
  • Bo Yuan University of Queensland
  • Xiu Li Shenzhen International Graduate School, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v38i21.30429

Keywords:

Multi-camera Depth Estimation, Self-supervised Depth Estimation, Depth Estimation

Abstract

Multi-camera depth estimation has recently attracted significant attention because of its practical importance for autonomous driving. In this paper, we study self-supervised multi-camera depth estimation and propose a framework, STViT, with two main enhancements: 1) We propose a Spatial-Temporal Transformer that exploits both the local connectivity and the global context of image features while learning enriched spatial-temporal cross-view correlations to recover 3D geometry. 2) To alleviate the severe effects of adverse conditions, e.g., rainy weather and nighttime driving, we introduce a GAN-based Adversarial Geometry Regularization (AGR) module that further constrains the depth estimation with unpaired normal-condition depth maps and prevents the model from being incorrectly trained. Experiments on the challenging autonomous driving datasets nuScenes and DDAD show that our method achieves state-of-the-art performance.
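The abstract gives no implementation details for the AGR module, so the following is only a minimal sketch of a generic GAN-style regularizer on depth maps, not the authors' method. It assumes a standard binary cross-entropy adversarial objective in which a discriminator tries to separate predicted depth maps from unpaired normal-condition reference depth maps. The names `toy_discriminator`, `adversarial_losses`, `w`, and `b` are hypothetical, and the linear discriminator on log-depth statistics is a stand-in for whatever network the paper actually uses.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(probs, target):
    # Binary cross-entropy between predicted probabilities and a constant label.
    eps = 1e-7
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def toy_discriminator(depth, w, b):
    # Hypothetical stand-in discriminator: a logistic model on two
    # per-image statistics (mean and std of log-depth).
    # depth has shape (N, H, W) and must be strictly positive.
    log_d = np.log(depth)
    feats = np.stack([log_d.mean(axis=(1, 2)), log_d.std(axis=(1, 2))], axis=1)
    return sigmoid(feats @ w + b)

def adversarial_losses(pred_depth, ref_depth, w, b):
    # pred_depth: depth maps predicted by the depth network ("fake").
    # ref_depth: unpaired normal-condition depth maps ("real").
    p_fake = toy_discriminator(pred_depth, w, b)
    p_real = toy_discriminator(ref_depth, w, b)
    # The discriminator is trained to label references 1 and predictions 0.
    d_loss = bce(p_real, 1.0) + bce(p_fake, 0.0)
    # The regularization term for the depth network rewards predictions
    # that the discriminator scores as reference-like.
    g_loss = bce(p_fake, 1.0)
    return d_loss, g_loss

rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 80.0, size=(4, 8, 16))  # synthetic predicted depths (metres)
ref = rng.uniform(1.0, 80.0, size=(6, 8, 16))   # synthetic unpaired reference depths
d_loss, g_loss = adversarial_losses(pred, ref, w=np.zeros(2), b=0.0)
```

In a self-supervised setup, a term like `g_loss` would typically be added, with some weight, to the photometric training loss; how the paper weights or schedules it is not stated in the abstract.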

Published

2024-03-24

How to Cite

Chen, Z., Zhao, H., Yuan, B., & Li, X. (2024). STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23460-23461. https://doi.org/10.1609/aaai.v38i21.30429