SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation

Authors

  • Youhong Wang, Northwestern Polytechnical University; Bytedance
  • Yunji Liang, Northwestern Polytechnical University
  • Hao Xu, Bytedance
  • Shaohui Jiao, Bytedance
  • Hongkai Yu, Cleveland State University

DOI:

https://doi.org/10.1609/aaai.v38i6.28383

Keywords:

CV: Vision for Robotics & Autonomous Driving, CV: 3D Computer Vision

Abstract

Self-supervised monocular depth estimation has recently gained popularity, with numerous applications in autonomous driving and robotics. However, existing solutions primarily estimate depth from immediate visual features and struggle to recover fine-grained scene details. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structure priors from ego-motion. In SQLdepth, we propose a novel Self Query Layer (SQL) that builds a self-cost volume and infers depth from it, rather than inferring depth from feature maps. We show that the self-cost volume is an effective inductive bias for geometry learning: it implicitly models the single-frame scene geometry, with each slice indicating a relative distance map between points and objects in a latent space. Experimental results on KITTI and Cityscapes show that our method attains state-of-the-art performance while offering computational efficiency, reduced training complexity, and the ability to recover fine-grained scene details. Moreover, the self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code is available at https://github.com/hisfog/SfMNeXt-Impl.
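
The idea of a self-cost volume can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it only assumes that each slice of the volume is a similarity map between one query vector and every per-pixel feature of the same image. The function names `patch_queries` and `self_cost_volume`, the use of average-pooled patches as queries, the dot-product similarity, and all tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np


def patch_queries(features, patch):
    """Hypothetical query construction: average-pool non-overlapping patches.

    features: array of shape (C, H, W), per-pixel features of a single frame.
    Returns an array of shape (N, C), one query per patch.
    """
    c, h, w = features.shape
    assert h % patch == 0 and w % patch == 0, "patch must divide H and W"
    pooled = features.reshape(c, h // patch, patch, w // patch, patch)
    pooled = pooled.mean(axis=(2, 4))          # (C, H/patch, W/patch)
    return pooled.reshape(c, -1).T             # (N, C)


def self_cost_volume(features, queries):
    """Compare every query with every pixel feature of the same image.

    Assumes dot-product similarity. Slice n of the result, shape (H, W),
    is a relative "distance" map between query n and all pixels.
    """
    return np.einsum("nc,chw->nhw", queries, features)


# Toy input: 8 feature channels on a 4 x 6 grid (random stand-in values).
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 6))
queries = patch_queries(features, patch=2)     # 2 x 3 patches, so 6 queries
volume = self_cost_volume(features, queries)
print(volume.shape)                            # (6, 4, 6)
```

In this sketch the volume is built from a single frame compared against itself, which is what distinguishes it from a multi-frame cost volume. How depth is then inferred from the volume is not specified in the abstract and is left out here.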

Published

2024-03-24

How to Cite

Wang, Y., Liang, Y., Xu, H., Jiao, S., & Yu, H. (2024). SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5713-5721. https://doi.org/10.1609/aaai.v38i6.28383

Issue

Vol. 38 No. 6 (2024)

Section

AAAI Technical Track on Computer Vision V