D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Ruizhi Wang; Weihan Li; Zunlei Feng; Haofei Zhang; Mingli Song; Jiayu Wang; Jie Song; Li Sun

doi:10.1609/aaai.v40i12.37970

Authors

Ruizhi Wang: School of Software Technology, Zhejiang University
Weihan Li: School of Software Technology, Zhejiang University
Zunlei Feng: School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University
Haofei Zhang: State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Mingli Song: School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Jiayu Wang: College of Computer Science and Technology, Zhejiang University
Jie Song: School of Software Technology, Zhejiang University
Li Sun: Ningbo Global Innovation Center, Zhejiang University

Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Dense-prediction models built on Vision Transformer (ViT) backbones are fast, but their depth maps often have poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D³-RSMDE), an efficient framework designed to balance speed and quality. Our framework first uses a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior and replaces the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates in a compact latent space provided by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D³-RSMDE achieves an 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric relative to leading models such as Marigold, while also achieving a speedup of more than 40× in inference and keeping VRAM usage comparable to that of lightweight ViT models.
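The abstract does not spell out how PLBR works, so the following is only a rough illustration of the general idea of refining a coarse prior over a few steps with a linear blend. It is not the authors' algorithm. The names `linear_blend`, `progressive_refine`, and `toy_refiner`, the blend schedule `alpha = k / num_steps`, and the use of plain Python lists in place of VAE latents are all assumptions made for this sketch; the toy refiner is a stand-in for the lightweight U-Net.

```python
from typing import Callable, List

Latent = List[float]  # hypothetical stand-in for a flattened VAE latent


def linear_blend(a: Latent, b: Latent, alpha: float) -> Latent:
    """Return (1 - alpha) * a + alpha * b, element by element."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def progressive_refine(
    coarse: Latent,
    refiner: Callable[[Latent], Latent],
    num_steps: int = 4,
) -> Latent:
    """Refine a coarse prior in a few steps (illustrative schedule only).

    At step k the refiner proposes an improved latent, and the result is a
    linear blend between the coarse prior and that proposal. The blend
    weight grows linearly, so the last step returns the proposal itself.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    current = list(coarse)
    for k in range(1, num_steps + 1):
        alpha = k / num_steps          # assumed linear schedule
        proposal = refiner(current)    # stand-in for the U-Net call
        current = linear_blend(coarse, proposal, alpha)
    return current


# Toy example: the "refiner" nudges its input halfway toward a known target.
target: Latent = [1.0, 0.0]


def toy_refiner(latent: Latent) -> Latent:
    return [v + 0.5 * (t - v) for v, t in zip(latent, target)]


def l1_error(a: Latent, b: Latent) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


coarse_prior: Latent = [0.0, 1.0]
refined = progressive_refine(coarse_prior, toy_refiner, num_steps=4)
```

In this toy setting the refined latent ends up closer to the target than the coarse prior does, using only four refiner calls. In the framework described above, the coarse prior would come from the ViT-based module and the result would be decoded back to a depth map by the VAE.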
