D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Authors

  • Ruizhi Wang, School of Software Technology, Zhejiang University
  • Weihan Li, School of Software Technology, Zhejiang University
  • Zunlei Feng, School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University
  • Haofei Zhang, State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Mingli Song, School of Software Technology, Zhejiang University; State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Jiayu Wang, College of Computer Science and Technology, Zhejiang University
  • Jie Song, School of Software Technology, Zhejiang University
  • Li Sun, Ningbo Global Innovation Center, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v40i12.37970

Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Vision Transformer (ViT) backbones make dense prediction fast, but their outputs often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D³-RSMDE), an efficient framework designed to balance speed and quality. Our framework first uses a ViT-based module to rapidly generate a high-quality preliminary depth map. This map serves as a structural prior and effectively replaces the time-consuming initial structure generation stage of diffusion models. Building on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine details in only a few iterations. The entire refinement step operates efficiently in a compact latent space provided by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D³-RSMDE achieves an 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) metric relative to leading models such as Marigold, while also delivering a more than 40× inference speedup and keeping VRAM usage comparable to that of lightweight ViT models.
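
The abstract does not spell out how PLBR blends the prior with the refined output, so the following is only a toy illustration of the general idea of refining a coarse latent over a few linearly blended steps. It is not the paper's algorithm. The linear weight schedule, the function names `blend_weights` and `progressive_linear_blend`, and the `refine_fn` stand-in for the lightweight U-Net are all assumptions made for this sketch.

```python
import numpy as np


def blend_weights(num_steps):
    """Hypothetical schedule: weights rise linearly from 1/num_steps to 1.0."""
    return np.linspace(1.0 / num_steps, 1.0, num_steps)


def progressive_linear_blend(prior_latent, refine_fn, num_steps=4):
    """Refine a coarse latent in a few steps.

    prior_latent: array standing in for the VAE-encoded ViT depth prior.
    refine_fn:    callable standing in for the lightweight U-Net; it maps
                  the current latent to a proposed refined latent.
    """
    latent = np.array(prior_latent, dtype=np.float64)
    for w in blend_weights(num_steps):
        proposal = refine_fn(latent)
        # Linear blend: early steps stay close to the prior,
        # the last step (w == 1.0) adopts the proposal fully.
        latent = (1.0 - w) * latent + w * proposal
    return latent


if __name__ == "__main__":
    prior = np.zeros((4, 8, 8))      # toy latent of the coarse depth map
    target = np.ones((4, 8, 8))      # toy "detailed" latent
    refined = progressive_linear_blend(prior, lambda z: target, num_steps=4)
    print(np.allclose(refined, target))
```

In this toy setup the refinement cost is fixed by `num_steps` calls to `refine_fn`, which mirrors the abstract's point that starting from a structural prior lets refinement finish in only a few iterations.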

Published

2026-03-14

How to Cite

Wang, R., Li, W., Feng, Z., Zhang, H., Song, M., Wang, J., … Sun, L. (2026). D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10038–10046. https://doi.org/10.1609/aaai.v40i12.37970

Section

AAAI Technical Track on Computer Vision IX