D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation
DOI:
https://doi.org/10.1609/aaai.v40i12.37970
Abstract
Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although Vision Transformer (ViT) backbones for dense prediction are fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D³-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D³-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40× speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
Published
2026-03-14
How to Cite
Wang, R., Li, W., Feng, Z., Zhang, H., Song, M., Wang, J., … Sun, L. (2026). D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10038–10046. https://doi.org/10.1609/aaai.v40i12.37970
Issue
Section
AAAI Technical Track on Computer Vision IX