EchoDiffusion: Waveform Conditioned Diffusion Models for Echo-Based Depth Estimation

Authors

  • Wenjie Zhang: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China
  • Jun Yin: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
  • Long Ma: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
  • Peng Yu: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
  • Xiaoheng Jiang: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China
  • Zhen Tian: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China
  • Mingliang Xu: School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China

DOI

https://doi.org/10.1609/aaai.v39i21.34416

Abstract

Conventional echo-based depth estimation methods typically extract spatial information with encoder-decoder models such as UNet. However, these methods can struggle to recover fine details from echo waveforms and to extract multi-scale features with high precision. To address these challenges, we introduce EchoDiffusion, a framework that applies diffusion models conditioned on waveform embeddings to echo-based depth estimation. The framework uses the Multi-Scale Adaptive Latent Feature Network (MALF-Net) to extract multi-scale spatial features, fuse them adaptively, and encode the echo spectrograms into the latent space. We also propose the Echo Waveform Detail Embedder (EWDE), which leverages a pre-trained Wav2Vec model to extract detailed spatial information from echo waveforms and supplies it as the conditional input that guides the reverse diffusion process in the latent space. Embedding the echo waveforms into the reverse diffusion process in this way guides the generation of depth maps more accurately. Extensive evaluations on the Replica and Matterport3D datasets show that EchoDiffusion achieves state-of-the-art performance in echo-based depth estimation.
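
The idea of guiding reverse diffusion in a latent space with a waveform embedding can be illustrated with a small, generic sketch. It is not the authors' implementation: it uses a standard DDPM-style ancestral sampling update, and every name in it (`toy_noise_predictor`, `sample_latent`, the linear beta schedule, the embedding and latent sizes) is an assumption made for illustration. In particular, the noise predictor is a hypothetical linear stand-in for a trained network, and the conditioning vector merely plays the role that a waveform embedding would play.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the per-step and cumulative alpha terms."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(z_t, t, cond, weights):
    """Hypothetical stand-in for a trained denoiser eps(z_t, t, cond).

    A real model would be a neural network that also uses the timestep t;
    this toy version ignores t and only mixes the latent with a linear
    projection of the conditioning vector.
    """
    return 0.1 * z_t + cond @ weights


def reverse_step(z_t, t, cond, weights, schedule, rng):
    """One DDPM-style reverse step on the latent, conditioned on `cond`."""
    betas, alphas, alpha_bars = schedule
    eps_hat = toy_noise_predictor(z_t, t, cond, weights)
    # Posterior mean: remove the predicted noise, then rescale.
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (z_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)


def sample_latent(cond, weights, latent_dim, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step under `cond`."""
    rng = np.random.default_rng(seed)
    schedule = make_schedule(num_steps)
    z = rng.standard_normal(latent_dim)
    for t in reversed(range(num_steps)):
        z = reverse_step(z, t, cond, weights, schedule, rng)
    return z


if __name__ == "__main__":
    embedding = np.ones(8)           # placeholder for a waveform embedding
    proj = np.full((8, 4), 0.05)     # placeholder projection weights
    print(sample_latent(embedding, proj, latent_dim=4))
```

In this sketch the conditioning vector enters every reverse step through the noise prediction, so two different embeddings started from the same initial noise end at different latents. In a full pipeline, a decoder would then map the final latent to a depth map; that stage is omitted here.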

Published

2025-04-11

How to Cite

Zhang, W., Yin, J., Ma, L., Yu, P., Jiang, X., Tian, Z., & Xu, M. (2025). EchoDiffusion: Waveform Conditioned Diffusion Models for Echo-Based Depth Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22578–22586. https://doi.org/10.1609/aaai.v39i21.34416

Section

AAAI Technical Track on Machine Learning VII