DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection

Authors

  • Shuo Zhang, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
  • Jiaming Huang, Technology Center, Huolala
  • Wenbing Tang, College of Computing and Data Science, Nanyang Technological University
  • Yan Wu, Technology Center, Huolala
  • Terrence Hu, Technology Center, Huolala
  • Xiaogang Xu, The Chinese University of Hong Kong
  • Jing Liu, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i10.33096

Abstract

Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to integrate multi-modal information effectively, often resulting in poorly defined object edges or overconfident, inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task utilizing diffusion models. We introduce DiMSOD, which enables the concurrent use of local controls (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it only requires fine-tuning our newly introduced modules on top of an existing Stable Diffusion model, which not only reduces the fine-tuning cost, making it more viable for practical use, but also enhances the integration of multi-modal conditional controls. Specifically, we have developed modules including SOD-ControlNet, Feature Adaptive Network (FAN), and Feature Injection Attention Network (FIAN) to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
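
As a rough illustration of what "conditional mask generation" by progressive denoising can look like, the following is a minimal toy sketch of a DDPM-style reverse sampling loop. It is not the DiMSOD implementation: it does not model Stable Diffusion, SOD-ControlNet, FAN, or FIAN. The function names, the short linear noise schedule, and the single per-pixel conditioning signal `cond` are all assumptions made for illustration, and `toy_noise_predictor` is a hypothetical stand-in for a learned conditional denoiser.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products of alpha = 1 - beta."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def toy_noise_predictor(x_t, t, cond, alpha_bars):
    """Hypothetical stand-in for a learned conditional denoiser.

    It guesses the clean mask from the condition alone (+1 where
    cond > 0.5, else -1) and returns the noise that would explain
    x_t under that guess. A real model would be a trained network.
    """
    a_bar = alpha_bars[t]
    x0_guess = [1.0 if c > 0.5 else -1.0 for c in cond]
    return [
        (x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
        for x, x0 in zip(x_t, x0_guess)
    ]


def sample_mask(cond, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step under `cond`."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, cond, alpha_bars)
        alpha = 1.0 - betas[t]
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of the standard DDPM reverse step.
        mean = [(xi - coef * e) / math.sqrt(alpha) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    # Threshold the denoised values into a binary saliency mask.
    return [1 if xi > 0 else 0 for xi in x]


# Toy conditioning signal: a per-pixel saliency cue in [0, 1] (assumed).
mask = sample_mask([0.9, 0.1, 0.7, 0.2])
```

In this sketch the condition fully determines the predictor's guess, so the loop only shows the mechanics of iterative denoising under a control signal. In a framework like the one described above, the denoiser would instead be a fine-tuned network receiving both the image and the depth or thermal map.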

Published

2025-04-11

How to Cite

Zhang, S., Huang, J., Tang, W., Wu, Y., Hu, T., Xu, X., & Liu, J. (2025). DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10103–10111. https://doi.org/10.1609/aaai.v39i10.33096

Section

AAAI Technical Track on Computer Vision IX