DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection

Shuo Zhang; Jiaming Huang; Wenbing Tang; Yan Wu; Terrence Hu; Xiaogang Xu; Jing Liu

doi:10.1609/aaai.v39i10.33096

Authors

Shuo Zhang, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
Jiaming Huang, Technology Center, Huolala
Wenbing Tang, College of Computing and Data Science, Nanyang Technological University
Yan Wu, Technology Center, Huolala
Terrence Hu, Technology Center, Huolala
Xiaogang Xu, The Chinese University of Hong Kong
Jing Liu, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i10.33096

Abstract

Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to integrate multi-modal information effectively, often resulting in poorly defined object edges or overconfident but inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task using diffusion models. We introduce DiMSOD, which enables the concurrent use of local controls (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it requires fine-tuning only our newly introduced modules on top of an existing Stable Diffusion model, which reduces the fine-tuning cost, making it more viable for practical use, and also enhances the integration of multi-modal conditional controls. Specifically, we have developed modules, including SOD-ControlNet, Feature Adaptive Network (FAN), and Feature Injection Attention Network (FIAN), to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
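
The idea of treating SOD as conditional mask generation with progressive denoising can be illustrated with a small NumPy sketch. This is not the DiMSOD architecture: Stable Diffusion, SOD-ControlNet, FAN, and FIAN are not modeled here. The function names (`make_alpha_bars`, `toy_predict_noise`, `sample_mask`), the noise schedule, and the thresholding rule are all hypothetical choices made for illustration. The stand-in denoiser simply guesses a mask from the global condition (an RGB image) and the local condition (a depth or thermal map), and a deterministic DDIM-style loop refines pure noise toward that guess.

```python
import numpy as np


def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Cumulative products of (1 - beta_t) for a toy linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_predict_noise(x_t, alpha_bar_t, global_cond, local_cond):
    """Hypothetical stand-in for a trained conditional denoiser.

    It guesses a clean mask in {-1, +1} from the conditions, then returns
    the noise that would turn that mask into x_t under the forward process
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    gray = global_cond.mean(axis=-1)        # global control: RGB image
    score = 0.5 * gray + 0.5 * local_cond   # local control: depth/thermal map
    x0_guess = np.where(score > score.mean(), 1.0, -1.0)
    return (x_t - np.sqrt(alpha_bar_t) * x0_guess) / np.sqrt(1.0 - alpha_bar_t)


def sample_mask(global_cond, local_cond, num_steps=50, seed=0):
    """Deterministic DDIM-style reverse loop from Gaussian noise to a mask."""
    rng = np.random.default_rng(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = rng.standard_normal(local_cond.shape)  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = toy_predict_noise(x, a_t, global_cond, local_cond)
        # Estimate the clean mask, then step to the next (less noisy) level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return (x > 0).astype(np.uint8)  # binarize the denoised mask
```

In a real system, `toy_predict_noise` would be replaced by a trained network that receives the noisy mask, the timestep, and the conditioning signals; the sampling loop is the only part of this sketch that would carry over.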
