SimpleDiffusion: A Lightweight and Efficient Conditional Diffusion Model for Multi-Modal Salient Object Detection

Shuo Zhang; Jiaming Huang; Wenbing Tang; Jing Liu; LI HAN; Jiandun Li; Hongchun Yuan; Zizhu Fan

doi:10.1609/aaai.v40i15.38272

Authors

Shuo Zhang, School of Electronic Information Engineering, Shanghai Dianji University
Jiaming Huang, Technology Center, Huolala
Wenbing Tang, College of Information Engineering, Northwest A&F University
Jing Liu, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
LI HAN, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
Jiandun Li, School of Electronic Information Engineering, Shanghai Dianji University
Hongchun Yuan, College of Information Technology, Shanghai Ocean University
Zizhu Fan, College of Computer Science and Technology, Shanghai University of Electric Power

Abstract

Multi-modal salient object detection (MSOD), which integrates complementary modalities such as depth or thermal data, primarily faces two challenges: accurately preserving salient object details and effectively aligning cross-modal features. Recent advances in using Stable Diffusion to generate images with fine edge details have inspired researchers to reformulate MSOD as a conditional mask generation process guided by salient features, which has achieved excellent visual results. However, these approaches often overlook the high computational cost and large-scale architecture of Stable Diffusion, both of which render it unsuitable for real-world MSOD applications. Therefore, we propose SimpleDiffusion, the first lightweight and efficient conditional diffusion model for MSOD that does not rely on Stable Diffusion. Specifically, we propose an Adaptive Cross-Modal Fusion Conditional Network and a Latent Denoising Network to reduce the complexity of diffusion models. Furthermore, we design a Multi-modal Feature Rectification and Fusion Module to enhance the representational capacity of cross-modal salient features. Customized training and sampling strategies are also developed to improve inference efficiency and reduce erroneous object segmentations. Experiments on multiple MSOD datasets demonstrate that SimpleDiffusion reduces model size more than tenfold and improves inference speed more than fivefold compared to other diffusion-based methods, while maintaining comparable or superior performance.
