SimpleDiffusion: A Lightweight and Efficient Conditional Diffusion Model for Multi-Modal Salient Object Detection

Authors

  • Shuo Zhang, School of Electronic Information Engineering, Shanghai Dianji University
  • Jiaming Huang, Technology Center, Huolala
  • Wenbing Tang, College of Information Engineering, Northwest A&F University
  • Jing Liu, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
  • LI HAN, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University
  • Jiandun Li, School of Electronic Information Engineering, Shanghai Dianji University
  • Hongchun Yuan, College of Information Technology, Shanghai Ocean University
  • Zizhu Fan, College of Computer Science and Technology, Shanghai University of Electric Power

DOI

https://doi.org/10.1609/aaai.v40i15.38272

Abstract

Multi-modal salient object detection (MSOD), which integrates complementary modalities such as depth or thermal data, primarily faces two challenges: accurately preserving salient object details and effectively aligning cross-modal features. Recent advances in using Stable Diffusion to generate images with fine edge details have inspired researchers to reformulate MSOD as a conditional mask generation process guided by salient features, which has achieved excellent visual results. However, these approaches often overlook the high computational cost and large-scale architecture of Stable Diffusion, both of which render it unsuitable for real-world MSOD applications. Therefore, we propose SimpleDiffusion, the first lightweight and efficient conditional diffusion model for MSOD that does not rely on Stable Diffusion. Specifically, we propose an Adaptive Cross-Modal Fusion Conditional Network and a Latent Denoising Network to reduce the complexity of diffusion models. Furthermore, we design a Multi-modal Feature Rectification and Fusion Module to enhance the representational capacity of cross-modal salient features. Customized training and sampling strategies are also developed to improve inference efficiency and reduce erroneous object segmentations. Experiments on multiple MSOD datasets demonstrate that SimpleDiffusion reduces model size more than tenfold and improves inference speed more than fivefold compared to other diffusion-based methods, while maintaining comparable or superior performance.

Published

2026-03-14

How to Cite

Zhang, S., Huang, J., Tang, W., Liu, J., HAN, L., Li, J., … Fan, Z. (2026). SimpleDiffusion: A Lightweight and Efficient Conditional Diffusion Model for Multi-Modal Salient Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12753–12761. https://doi.org/10.1609/aaai.v40i15.38272

Section

AAAI Technical Track on Computer Vision XII