Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling

Authors

  • Yixin Wu, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Feiran Zhang, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Tianyuan Shi, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Ruicheng Yin, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Zhenghua Wang, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Zhenliang Gan, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Xiaohua Wang, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Changze Lv, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Xiaoqing Zheng, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
  • Xuanjing Huang, College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing; IEIT System Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v40i13.38060

Abstract

Recent advances in diffusion models have enabled the creation of deceptively realistic images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal different subtle distinctions between synthetic and real images that can be extracted for detection, such as high-frequency discrepancies in the Fourier power spectrum and differences in inter-pixel variance distributions. Based on these observations, we propose a novel detection method named ESIDE that directly uses features of intermediately noised images by training an ensemble on multiple noised timesteps, avoiding the time overhead of conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation refinement module to identify and explain AI-generated flaws. Additionally, we present the benchmarks GenHard and GenExplain, which offer detection samples of greater difficulty and high-quality rationales for fake images, respectively. Extensive experiments show that ESIDE achieves state-of-the-art performance, with 98.91% and 95.89% detection accuracy on regular and challenging samples, respectively, and demonstrates generalizability and robustness.
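As a rough illustration of the two ingredients named in the abstract, the sketch below shows one deterministic DDIM inversion update and one way to combine per-timestep detector scores. It is not the ESIDE implementation. The inversion step uses the standard DDIM update with eta = 0, the noise estimate `eps` is passed in as a plain list standing in for the output of a trained noise-prediction network, and the soft-voting rule in `soft_vote` is an assumption, since the abstract does not say how the ensemble members are combined. All function and parameter names are hypothetical.

```python
import math


def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step (eta = 0) on a flat list of pixel values.

    x_t            -- current (partially noised) image, flattened
    eps            -- noise estimate for x_t; a stand-in for a trained
                      noise-prediction network (assumption for this sketch)
    alpha_bar_t    -- cumulative alpha at the current timestep, in (0, 1]
    alpha_bar_next -- cumulative alpha at the next, noisier timestep
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_1m_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_next = math.sqrt(alpha_bar_next)
    sqrt_1m_next = math.sqrt(1.0 - alpha_bar_next)

    x_next = []
    for x, e in zip(x_t, eps):
        # Estimate the clean pixel implied by x_t and the noise estimate.
        x0_pred = (x - sqrt_1m_t * e) / sqrt_a_t
        # Re-noise that estimate to the next timestep with the same noise.
        x_next.append(sqrt_a_next * x0_pred + sqrt_1m_next * e)
    return x_next


def soft_vote(probs_per_timestep, threshold=0.5):
    """Average per-timestep 'synthetic' probabilities and threshold the mean.

    Soft voting is an assumed combination rule, chosen for illustration.
    Returns (mean probability, is_synthetic).
    """
    mean_prob = sum(probs_per_timestep) / len(probs_per_timestep)
    return mean_prob, mean_prob >= threshold


# Usage: noise a tiny 2-pixel "image" one step further, then combine the
# scores of three hypothetical per-timestep detectors.
noisier = ddim_inversion_step([0.3, -0.7], [0.1, 0.2], 0.9, 0.5)
score, is_synthetic = soft_vote([0.9, 0.8, 0.4])
```

In a real pipeline each ensemble member would be a classifier trained on images inverted to one specific timestep, and `eps` would come from the diffusion model at every step rather than being fixed.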

Published

2026-03-14

How to Cite

Wu, Y., Zhang, F., Shi, T., Yin, R., Wang, Z., Gan, Z., … Huang, X. (2026). Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10844–10852. https://doi.org/10.1609/aaai.v40i13.38060

Section

AAAI Technical Track on Computer Vision X