Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling
DOI:
https://doi.org/10.1609/aaai.v40i13.38060
Abstract
Recent advances in diffusion models have enabled the creation of deceptively realistic images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that can be extracted for detection, such as high-frequency discrepancies in the Fourier power spectrum and differences in inter-pixel variance distributions. Based on these observations, we propose a novel detection method named ESIDE that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing the time overhead of conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation refinement module to identify and explain AI-generated flaws. Additionally, we present the benchmarks GenHard and GenExplain, offering detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that ESIDE achieves state-of-the-art performance, with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness.
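As a rough illustration of the kind of Fourier power spectrum feature the abstract mentions, the following minimal sketch computes the share of spectral power at high frequencies for an image at several noise levels. It is not the authors' ESIDE implementation: the function names, the 0.5 cutoff, and the noise levels are hypothetical, and the closed-form forward diffusion noising used here is a simple stand-in for DDIM inversion, which would require a trained diffusion model.

```python
import numpy as np


def radial_power_spectrum(img: np.ndarray) -> np.ndarray:
    """Azimuthally averaged Fourier power spectrum of a 2-D grayscale image."""
    h, w = img.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    yy, xx = np.indices((h, w))
    # Integer distance of each frequency bin from the zero-frequency (DC) bin.
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)


def high_freq_ratio(img: np.ndarray, cutoff: float = 0.5) -> float:
    """Share of the radial profile lying above `cutoff` (hypothetical threshold)."""
    profile = radial_power_spectrum(img)
    # Drop the DC term and keep only radii that fit fully inside the spectrum.
    profile = profile[1 : min(img.shape) // 2]
    k = int(cutoff * len(profile))
    return float(profile[k:].sum() / profile.sum())


def add_noise(img: np.ndarray, alpha_bar: float, rng: np.random.Generator) -> np.ndarray:
    """Forward diffusion noising; an assumed stand-in for DDIM inversion."""
    eps = rng.standard_normal(img.shape)
    return np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * eps


def timestep_features(img: np.ndarray, alpha_bars=(0.99, 0.9, 0.5), seed: int = 0) -> list:
    """One high-frequency feature per noise level (illustrative values only)."""
    rng = np.random.default_rng(seed)
    return [high_freq_ratio(add_noise(img, ab, rng)) for ab in alpha_bars]


# A smooth synthetic test pattern, used only to exercise the functions.
x = np.arange(64)
smooth = np.sin(2 * np.pi * 2 * x / 64)[None, :] + np.cos(2 * np.pi * 3 * x / 64)[:, None]
features = timestep_features(smooth)
```

In an ensemble setting along the lines the abstract describes, each noise level would feed its own classifier; here the per-level values are simply collected in a list to show how a feature of this kind changes as more noise is added.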
Published
2026-03-14
How to Cite
Wu, Y., Zhang, F., Shi, T., Yin, R., Wang, Z., Gan, Z., … Huang, X. (2026). Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10844–10852. https://doi.org/10.1609/aaai.v40i13.38060
Issue
Section
AAAI Technical Track on Computer Vision X