Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling

Yixin Wu; Feiran Zhang; Tianyuan Shi; Ruicheng Yin; Zhenghua Wang; Zhenliang Gan; Xiaohua Wang; Changze Lv; Xiaoqing Zheng; Xuanjing Huang

doi:10.1609/aaai.v40i13.38060

Authors

Yixin Wu: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Feiran Zhang: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Tianyuan Shi: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Ruicheng Yin: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Zhenghua Wang: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Zhenliang Gan: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Xiaohua Wang: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Changze Lv: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Xiaoqing Zheng: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Xuanjing Huang: College of Computer Science and Artificial Intelligence, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing; IEIT System Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v40i13.38060

Abstract

Recent advances in diffusion models have enabled the creation of deceptively realistic images, which pose significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal different subtle distinctions between synthetic and real images that can be extracted for detection, such as high-frequency discrepancies in the Fourier power spectrum and inter-pixel variance distributions. Based on these observations, we propose a novel detection method named ESIDE, which directly uses features of intermediately noised images by training an ensemble on multiple noised timesteps, avoiding the time overhead of conventional reconstruction-based strategies. To aid human comprehension, we introduce a metric-grounded explanation refinement module that identifies and explains AI-generated flaws. Additionally, we present the benchmarks GenHard and GenExplain, which offer detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that ESIDE achieves state-of-the-art performance, with 98.91% and 95.89% detection accuracy on regular and challenging samples, respectively, and demonstrates generalizability and robustness.
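
The kind of signal the abstract mentions can be illustrated with a small NumPy sketch. It is not the ESIDE implementation: DDIM inversion requires a trained diffusion model, so the sketch substitutes plain forward Gaussian noising at a chosen noise level, and it assumes a simple mean over per-timestep probabilities as the ensemble rule, which the abstract does not specify. The names `high_freq_ratio`, `noised`, and `ensemble_score`, and the `cutoff` parameter, are hypothetical.

```python
import numpy as np


def high_freq_ratio(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of Fourier power above a normalized radial frequency.

    `image` is a 2-D grayscale array; `cutoff` is in cycles per pixel
    (an illustrative choice, not a value from the paper).
    """
    power = np.abs(np.fft.fft2(image)) ** 2
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies in [-0.5, 0.5)
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(power[radius > cutoff].sum() / power.sum())


def noised(image: np.ndarray, alpha_bar: float, rng: np.random.Generator) -> np.ndarray:
    """Forward-diffusion style noising, used here as a stand-in for DDIM inversion.

    Returns sqrt(alpha_bar) * image + sqrt(1 - alpha_bar) * eps with eps ~ N(0, I).
    """
    eps = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps


def ensemble_score(per_timestep_probs) -> float:
    """Assumed aggregation rule: average the 'synthetic' probability over timesteps."""
    return float(np.mean(per_timestep_probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A smooth low-frequency test pattern standing in for an image.
    ramp = np.sin(2 * np.pi * np.arange(64) / 64)
    image = np.outer(ramp, ramp)
    for alpha_bar in (1.0, 0.9, 0.5):
        ratio = high_freq_ratio(noised(image, alpha_bar, rng))
        print(f"alpha_bar={alpha_bar}: high-frequency power ratio = {ratio:.4f}")
```

In a detector along these lines, one classifier per noise level would consume features such as this ratio (or the noised image itself), and `ensemble_score` would combine their outputs.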
