DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning

Authors

  • Guojin Zhong, College of Computer Science and Electronic Engineering, Hunan University
  • Jinhong Hu, College of Computer Science and Electronic Engineering, Hunan University; GuangDong Engineering Technology Research Center of Intelligent Service of Urban and Rural Planning and Construction
  • Jiajun Chen, School of Robotics, Hunan University
  • Jin Yuan, Hunan University
  • Wenbo Pan, CRRC Zhuzhou Institute

DOI:

https://doi.org/10.1609/aaai.v39i10.33158

Abstract

Image change captioning (ICC) requires describing subtle differences between two similar images in natural language, which makes feature extraction and cross-modal learning considerably harder than in standard image captioning. Existing ICC methods face two key challenges: 1) the large amount of irrelevant information in single-image features leads to suboptimal visual difference representations; 2) imprecise inter-modality correspondence degrades the quality of generated captions. Motivated by the excellent performance of diffusion models in image and text generation, this paper proposes a Difference-aware Contrastive Diffusion Model with Adversarial Perturbations (DECIDER) for ICC. Technically, difference-aware cross-modal learning is developed to suppress irrelevant information and learn compact yet robust visual difference representations. This is achieved by optimizing a novel objective, mathematically derived from the information bottleneck principle, that filters redundant features and highlights differences. Furthermore, we propose to dynamically generate "hard" positive and negative samples via adversarial perturbations and to use them in contrastive diffusion training with a tighter variational bound. This design encourages DECIDER to discover and construct complex correspondences between visual differences and captions, thereby improving generalization. Extensive experiments on four datasets demonstrate that DECIDER significantly outperforms state-of-the-art methods.

Published

2025-04-11

How to Cite

Zhong, G., Hu, J., Chen, J., Yuan, J., & Pan, W. (2025). DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10662–10670. https://doi.org/10.1609/aaai.v39i10.33158

Section

AAAI Technical Track on Computer Vision IX