DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning

Guojin Zhong; Jinhong Hu; Jiajun Chen; Jin Yuan; Wenbo Pan

doi:10.1609/aaai.v39i10.33158

Authors

Guojin Zhong, College of Computer Science and Electronic Engineering, Hunan University
Jinhong Hu, College of Computer Science and Electronic Engineering, Hunan University; GuangDong Engineering Technology Research Center of Intelligent Service of Urban and Rural Planning and Construction
Jiajun Chen, School of Robotics, Hunan University
Jin Yuan, Hunan University
Wenbo Pan, CRRC Zhuzhou Institute

DOI:

https://doi.org/10.1609/aaai.v39i10.33158

Abstract

Image change captioning (ICC) is challenging because it requires describing subtle differences between two similar images in natural language, which makes feature extraction and cross-modal learning considerably more complex than in standard image captioning. Existing ICC methods often suffer from two key problems: 1) massive irrelevant information in uni-image features leads to suboptimal visual difference representations; 2) imprecise inter-modality correspondence degrades the quality of generated captions. Motivated by the excellent performance of diffusion models in image and text generation, this paper proposes a Difference-aware Contrastive Diffusion Model with Adversarial Perturbations (DECIDER) for ICC. Technically, difference-aware cross-modal learning is developed to suppress irrelevant information and learn compact yet robust visual difference representations. This is achieved by optimizing a novel objective, mathematically derived from the information bottleneck principle, that excels at filtering redundant features and highlighting differences. Furthermore, we propose to dynamically generate "hard" positive and negative samples via adversarial perturbations, which are used in contrastive diffusion training with a tighter variational bound. This design encourages DECIDER to discover and construct complex correspondences between visual differences and captions, thereby improving generalization performance. Extensive experiments on four datasets demonstrate that DECIDER significantly exceeds state-of-the-art performance.
