DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning
DOI:
https://doi.org/10.1609/aaai.v39i10.33158
Abstract
Image change captioning (ICC) is challenging because it requires describing subtle differences between two similar images in natural language, which makes feature extraction and cross-modal learning considerably more complex than in standard image captioning. Existing ICC methods often suffer from two key problems: 1) the large amount of irrelevant information in uni-image features leads to suboptimal visual difference representations; 2) imprecise inter-modality correspondence degrades the quality of generated captions. Motivated by the excellent performance of diffusion models in image and text generation, this paper proposes a Difference-aware Contrastive Diffusion Model with Adversarial Perturbations (DECIDER) for ICC. Technically, difference-aware cross-modal learning is developed to suppress irrelevant information and learn compact yet robust visual difference representations. This is achieved by optimizing a novel objective, mathematically derived from the information bottleneck principle, that excels at filtering redundant features and highlighting differences. Furthermore, we propose to dynamically generate "hard" positive and negative samples via adversarial perturbations, which are used in contrastive diffusion training with a tighter variational bound. This design encourages DECIDER to discover and construct complex correspondences between visual differences and captions, thereby improving generalization performance. Extensive experiments on four datasets demonstrate that DECIDER significantly outperforms state-of-the-art methods.
Published
2025-04-11
How to Cite
Zhong, G., Hu, J., Chen, J., Yuan, J., & Pan, W. (2025). DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10662–10670. https://doi.org/10.1609/aaai.v39i10.33158
Issue
Section
AAAI Technical Track on Computer Vision IX