Collaborative Transformers with Multi-Level Forensic Attention for Image Manipulation Localization

Authors

  • Jiwei Zhang, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, China; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism (BUPT), Beijing, China
  • Wenbo Feng, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, China
  • Siwei Wang, The Intelligent Game and Decision Lab, Academy of Military Sciences, Beijing, China
  • Feifei Kou, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, China
  • Haoyang Yu, China Mobile Internet Co., Ltd, Guangzhou, China
  • Shaozhang Niu, School of Computer Science (National Pilot School of Software Engineering), BUPT, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i15.38250

Abstract

The proliferation of tampered images on social media can pose serious societal risks, influencing public opinion and causing panic. Image Manipulation Localization techniques have advanced to address this, but some methods focus on microscopic traces and overlook the macroscopic semantics that deceive viewers. To address this problem, we propose a novel Image Manipulation Localization framework called Collaborative Transformers (Co-Transformers), designed to fully explore and utilize the collaborative information between macroscopic semantics and microscopic traces. The framework is based on two Vision Transformer variants: the first captures the semantic logic of the image, and the second delves into microscopic tampering traces. By dynamically fusing these two complementary features, the framework enables interaction between macroscopic semantic inconsistencies and microscopic abnormal traces, effectively coordinating their relationship in the latent space. Furthermore, we introduce a new Multi-Level Forensic Attention (MLF-Attention) mechanism, which can be integrated into our framework, to enhance the model's ability to extract various tampering traces. Compared with existing methods, our proposed framework achieves state-of-the-art results in localization accuracy and shows good robustness against various attacks.
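
The abstract does not specify how the two feature streams are fused. As a rough illustration only, the sketch below shows one common way to fuse two complementary per-token features dynamically: a learned scalar gate per token that mixes a "semantic" feature with a "trace" feature. The function name gated_fusion, the weight vectors w_sem and w_trace, and the sigmoid gate itself are assumptions made for this example; they are not taken from the paper and should not be read as the Co-Transformers fusion module.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, trace, w_sem, w_trace, bias=0.0):
    """Fuse two per-token feature sequences with one scalar gate per token.

    Hypothetical illustration, not the paper's method.
    semantic, trace: lists of tokens, each token a list of floats (same shape).
    w_sem, w_trace:  weight vectors used to score each token for the gate.
    Returns a list of fused tokens: g * semantic + (1 - g) * trace.
    """
    if len(semantic) != len(trace):
        raise ValueError("both streams must have the same number of tokens")
    fused = []
    for s_tok, t_tok in zip(semantic, trace):
        if len(s_tok) != len(t_tok):
            raise ValueError("tokens must have the same feature dimension")
        # Gate score depends on both streams, so the mix varies per token.
        score = (
            sum(w * x for w, x in zip(w_sem, s_tok))
            + sum(w * x for w, x in zip(w_trace, t_tok))
            + bias
        )
        g = sigmoid(score)
        fused.append([g * s + (1.0 - g) * t for s, t in zip(s_tok, t_tok)])
    return fused


# Usage: with zero weights and zero bias the gate is 0.5 (a plain average).
semantic_feats = [[1.0, 2.0]]
trace_feats = [[3.0, 4.0]]
print(gated_fusion(semantic_feats, trace_feats, [0.0, 0.0], [0.0, 0.0]))
```

In a trained model the gate weights would be learned, so tokens where semantic evidence is more informative lean toward the first stream and tokens dominated by low-level artifacts lean toward the second.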

Published

2026-03-14

How to Cite

Zhang, J., Feng, W., Wang, S., Kou, F., Yu, H., & Niu, S. (2026). Collaborative Transformers with Multi-Level Forensic Attention for Image Manipulation Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12556–12563. https://doi.org/10.1609/aaai.v40i15.38250

Issue

Section

AAAI Technical Track on Computer Vision XII