MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization

Authors

  • Kunlun Zeng, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
  • Ri Cheng, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
  • Weimin Tan, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
  • Bo Yan, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University

DOI:

https://doi.org/10.1609/aaai.v38i7.28520

Keywords:

CV: Segmentation, CV: Object Detection & Categorization

Abstract

Deep learning-based models have made great progress in image tampering localization, which aims to distinguish between manipulated and authentic regions. However, these models suffer from inefficient training. This is because they use ground-truth mask labels mainly through the cross-entropy loss, which prioritizes per-pixel precision but disregards the spatial location and shape details of manipulated regions. To address this problem, we propose a Mask-Guided Query-based Transformer Framework (MGQFormer), which uses ground-truth masks to guide the learnable query token (LQT) in identifying the forged regions. Specifically, we extract feature embeddings of the ground-truth masks as the guiding query token (GQT) and feed the GQT and the LQT into MGQFormer separately, so that each estimates the fake regions. We then propose a mask-guided loss that reduces the feature distance between the GQT and the LQT, so that MGQFormer learns the position and shape information contained in the ground-truth mask labels. We also observe that this mask-guided training strategy has a significant impact on the convergence speed of MGQFormer training. Extensive experiments on multiple benchmarks show that our method significantly improves over state-of-the-art methods.
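The training objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which distance is used between the GQT and the LQT, how the terms are weighted, or the token shapes. The mean squared distance, the `weight` parameter, the toy shapes, and all function names below are assumptions chosen only to show how a per-pixel loss and a query-distance term might be combined.

```python
import numpy as np


def pixel_cross_entropy(prob, mask, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(mask * np.log(prob) + (1.0 - mask) * np.log(1.0 - prob)))


def query_distance(lqt, gqt):
    """Assumed feature distance: mean squared difference between the query tokens."""
    return float(np.mean((lqt - gqt) ** 2))


def total_loss(prob_lqt, prob_gqt, mask, lqt, gqt, weight=1.0):
    """Hypothetical objective: per-pixel losses for both predictions plus a
    weighted term that pulls the learnable query toward the guiding query."""
    localization = pixel_cross_entropy(prob_lqt, mask) + pixel_cross_entropy(prob_gqt, mask)
    return localization + weight * query_distance(lqt, gqt)


# Toy data standing in for real model outputs (all shapes are assumptions).
rng = np.random.default_rng(0)
mask = (rng.random((8, 8)) > 0.7).astype(np.float64)  # ground-truth mask
prob_lqt = rng.random((8, 8))                          # prediction from the LQT
prob_gqt = mask * 0.9 + 0.05                           # prediction from the GQT
lqt = rng.normal(size=(2, 16))                         # learnable query tokens
gqt = rng.normal(size=(2, 16))                         # guiding query tokens

print("loss with distant queries:", total_loss(prob_lqt, prob_gqt, mask, lqt, gqt))
print("loss with matched queries:", total_loss(prob_lqt, prob_gqt, mask, gqt, gqt))
```

In this sketch the distance term vanishes when the two queries coincide, so only the per-pixel terms remain. That mirrors the idea in the abstract that the LQT is encouraged to carry the same position and shape information as the mask-derived GQT.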

Published

2024-03-24

How to Cite

Zeng, K., Cheng, R., Tan, W., & Yan, B. (2024). MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6944–6952. https://doi.org/10.1609/aaai.v38i7.28520

Section

AAAI Technical Track on Computer Vision VI