Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Authors

  • Zhengfei Xu, School of Computer Science and Technology, Beijing Institute of Technology
  • Sijia Zhao, School of Computer Science and Technology, Beijing Institute of Technology
  • Yanchao Hao, Platform and Content Group, Tencent
  • Xiaolong Liu, Platform and Content Group, Tencent
  • Lili Li, Platform and Content Group, Tencent
  • Yuyang Yin, Platform and Content Group, Tencent
  • Bo Li, Platform and Content Group, Tencent
  • Xi Chen, Platform and Content Group, Tencent
  • Xin Xin, School of Computer Science and Technology, Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i12.33416

Abstract

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding: it matches objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs such as clicks or bounding boxes offer a more convenient alternative. We therefore propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, supplementing the reference methods available for VEL. To facilitate research on this task, we construct the MaskOVEN-Wiki dataset through a fully automatic reverse region-to-entity annotation framework. The dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, advancing visual understanding toward finer granularity. Moreover, since pixel masks correspond to semantic regions in an image, we augment the patch-interacted attention of prior work with region-interacted attention via a visual semantic tokenization approach. Manual evaluation indicates that the reverse annotation framework achieves a 94.8% annotation success rate. Experimental results show that models trained on this dataset improve accuracy by 18 points over zero-shot models, and that the semantic tokenization method yields a further 5-point accuracy gain over the trained baseline.

Published

2025-04-11

How to Cite

Xu, Z., Zhao, S., Hao, Y., Liu, X., Li, L., Yin, Y., … Xin, X. (2025). Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking. Proceedings of the AAAI Conference on Artificial Intelligence, 39(12), 12981–12989. https://doi.org/10.1609/aaai.v39i12.33416

Section

AAAI Technical Track on Data Mining & Knowledge Management II