Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Authors

  • Zhengfei Xu, School of Computer Science and Technology, Beijing Institute of Technology
  • Sijia Zhao, School of Computer Science and Technology, Beijing Institute of Technology
  • Yanchao Hao, Platform and Content Group, Tencent
  • Xiaolong Liu, Platform and Content Group, Tencent
  • Lili Li, Platform and Content Group, Tencent
  • Yuyang Yin, Platform and Content Group, Tencent
  • Bo Li, Platform and Content Group, Tencent
  • Xi Chen, Platform and Content Group, Tencent
  • Xin Xin, School of Computer Science and Technology, Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i12.33416

Abstract

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding: it matches objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs such as clicks or bounding boxes offer a more convenient alternative. We therefore propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, supplementing the reference methods available for VEL. To facilitate research on this task, we construct the MaskOVEN-Wiki dataset through a fully automatic reverse region-to-entity annotation framework. The dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, advancing visual understanding toward finer granularity. Moreover, since pixel masks correspond to semantic regions in an image, we augment the patch-interacted attention of prior work with region-interacted attention via a visual semantic tokenization approach. Manual evaluation indicates that the reverse annotation framework achieves a 94.8% annotation success rate. Experimental results show that models trained on this dataset improve accuracy by 18 points over zero-shot models, and that the semantic tokenization method yields a further 5-point accuracy gain over the trained baseline.

Published

2025-04-11

How to Cite

Xu, Z., Zhao, S., Hao, Y., Liu, X., Li, L., Yin, Y., … Xin, X. (2025). Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking. Proceedings of the AAAI Conference on Artificial Intelligence, 39(12), 12981–12989. https://doi.org/10.1609/aaai.v39i12.33416

Section

AAAI Technical Track on Data Mining & Knowledge Management II