Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

Authors

  • Jielong Tang, School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China
  • Zhenxing Wang, State Key Laboratory of Intelligent Game, Institute of Software, Chinese Academy of Sciences, Beijing, China
  • ZiYang Gong, School of Atmospheric Sciences, Sun Yat-sen University, Zhuhai, China
  • Jianxing Yu, School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China; Pazhou Lab, Guangzhou, China
  • Xiangwei Zhu, School of Electronics and Communication Engineering, Sun Yat-sen University, Guangzhou, China
  • Jian Yin, School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China

DOI:

https://doi.org/10.1609/aaai.v39i24.34711

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task that aims to simultaneously extract entity spans, entity types, and the corresponding visual regions from given sentence-image pairs. Recent unified methods based on machine reading comprehension or sequence generation frameworks show limitations in this difficult task. The former, which rely on human-designed type queries, struggle to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, which decode entities one by one in a fixed order, suffer from exposure bias. We maintain that these works misunderstand the relationships between multimodal entities. To tackle these issues, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at the intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Building on this distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction problem, guiding the model to establish appropriate inter-entity relationships from an optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to improve the alignment of the two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks.
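
As a rough illustration of what "set prediction with optimal global matching" can mean in general, the sketch below assigns an unordered set of query predictions to gold entities by minimizing a total matching cost. It is a toy example under stated assumptions, not the MQSPN implementation: the function names `match_cost` and `optimal_matching`, the 0/1 cost terms, the token spans, and the region labels such as `box_0` are all hypothetical, and the brute-force search over permutations stands in for an efficient assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations

def match_cost(pred, gold):
    """Hypothetical cost: one penalty point per mismatching field."""
    if gold is None:  # padding slot meaning "no entity"
        return 0.0 if pred["type"] == "none" else 1.0
    cost = 0.0
    cost += 0.0 if pred["type"] == gold["type"] else 1.0
    cost += 0.0 if pred["span"] == gold["span"] else 1.0
    cost += 0.0 if pred["region"] == gold["region"] else 1.0
    return cost

def optimal_matching(preds, golds):
    """Find the one-to-one assignment of predictions to gold entities
    (padded with None) that has the lowest total cost.
    Assumes len(preds) >= len(golds); brute force, toy sizes only."""
    padded = list(golds) + [None] * (len(preds) - len(golds))
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(padded))):
        total = sum(match_cost(preds[i], padded[j]) for i, j in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return [(i, padded[j]) for i, j in enumerate(best_perm)], best_cost

# Three queries predict entities in arbitrary order; one predicts "no entity".
# Spans are made-up token offsets, regions are made-up box labels.
preds = [
    {"type": "Shoes", "span": (3, 6), "region": "box_1"},
    {"type": "none", "span": None, "region": None},
    {"type": "Person", "span": (0, 1), "region": "box_0"},
]
golds = [
    {"type": "Person", "span": (0, 1), "region": "box_0"},  # e.g. "Jordan"
    {"type": "Shoes", "span": (3, 6), "region": "box_1"},   # e.g. "off-White x Jordan"
]

pairs, total = optimal_matching(preds, golds)
for query_index, gold in pairs:
    print(query_index, gold["type"] if gold else "no entity")
```

Because the assignment is chosen globally, the order in which queries emit entities does not affect the result, which is the property that distinguishes set prediction from one-by-one decoding.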

Published

2025-04-11

How to Cite

Tang, J., Wang, Z., Gong, Z., Yu, J., Zhu, X., & Yin, J. (2025). Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25246–25254. https://doi.org/10.1609/aaai.v39i24.34711

Section

AAAI Technical Track on Natural Language Processing III