Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation

Authors

  • Tianyu Guo Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Haowei Wang Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Yiwei Ma Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Jiayi Ji Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Xiaoshuai Sun Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

DOI:

https://doi.org/10.1609/aaai.v38i3.27969

Keywords:

CV: Multi-modal Vision, CV: Language and Vision

Abstract

Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the semantic and visual relationships between phrase-level instances, limiting their capacity for the complex multi-modal reasoning that PNG requires. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module that leverages semantic priors to generate distinctive features. This module combines dynamic channel-wise convolution with pixel-wise convolution to embed semantic information and establish semantics-guided inter-object relationships. We then propose a Visual Context Verification (VCV) module that provides visual cues, eliminating potential spatial biases introduced by the semantic priors and further refining the visual features produced by the preceding module. Extensive experiments on PNG benchmark datasets show that our approach achieves state-of-the-art performance, outperforming existing methods by a considerable margin with a 3.9-point improvement in the overall metric. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.
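
The abstract describes the SCC module as conditioning convolution kernels on phrase semantics: a dynamic channel-wise (depthwise) convolution whose kernels are generated from the phrase embedding, followed by a pixel-wise (1×1) convolution. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the class name `SemanticContextConvSketch`, the shapes, and the kernel-generation layer are all assumptions for illustration, not the authors' implementation (see the project webpage for the official code).

```python
# Hypothetical sketch of the SCC idea from the abstract: phrase embeddings
# generate per-channel (depthwise) kernels applied to visual features,
# followed by a pixel-wise (1x1) convolution. Shapes and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticContextConvSketch(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, phrase_dim: int = 256):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Generates one depthwise kernel per channel from the phrase embedding.
        self.kernel_gen = nn.Linear(phrase_dim, channels * kernel_size * kernel_size)
        # Pixel-wise (1x1) convolution to mix channels after the dynamic step.
        self.pixel_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, phrase: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; phrase: (B, phrase_dim) embedding.
        b, c, h, w = feat.shape
        weights = self.kernel_gen(phrase).view(
            b * c, 1, self.kernel_size, self.kernel_size
        )
        # Grouped-conv trick: fold the batch into the channel axis so each
        # sample is filtered by its own semantically conditioned kernels.
        out = F.conv2d(
            feat.reshape(1, b * c, h, w),
            weights,
            padding=self.kernel_size // 2,
            groups=b * c,
        ).view(b, c, h, w)
        return self.pixel_conv(out)


if __name__ == "__main__":
    scc = SemanticContextConvSketch(channels=64, phrase_dim=256)
    x = scc(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(x.shape)  # torch.Size([2, 64, 32, 32])
```

The depthwise-then-pointwise split mirrors the abstract's "channel-wise plus pixel-wise" description: the dynamic depthwise step injects phrase semantics into each channel independently, while the 1×1 convolution re-mixes channels to relate objects across the feature map.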

Published

2024-03-24

How to Cite

Guo, T., Wang, H., Ma, Y., Ji, J., & Sun, X. (2024). Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 1985-1993. https://doi.org/10.1609/aaai.v38i3.27969

Section

AAAI Technical Track on Computer Vision II