Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation
DOI: https://doi.org/10.1609/aaai.v38i3.27969
Keywords: CV: Multi-modal Vision, CV: Language and Vision
Abstract
Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their capacity for the complex multi-modal reasoning that PNG requires. To tackle this issue, we propose XPNG, a "differentiation-refinement-localization" reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module that leverages semantic priors to generate distinctive features. This module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. We then propose a Visual Context Verification (VCV) module that provides visual cues, eliminating potential spatial biases introduced by semantics and further refining the visual features generated by the SCC module. Extensive experiments on PNG benchmark datasets show that our approach achieves state-of-the-art performance, outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our code and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.
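To make the SCC idea concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: a phrase embedding dynamically generates channel-wise (depthwise) kernels that modulate the visual features, followed by a pixel-wise (1x1) convolution. This is an assumption-laden illustration, not the paper's implementation; all class and variable names here are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticContextConv(nn.Module):
    """Hypothetical sketch of an SCC-style block: a phrase (semantic)
    embedding generates dynamic depthwise kernels that are applied to the
    visual feature map, followed by a pixel-wise (1x1) convolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Predict one k*k depthwise kernel per channel from the phrase feature.
        self.kernel_gen = nn.Linear(channels, channels * kernel_size * kernel_size)
        # Pixel-wise (1x1) convolution to mix channels after the dynamic conv.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, phrase: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; phrase: (B, C) semantic embedding.
        b, c, h, w = feat.shape
        k = self.kernel_size
        # (B*C, 1, k, k) dynamic depthwise kernels, one per channel per sample.
        kernels = self.kernel_gen(phrase).view(b * c, 1, k, k)
        # Grouped conv applies each sample's own kernels to its own channels.
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=k // 2, groups=b * c)
        out = out.view(b, c, h, w)
        return self.pointwise(F.relu(out)) + feat  # residual connection

# Usage: one phrase embedding modulating a feature map.
feat = torch.randn(2, 64, 32, 32)
phrase = torch.randn(2, 64)
scc = SemanticContextConv(64)
print(scc(feat, phrase).shape)  # torch.Size([2, 64, 32, 32])

The grouped-convolution trick (folding the batch into the channel dimension) is a standard way to apply per-sample dynamic kernels in a single call; it is used here purely for compactness.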
Published: 2024-03-24
How to Cite
Guo, T., Wang, H., Ma, Y., Ji, J., & Sun, X. (2024). Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 1985-1993. https://doi.org/10.1609/aaai.v38i3.27969
Section: AAAI Technical Track on Computer Vision II