Action-and-object Aware Alignment for Partially Relevant Video Retrieval
DOI:
https://doi.org/10.1609/aaai.v40i4.37271
Abstract
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments for a given text query. This task is extremely challenging, as untrimmed videos often include numerous actions and objects unrelated to the query. However, existing methods usually struggle with fine-grained action-object modeling, limiting their retrieval performance. To tackle this challenge, we introduce Action-and-object Aware Alignment for Partially Relevant Video Retrieval (A3PRVR), a dual-branch framework designed to enhance retrieval by improving the modeling of action-object relationships. Specifically, we propose a Query-specific Deformable Temporal Attention (Q-DTA) module to effectively capture action-relevant object information in video features, while filtering out irrelevant content. Additionally, we propose an action-and-object aware alignment module to enable fine-grained textual understanding and video-text alignment. It uses action- and object-aware contrastive losses to enhance the model's sensitivity to action-object distinctions in the text query. Compared to state-of-the-art methods, A3PRVR achieves an average relative gain of 6.5% in SumR across the Charades-STA, ActivityNet-Caption, and TVR datasets.
Published
2026-03-14
How to Cite
Chen, C., Zhou, K., Wen, Z., You, Z., Li, Y., Xiang, T., & Tan, M. (2026). Action-and-object Aware Alignment for Partially Relevant Video Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2814–2822. https://doi.org/10.1609/aaai.v40i4.37271
Issue
Vol. 40 No. 4 (2026)
Section
AAAI Technical Track on Computer Vision I