CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer
DOI:
https://doi.org/10.1609/aaai.v39i6.32717
Abstract
Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural-language queries. Existing VMR methods employ various strategies for cross-modal alignment but still face challenges such as a limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose a novel Concept Decomposition Transformer (CDTR) model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations, reflecting the relevance among various concepts and capturing the fine-grained semantics crucial for cross-modal matching. These decomposed concept representations are then used as pseudo-labels, classified as positive or negative samples by adaptive concept-specific thresholds. Fine-grained concept alignment is then performed both within the video modality (intra-modal) and between the textual and visual modalities (cross-modal), aligning the conceptual components of features, strengthening the model's ability to distinguish fine-grained semantics, and alleviating the issues of semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of CDTR, which outperforms state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.
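The decompose-then-threshold step described in the abstract, where moment and query features are projected into concept representations and binarized into positive/negative pseudo-labels by per-concept thresholds, could be sketched roughly as follows. This is an illustrative sketch only: the learned concept bank, cosine-similarity projection, and quantile-based thresholds are assumptions standing in for the paper's actual modules, not CDTR's implementation.

```python
import numpy as np

def decompose_to_concepts(features, concept_bank):
    """Project modality features onto a shared concept bank.
    features: (N, d) moment or query embeddings; concept_bank: (K, d)
    concept vectors (hypothetical stand-in for the learned module)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = concept_bank / np.linalg.norm(concept_bank, axis=1, keepdims=True)
    return f @ c.T  # (N, K) concept-relevance scores (cosine similarity)

def pseudo_labels(scores, quantile=0.7):
    """Binarize concept scores with an adaptive per-concept threshold.
    Here the threshold is a per-concept quantile of the batch scores,
    a simple proxy for the paper's adaptive concept-specific thresholds."""
    thresholds = np.quantile(scores, quantile, axis=0)  # one threshold per concept
    return (scores >= thresholds).astype(np.int64)      # 1 = positive sample

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))     # 8 video moments, 16-dim embeddings
concepts = rng.normal(size=(4, 16))  # 4 concept vectors
labels = pseudo_labels(decompose_to_concepts(feats, concepts))
print(labels.shape)  # (8, 4): one positive/negative pseudo-label per concept
```

The resulting per-concept pseudo-labels would then drive the intra-modal and cross-modal alignment losses, so that features sharing a concept are pulled together while features differing on that concept are pushed apart.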
Published
2025-04-11
How to Cite
Ran, R., Wei, J., Cai, X., Guan, X., Zou, J., Yang, Y., & Shen, H. T. (2025). CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 6684–6692. https://doi.org/10.1609/aaai.v39i6.32717
Section
AAAI Technical Track on Computer Vision V