CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer
DOI:
https://doi.org/10.1609/aaai.v39i6.32717
Abstract
Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural-language queries. Existing VMR methods employ various strategies for cross-modal alignment but still face challenges such as a limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose a novel Concept Decomposition Transformer (CDTR) model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations, reflecting the relevance among various concepts and capturing the fine-grained semantics crucial for cross-modal matching. These decomposed concept representations are then used as pseudo-labels, classified as positive or negative samples by adaptive concept-specific thresholds. Fine-grained concept alignment is then performed both within the video modality (intra-modal) and between the textual and visual modalities (cross-modal), aligning the conceptual components of features, strengthening the model's ability to distinguish fine-grained semantics, and alleviating the issues of semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of CDTR, which outperforms state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.
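The decompose-then-threshold step described in the abstract, where moment and query features are projected into concept representations and binarized into positive/negative pseudo-labels by per-concept thresholds, could be sketched roughly as follows. This is an illustrative sketch only: the learned concept bank, cosine-similarity projection, and quantile-based thresholds are assumptions standing in for the paper's actual modules, not CDTR's implementation.

```python
import numpy as np

def decompose_to_concepts(features, concept_bank):
    """Project modality features onto a shared concept bank.
    features: (N, d) moment or query embeddings; concept_bank: (K, d)
    concept vectors (hypothetical stand-in for the learned module)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = concept_bank / np.linalg.norm(concept_bank, axis=1, keepdims=True)
    return f @ c.T  # (N, K) concept-relevance scores (cosine similarity)

def pseudo_labels(scores, quantile=0.7):
    """Binarize concept scores with an adaptive per-concept threshold.
    Here the threshold is a per-concept quantile of the batch scores,
    a simple proxy for the paper's adaptive concept-specific thresholds."""
    thresholds = np.quantile(scores, quantile, axis=0)  # one threshold per concept
    return (scores >= thresholds).astype(np.int64)      # 1 = positive sample

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))     # 8 video moments, 16-dim embeddings
concepts = rng.normal(size=(4, 16))  # 4 concept vectors
labels = pseudo_labels(decompose_to_concepts(feats, concepts))
print(labels.shape)  # (8, 4): one positive/negative pseudo-label per concept
```

The resulting per-concept pseudo-labels would then drive the intra-modal and cross-modal alignment losses, so that features sharing a concept are pulled together while features differing on that concept are pushed apart.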
Published
2025-04-11
How to Cite
Ran, R., Wei, J., Cai, X., Guan, X., Zou, J., Yang, Y., & Shen, H. T. (2025). CDTR: Semantic Alignment for Video Moment Retrieval Using Concept Decomposition Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 6684–6692. https://doi.org/10.1609/aaai.v39i6.32717
Section
AAAI Technical Track on Computer Vision V