CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding
DOI:
https://doi.org/10.1609/aaai.v38i4.28118
Keywords:
CV: Language and Vision, CV: Object Detection & Categorization, CV: Video Understanding & Activity Analysis
Abstract
This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.
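The abstract gives only a high-level description of the entangled decoder. As a non-authoritative illustration of how a single shared query might couple temporal boundary prediction with per-frame spatial localization, here is a minimal PyTorch sketch; all module names, dimensions, and prediction heads are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EntangledDecoderSketch(nn.Module):
    """Illustrative only: one shared ("entangled") query attends to fused
    space-time features and is decoded into both an event boundary and
    per-frame boxes, so the two predictions share one representation."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.query = nn.Embedding(1, d_model)       # the entangled query (assumed)
        self.boundary_head = nn.Linear(d_model, 2)  # normalized (start, end)
        self.box_head = nn.Linear(d_model, 4)       # per-frame (cx, cy, w, h)

    def forward(self, memory: torch.Tensor):
        # memory: (B, T, d_model) multi-modal space-time features from the encoder
        b = memory.size(0)
        q = self.query.weight.unsqueeze(0).expand(b, -1, -1)   # (B, 1, d)
        q = self.decoder(q, memory)                            # (B, 1, d)
        boundary = self.boundary_head(q).sigmoid().squeeze(1)  # (B, 2)
        # Condition per-frame box prediction on the same query so that
        # spatial localization stays consistent with the temporal decision.
        boxes = self.box_head(memory + q).sigmoid()            # (B, T, 4)
        return boundary, boxes

# Toy usage: 2 clips, 64 frames, 256-d fused features.
feats = torch.randn(2, 64, 256)
boundary, boxes = EntangledDecoderSketch()(feats)
print(boundary.shape, boxes.shape)  # torch.Size([2, 2]) torch.Size([2, 64, 4])
```

Sharing one query for both heads is the sketch's stand-in for the paper's entanglement idea: because the boundary and the boxes are decoded from the same representation, the object trajectory and the event extent cannot drift apart independently.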
Published
2024-03-24
How to Cite
Liang, Y., Liang, X., Tang, Y., Yang, Z., Li, Z., Wang, J., … Huang, S.-L. (2024). CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3324–3332. https://doi.org/10.1609/aaai.v38i4.28118
Issue
Vol. 38 No. 4 (2024): Proceedings of the AAAI Conference on Artificial Intelligence
Section
AAAI Technical Track on Computer Vision III