Curriculum Multi-Negative Augmentation for Debiased Video Grounding
DOI:
https://doi.org/10.1609/aaai.v37i1.25204
Keywords:
CV: Video Understanding & Activity Analysis, CV: Language and Vision, CV: Multi-modal Vision
Abstract
Video Grounding (VG) aims to locate the desired segment from a video given a sentence query. Recent studies have found that current VG models are prone to over-rely on the ground-truth moment annotation distribution biases in the training set. To discourage the standard VG model from exploiting such temporal annotation biases and to improve its generalization ability, we propose multiple negative augmentations applied in a hierarchical way, including cross-video augmentations at the clip and video levels and self-shuffled augmentations with masks. These augmentations effectively diversify the data distribution so that the model makes more reasonable predictions instead of merely fitting the temporal biases. However, directly adopting such a data augmentation strategy inevitably introduces some noise, as shown in our case studies, since not all of the handcrafted augmentations are semantically irrelevant to the ground-truth video. To further denoise and improve grounding accuracy, we design a multi-stage curriculum strategy that adaptively trains the standard VG model from easy to hard negative augmentations. Experiments on the newly collected Charades-CD and ActivityNet-CD datasets demonstrate that our proposed strategy improves the performance of the base model in both i.i.d. and o.o.d. scenarios.
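A minimal sketch of the recipe the abstract describes, assuming videos are represented as per-frame feature lists and the annotation is a frame-index span. All function names, the mask ratio, and the stage thresholds below are illustrative assumptions, not the authors' released implementation:

    import random

    def video_level_negative(donor_feats):
        # Cross-video, video-level: an entire unrelated video is the easiest
        # negative, since nothing in it can match the query.
        return list(donor_feats)

    def clip_level_negative(video_feats, donor_feats, gt_span):
        # Cross-video, clip-level: splice a clip from another video over the
        # ground-truth segment so the annotated span no longer fits the query.
        s, e = gt_span
        n = min(e - s, len(donor_feats))
        neg = list(video_feats)
        neg[s:s + n] = donor_feats[:n]
        return neg

    def shuffled_mask_negative(video_feats, gt_span, mask_token=None, mask_ratio=0.15):
        # Self-shuffled negative: shuffle the frames inside the ground-truth
        # span and mask a fraction of them (mask_ratio is an assumed
        # hyper-parameter; mask_token stands in for a learned mask embedding).
        s, e = gt_span
        neg = list(video_feats)
        segment = neg[s:e]
        random.shuffle(segment)
        segment = [mask_token if random.random() < mask_ratio else f for f in segment]
        neg[s:e] = segment
        return neg

    def curriculum_negatives(video_feats, donor_feats, gt_span, stage):
        # Multi-stage curriculum: begin with the easiest (video-level)
        # negatives, then progressively add harder clip-level and
        # self-shuffled ones as training advances.
        negs = [video_level_negative(donor_feats)]
        if stage >= 1:
            negs.append(clip_level_negative(video_feats, donor_feats, gt_span))
        if stage >= 2:
            negs.append(shuffled_mask_negative(video_feats, gt_span))
        return negs

The easy-to-hard ordering reflects that a wholly unrelated video is trivially distinguishable from the positive, whereas a shuffled-and-masked copy of the ground-truth segment is nearly indistinguishable and therefore the hardest negative to learn from.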
Published
2023-06-26
How to Cite
Lan, X., Yuan, Y., Chen, H., Wang, X., Jie, Z., Ma, L., Wang, Z., & Zhu, W. (2023). Curriculum Multi-Negative Augmentation for Debiased Video Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 1213-1221. https://doi.org/10.1609/aaai.v37i1.25204
Issue
Vol. 37 No. 1 (2023)
Section
AAAI Technical Track on Computer Vision I