Curriculum Multi-Negative Augmentation for Debiased Video Grounding


  • Xiaohan Lan Tsinghua University
  • Yitian Yuan Meituan Inc.
  • Hong Chen Tsinghua University
  • Xin Wang Tsinghua University
  • Zequn Jie Meituan Inc.
  • Lin Ma Meituan Inc.
  • Zhi Wang Tsinghua University
  • Wenwu Zhu Tsinghua University



CV: Video Understanding & Activity Analysis, CV: Language and Vision, CV: Multi-modal Vision


Video Grounding (VG) aims to locate the desired segment in a video given a sentence query. Recent studies have found that current VG models are prone to over-rely on the ground-truth moment annotation distribution biases in the training set. To discourage the standard VG model from exploiting such temporal annotation biases and to improve its generalization ability, we propose multiple negative augmentations organized hierarchically, including cross-video augmentations at the clip and video levels, and self-shuffled augmentations with masks. These augmentations effectively diversify the data distribution so that the model makes more reasonable predictions instead of merely fitting the temporal biases. However, directly adopting such a data augmentation strategy inevitably introduces some noise, as shown in our case studies, since not all of the handcrafted augmentations are semantically irrelevant to the ground-truth video. To further denoise and improve grounding accuracy, we design a multi-stage curriculum strategy that adaptively trains the standard VG model from easy to hard negative augmentations. Experiments on the newly collected Charades-CD and ActivityNet-CD datasets demonstrate that our proposed strategy improves the performance of the base model in both i.i.d. and o.o.d. scenarios.
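The easy-to-hard curriculum over negative augmentations described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the augmentation names (`video_level_cross`, `clip_level_cross`, `self_shuffle_masked`), the stage ordering, and the toy negative constructors are assumptions made for illustration; the real method operates on video features and trains a grounding model against these negatives.

```python
import random

# Hypothetical difficulty ordering (easy -> hard) for the negative
# augmentations mentioned in the abstract; names are illustrative only.
AUG_STAGES = [
    ["video_level_cross"],                        # stage 1: easiest negatives
    ["video_level_cross", "clip_level_cross"],    # stage 2: add clip-level
    ["video_level_cross", "clip_level_cross",
     "self_shuffle_masked"],                      # stage 3: hardest negatives
]

def sample_negative(video, aug_type, rng):
    """Toy negative constructors; a real system would edit video features."""
    if aug_type == "video_level_cross":
        # Replace the whole video with clips drawn from another video.
        return {"source": "other_video", "clips": list(video["clips"])}
    if aug_type == "clip_level_cross":
        # Swap one clip for a clip taken from a different video.
        clips = list(video["clips"])
        clips[rng.randrange(len(clips))] = "foreign_clip"
        return {"source": "mixed", "clips": clips}
    if aug_type == "self_shuffle_masked":
        # Shuffle the video's own clips and mark some content as masked.
        clips = list(video["clips"])
        rng.shuffle(clips)
        return {"source": "self", "clips": clips, "masked": True}
    raise ValueError(f"unknown augmentation: {aug_type}")

def curriculum_batches(video, epochs_per_stage, rng):
    """Yield (epoch, negatives), widening the negative pool stage by stage."""
    epoch = 0
    for stage_augs in AUG_STAGES:
        for _ in range(epochs_per_stage):
            negatives = [sample_negative(video, a, rng) for a in stage_augs]
            yield epoch, negatives
            epoch += 1
```

The key design point is that early stages expose the model only to negatives that are clearly irrelevant to the query, and harder, potentially noisier negatives (e.g. self-shuffled ones) are introduced only once the model has a stable grounding signal.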




How to Cite

Lan, X., Yuan, Y., Chen, H., Wang, X., Jie, Z., Ma, L., Wang, Z., & Zhu, W. (2023). Curriculum Multi-Negative Augmentation for Debiased Video Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 1213-1221.



AAAI Technical Track on Computer Vision I