Commonsense for Zero-Shot Natural Language Video Localization

Authors

  • Meghana Holla, Virginia Tech
  • Ismini Lourentzou, University of Illinois at Urbana-Champaign

DOI:

https://doi.org/10.1609/aaai.v38i3.27989

Keywords:

CV: Image and Video Retrieval, CV: Video Understanding & Activity Analysis, ML: Multimodal Learning, CV: Multi-modal Vision

Abstract

Zero-shot Natural Language Video Localization (NLVL) methods have shown promising results by training NLVL models exclusively on raw video data, dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs a Graph Convolutional Network (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of commonsense reasoning for zero-shot NLVL.
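The mechanism the abstract describes (a GCN over a video-conditioned commonsense concept graph, followed by cross-attention into the video and pseudo-query streams) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the single-layer GCN, the residual connection, the dimensions, and the names `GCNLayer` and `CommonsenseEnhancer` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the idea in the abstract:
# encode knowledge-graph concepts with a GCN, then cross-attend video
# (or pseudo-query) features over the encoded concepts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes) binary adjacency
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # self-loops
        d_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)                 # degree^-1/2
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return F.relu(norm_adj @ self.proj(x))


class CommonsenseEnhancer(nn.Module):
    """Cross-attends video or query features over GCN-encoded concepts."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.gcn = GCNLayer(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, concept_feats, adj):
        # feats: (B, T, dim) video segment or pseudo-query token features
        # concept_feats: (N, dim) embeddings of knowledge-graph concepts
        # adj: (N, N) adjacency of the video-conditioned concept subgraph
        concepts = self.gcn(concept_feats, adj)                    # (N, dim)
        concepts = concepts.unsqueeze(0).expand(feats.size(0), -1, -1)
        enhanced, _ = self.cross_attn(feats, concepts, concepts)   # attend
        return feats + enhanced                          # residual enhancement


if __name__ == "__main__":
    B, T, N, dim = 2, 16, 10, 256
    model = CommonsenseEnhancer(dim)
    video = torch.randn(B, T, dim)                  # dummy video features
    concepts = torch.randn(N, dim)                  # dummy concept embeddings
    adj = (torch.rand(N, N) > 0.7).float()          # dummy concept graph
    print(model(video, concepts, adj).shape)        # torch.Size([2, 16, 256])
```

In this reading, the same enhancer is applied to both the video and pseudo-query representations before the localization head, which is one plausible way to realize "enhance the encoded video and pseudo-query representations prior to localization."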

Published

2024-03-24

How to Cite

Holla, M., & Lourentzou, I. (2024). Commonsense for Zero-Shot Natural Language Video Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2166–2174. https://doi.org/10.1609/aaai.v38i3.27989

Issue

Vol. 38 No. 3 (2024)

Section

AAAI Technical Track on Computer Vision II