Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization

Xiongwen Deng; Haoyu Tang; Han Jiang; Qinghai Zheng; Jihua Zhu

doi:10.1609/aaai.v39i3.32276

Authors

Xiongwen Deng (School of Software Engineering, Xi’an Jiaotong University; School of Software, Shandong University)
Haoyu Tang (School of Software, Shandong University)
Han Jiang (School of Software Engineering, Xi’an Jiaotong University)
Qinghai Zheng (College of Computer and Data Science, Fuzhou University)
Jihua Zhu (School of Software Engineering, Xi’an Jiaotong University)

Abstract

Zero-shot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos, so that a localization model can be trained without any manual annotations. Existing approaches typically produce pseudo queries as simple words, which overlooks the complexity of queries in real-world scenarios. Considering the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a potential solution. However, directly integrating LLMs into existing approaches introduces several issues, including insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs to enhance zero-shot NLVL performance. To address these issues, we propose BTDP, an innovative framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation, which identifies both visual and semantic boundaries to generate atomic segments and activity descriptions, tackling the issue of insensitivity; and 2) Context Aggregation, which employs LLMs with a self-evaluation process to aggregate and summarize global video information into optimized pseudo moment-query pairs, tackling the issues of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method.
