Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization

Authors

  • Xiongwen Deng School of Software Engineering, Xi’an Jiaotong University; School of Software, Shandong University
  • Haoyu Tang School of Software, Shandong University
  • Han Jiang School of Software Engineering, Xi’an Jiaotong University
  • Qinghai Zheng College of Computer and Data Science, Fuzhou University
  • Jihua Zhu School of Software Engineering, Xi’an Jiaotong University

DOI:

https://doi.org/10.1609/aaai.v39i3.32276

Abstract

Zero-shot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos, so that a localization model can be trained without any manual annotations. Existing approaches typically produce pseudo queries as simple words, overlooking the complexity of queries in real-world scenarios. Given the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a potential solution. However, directly integrating LLMs into existing approaches introduces several issues, including insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs to enhance zero-shot NLVL performance. To address these issues, we propose BTDP, a framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation, which identifies both visual and semantic boundaries to generate atomic segments and activity descriptions, tackling the issue of insensitivity; and 2) Context Aggregation, which employs LLMs with a self-evaluation process to aggregate and summarize global video information for optimized pseudo moment-query pairs, tackling the issues of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method.

Published

2025-04-11

How to Cite

Deng, X., Tang, H., Jiang, H., Zheng, Q., & Zhu, J. (2025). Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2717–2725. https://doi.org/10.1609/aaai.v39i3.32276

Section

AAAI Technical Track on Computer Vision II