Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency

Authors

  • Peijun Bao, Nanyang Technological University
  • Zihao Shao, Peking University
  • Wenhan Yang, Peng Cheng Laboratory
  • Boon Poh Ng, Nanyang Technological University
  • Meng Hwa Er, Nanyang Technological University
  • Alex C. Kot, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v38i2.27832

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis, NLP: Language Grounding & Multi-modal NLP

Abstract

Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) a limited input distribution, namely that the limited writing styles of the language queries, produced by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, which provides insufficient supervision guidance. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input samples is enriched to obtain diverse multi-view versions, and a consistency constraint then regularizes their localization results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models supervise each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the model's generalization across different dimensions of language variation, we create extensive datasets by building upon existing ones. Our experiments demonstrate substantial performance improvements that adapt to diverse kinds of language queries.
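The abstract does not spell out the distillation objective; purely as a rough illustration, the following is a minimal PyTorch-style sketch of a paraphrase-consistency distillation step of the kind described above. All names here (`student`, `teacher`, `video_feats`, `query_tokens`, `paraphrase_tokens`) are hypothetical placeholders rather than the authors' API, and the KL-based formulation is an assumption about one way the consistency constraint could be instantiated.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, teacher, video_feats,
                                  query_tokens, paraphrase_tokens):
    """Hypothetical sketch: distill a frozen teacher into a lightweight
    student while enforcing consistent localization between an original
    query and its LLM-generated paraphrase.

    `student` and `teacher` are assumed to map a (video, query) pair to
    localization scores over temporal proposals.
    """
    with torch.no_grad():
        # Teacher localizes the original query; its predictions act as
        # fixed soft targets for the student.
        teacher_scores = teacher(video_feats, query_tokens).softmax(dim=-1)

    # Student localizes both the original and the paraphrased query.
    student_orig = student(video_feats, query_tokens).log_softmax(dim=-1)
    student_para = student(video_feats, paraphrase_tokens).log_softmax(dim=-1)

    # Standard teacher-student distillation on the original query ...
    distill = F.kl_div(student_orig, teacher_scores, reduction="batchmean")
    # ... plus a consistency term tying the paraphrase's localization
    # to the same teacher targets.
    consistency = F.kl_div(student_para, teacher_scores, reduction="batchmean")

    return distill + consistency
```

Freezing the teacher's predictions under `torch.no_grad()` mirrors standard teacher-student distillation, while the second KL term is what couples the paraphrased query's localization to the original one's.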

Published

2024-03-24

How to Cite

Bao, P., Shao, Z., Yang, W., Ng, B. P., Er, M. H., & Kot, A. C. (2024). Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 747-755. https://doi.org/10.1609/aaai.v38i2.27832

Section

AAAI Technical Track on Computer Vision I