Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency

Authors

  • Peijun Bao, Nanyang Technological University
  • Zihao Shao, Peking University
  • Wenhan Yang, Peng Cheng Laboratory
  • Boon Poh Ng, Nanyang Technological University
  • Meng Hwa Er, Nanyang Technological University
  • Alex C. Kot, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v38i2.27832

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis, NLP: Language Grounding & Multi-modal NLP

Abstract

Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) a limited input distribution, namely that the limited writing styles of the language queries, produced by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, which provides insufficient supervision guidance. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input samples is enriched to obtain diverse multi-view versions, and a consistency constraint then regularizes their localization results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models supervise each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the model's generalization across different dimensions of language variation, we create extensive datasets by building upon existing ones. Our experiments demonstrate substantial performance improvements that adapt to diverse kinds of language queries.
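The abstract does not spell out the distillation objective; purely as a rough illustration, the following is a minimal PyTorch-style sketch of a paraphrase-consistency distillation step of the kind described above. All names here (`student`, `teacher`, `video_feats`, `query_tokens`, `paraphrase_tokens`) are hypothetical placeholders rather than the authors' API, and the KL-based formulation is an assumption about one way the consistency constraint could be instantiated.

```python
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, teacher, video_feats,
                                  query_tokens, paraphrase_tokens):
    """Hypothetical sketch: distill a frozen teacher into a lightweight
    student while enforcing consistent localization between an original
    query and its LLM-generated paraphrase.

    `student` and `teacher` are assumed to map a (video, query) pair to
    localization scores over temporal proposals.
    """
    with torch.no_grad():
        # Teacher localizes the original query; its predictions act as
        # fixed soft targets for the student.
        teacher_scores = teacher(video_feats, query_tokens).softmax(dim=-1)

    # Student localizes both the original and the paraphrased query.
    student_orig = student(video_feats, query_tokens).log_softmax(dim=-1)
    student_para = student(video_feats, paraphrase_tokens).log_softmax(dim=-1)

    # Standard teacher-student distillation on the original query ...
    distill = F.kl_div(student_orig, teacher_scores, reduction="batchmean")
    # ... plus a consistency term tying the paraphrase's localization
    # to the same teacher targets.
    consistency = F.kl_div(student_para, teacher_scores, reduction="batchmean")

    return distill + consistency
```

Freezing the teacher's predictions under `torch.no_grad()` mirrors standard teacher-student distillation, while the second KL term is what couples the paraphrased query's localization to the original one's.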

Published

2024-03-24

How to Cite

Bao, P., Shao, Z., Yang, W., Ng, B. P., Er, M. H., & Kot, A. C. (2024). Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 747-755. https://doi.org/10.1609/aaai.v38i2.27832

Section

AAAI Technical Track on Computer Vision I