Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
DOI:
https://doi.org/10.1609/aaai.v38i7.28472
Keywords:
CV: Video Understanding & Activity Analysis, CV: Language and Vision
Abstract
Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models such as CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP to open-vocabulary video visual relationship detection by prompt tuning on both the visual representation and the language input. Specifically, we enhance the image encoder of CLIP with spatio-temporal visual prompting to capture spatio-temporal contexts, making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, which achieves a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.
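For concreteness, the sketch below illustrates the two prompting ideas from the abstract in plain PyTorch: learnable spatio-temporal prompts prepended to frozen image-encoder patch tokens, and language prompts conditioned on a visual feature before being prepended to relation-name token embeddings. All class names, tensor shapes, and the exact placement of the prompts are assumptions made for illustration; this is a minimal sketch, not the paper's actual implementation.

```python
# Minimal sketch of multi-modal prompt tuning on a frozen CLIP-style model.
# Everything here (names, dimensions, prompt placement) is hypothetical.
import torch
import torch.nn as nn


class SpatioTemporalVisualPrompt(nn.Module):
    """Learnable prompt tokens prepended to frozen patch tokens.

    One prompt set per frame position, so the prompts can encode temporal
    order as well as per-frame spatial context (assumed design).
    """

    def __init__(self, num_prompts: int = 8, dim: int = 768, num_frames: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_frames, num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_frames, num_patches, dim) from a frozen encoder
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1, -1)
        # Prepend the learnable prompts to each frame's patch tokens.
        return torch.cat([prompts, patch_tokens], dim=2)


class VisualGuidedLanguagePrompt(nn.Module):
    """Language prompts conditioned on a pooled visual feature.

    The text-encoder input thus adapts to the video content before the
    relation-name tokens are encoded (assumed design).
    """

    def __init__(self, num_prompts: int = 4, dim: int = 512):
        super().__init__()
        self.base_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self, class_token_embeds: torch.Tensor,
                visual_feat: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, num_word_tokens, dim)
        # visual_feat: (dim,) pooled feature for one object pair
        cond = self.visual_proj(visual_feat)                 # (dim,)
        prompts = self.base_prompts + cond                   # broadcast: (num_prompts, dim)
        prompts = prompts.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        # Prepend visual-conditioned prompts to every class-name embedding.
        return torch.cat([prompts, class_token_embeds], dim=1)


# Toy usage with random tensors standing in for frozen CLIP outputs.
vp = SpatioTemporalVisualPrompt(num_prompts=8, dim=768, num_frames=4)
lp = VisualGuidedLanguagePrompt(num_prompts=4, dim=512)

patch_tokens = torch.randn(2, 4, 196, 768)   # (batch, frames, patches, dim)
prompted_patches = vp(patch_tokens)          # -> (2, 4, 204, 768)

class_embeds = torch.randn(50, 6, 512)       # 50 relation names, 6 word tokens each
visual_feat = torch.randn(512)               # pooled object-pair feature
prompted_text = lp(class_embeds, visual_feat)  # -> (50, 10, 512)
```

In this kind of setup only the prompt parameters (and the small projection) are trained while the CLIP backbone stays frozen, which is what lets the model generalize to relationship categories unseen during fine-tuning.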
Published
2024-03-24
How to Cite
Yang, S., Wang, Y., Ji, X., & Wu, X. (2024). Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6513-6521. https://doi.org/10.1609/aaai.v38i7.28472
Issue
Vol. 38 No. 7 (2024)
Section
AAAI Technical Track on Computer Vision VI