Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Authors

  • Shuo Yang, Shenzhen MSU-BIT University; Beijing Institute of Technology
  • Yongqi Wang, Beijing Institute of Technology
  • Xiaofeng Ji, Beijing Institute of Technology
  • Xinxiao Wu, Beijing Institute of Technology; Shenzhen MSU-BIT University

DOI:

https://doi.org/10.1609/aaai.v38i7.28472

Keywords:

CV: Video Understanding & Activity Analysis, CV: Language and Vision

Abstract

Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between static images and object relationships in videos. To address this challenge, we propose a multi-modal prompting method that adapts CLIP to open-vocabulary video visual relationship detection by prompt-tuning both the visual representation and the language input. Specifically, we enhance the image encoder of CLIP with spatio-temporal visual prompting to capture spatio-temporal contexts, making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, which achieves a significant gain of nearly 10% in mAP on novel relationship categories of the VidVRD dataset.
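To make the two prompting ideas in the abstract concrete, below is a minimal PyTorch sketch of spatio-temporal visual prompting and visual-guided language prompting. The module names, dimensions, toy transformer layer, and the simple pooling of context vectors are illustrative assumptions for exposition, not the authors' released implementation; the random tensors stand in for frozen CLIP frame and text features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVisualPrompt(nn.Module):
    """Learnable tokens prepended to per-frame object features so a small
    transformer layer can mix spatio-temporal context of an object tracklet.
    (Illustrative stand-in for prompting a frozen CLIP image encoder.)"""
    def __init__(self, num_prompts=8, dim=512):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) per-frame features of an object-pair tracklet
        b = frame_feats.size(0)
        tokens = torch.cat([self.prompts.expand(b, -1, -1), frame_feats], dim=1)
        return self.encoder(tokens)[:, 0]  # (B, dim) pooled relationship representation

class VisualGuidedLanguagePrompt(nn.Module):
    """Learnable text context vectors shifted by a projection of the visual
    feature, then combined with relationship-name embeddings (hypothetical pooling)."""
    def __init__(self, num_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, class_embeds, visual_feat):
        # class_embeds: (K, dim) frozen embeddings of relationship category names
        # visual_feat:  (B, dim) pooled visual representation of the object pair
        ctx = self.ctx.mean(dim=0) + self.proj(visual_feat)      # (B, dim)
        text = ctx.unsqueeze(1) + class_embeds.unsqueeze(0)      # (B, K, dim)
        return F.normalize(text, dim=-1)

# Toy usage: score relationship categories for a batch of object-pair tracklets.
B, T, K, D = 2, 16, 10, 512
frame_feats = torch.randn(B, T, D)   # stand-in for frozen CLIP frame features
class_embeds = torch.randn(K, D)     # stand-in for frozen CLIP text embeddings
vis_prompt = SpatioTemporalVisualPrompt(dim=D)
txt_prompt = VisualGuidedLanguagePrompt(dim=D)
v = F.normalize(vis_prompt(frame_feats), dim=-1)    # (B, D)
t = txt_prompt(class_embeds, v)                     # (B, K, D)
logits = torch.einsum("bd,bkd->bk", v, t)           # cosine similarity per category

Open-vocabulary recognition of a relationship then amounts to ranking these similarities over an arbitrary list of relationship names, including categories unseen during training.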

Published

2024-03-24

How to Cite

Yang, S., Wang, Y., Ji, X., & Wu, X. (2024). Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6513-6521. https://doi.org/10.1609/aaai.v38i7.28472

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI