Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
DOI:
https://doi.org/10.1609/aaai.v38i7.28472
Keywords:
CV: Video Understanding & Activity Analysis, CV: Language and Vision
Abstract
Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models such as CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP to open-vocabulary video visual relationship detection by prompt tuning on both the visual representation and the language input. Specifically, we enhance the image encoder of CLIP with spatio-temporal visual prompting to capture spatio-temporal contexts, making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, which achieves a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.
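For concreteness, the sketch below illustrates the two prompting ideas from the abstract in plain PyTorch: learnable spatio-temporal prompts prepended to frozen image-encoder patch tokens, and language prompts conditioned on a visual feature before being prepended to relation-name token embeddings. All class names, tensor shapes, and the exact placement of the prompts are assumptions made for illustration; this is a minimal sketch, not the paper's actual implementation.

```python
# Minimal sketch of multi-modal prompt tuning on a frozen CLIP-style model.
# Everything here (names, dimensions, prompt placement) is hypothetical.
import torch
import torch.nn as nn


class SpatioTemporalVisualPrompt(nn.Module):
    """Learnable prompt tokens prepended to frozen patch tokens.

    One prompt set per frame position, so the prompts can encode temporal
    order as well as per-frame spatial context (assumed design).
    """

    def __init__(self, num_prompts: int = 8, dim: int = 768, num_frames: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_frames, num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_frames, num_patches, dim) from a frozen encoder
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1, -1)
        # Prepend the learnable prompts to each frame's patch tokens.
        return torch.cat([prompts, patch_tokens], dim=2)


class VisualGuidedLanguagePrompt(nn.Module):
    """Language prompts conditioned on a pooled visual feature.

    The text-encoder input thus adapts to the video content before the
    relation-name tokens are encoded (assumed design).
    """

    def __init__(self, num_prompts: int = 4, dim: int = 512):
        super().__init__()
        self.base_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self, class_token_embeds: torch.Tensor,
                visual_feat: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, num_word_tokens, dim)
        # visual_feat: (dim,) pooled feature for one object pair
        cond = self.visual_proj(visual_feat)                 # (dim,)
        prompts = self.base_prompts + cond                   # broadcast: (num_prompts, dim)
        prompts = prompts.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        # Prepend visual-conditioned prompts to every class-name embedding.
        return torch.cat([prompts, class_token_embeds], dim=1)


# Toy usage with random tensors standing in for frozen CLIP outputs.
vp = SpatioTemporalVisualPrompt(num_prompts=8, dim=768, num_frames=4)
lp = VisualGuidedLanguagePrompt(num_prompts=4, dim=512)

patch_tokens = torch.randn(2, 4, 196, 768)   # (batch, frames, patches, dim)
prompted_patches = vp(patch_tokens)          # -> (2, 4, 204, 768)

class_embeds = torch.randn(50, 6, 512)       # 50 relation names, 6 word tokens each
visual_feat = torch.randn(512)               # pooled object-pair feature
prompted_text = lp(class_embeds, visual_feat)  # -> (50, 10, 512)
```

In this kind of setup only the prompt parameters (and the small projection) are trained while the CLIP backbone stays frozen, which is what lets the model generalize to relationship categories unseen during fine-tuning.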
Published
2024-03-24
How to Cite
Yang, S., Wang, Y., Ji, X., & Wu, X. (2024). Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6513-6521. https://doi.org/10.1609/aaai.v38i7.28472
Issue
Vol. 38 No. 7 (2024)
Section
AAAI Technical Track on Computer Vision VI