Open-Vocabulary Video Relation Extraction

Authors

  • Wentao Tian, Fudan University
  • Zheng Wang, Zhejiang University of Technology
  • Yuqian Fu, Fudan University
  • Jingjing Chen, Fudan University
  • Lechao Cheng, Zhejiang Lab

DOI:

https://doi.org/10.1609/aaai.v38i6.28328

Keywords:

CV: Video Understanding & Activity Analysis

Abstract

A comprehensive understanding of videos is inseparable from describing actions together with their contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on the pairwise relations that take part in the action and describes these relation triplets in natural language. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE.
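The abstract mentions a cross-modal mapping model that decodes relation triplets as a single sequence. Below is a minimal PyTorch sketch of one common way such a mapping can work: pooled video features are projected into a fixed-length prefix of pseudo-token embeddings for an autoregressive language model, which is then trained to emit serialized triplets. The module sizes, the `VideoToPrefix` name, and the triplet serialization format are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: project video features into a language-model prefix so that
# relation triplets can be decoded as one token sequence. All dimensions and
# names here are assumptions for illustration, not the authors' method.
import torch
import torch.nn as nn

class VideoToPrefix(nn.Module):
    """Maps per-frame visual features to a fixed-length prefix of
    pseudo-token embeddings consumable by an autoregressive LM."""
    def __init__(self, visual_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, frame_feats):           # (B, T, visual_dim)
        pooled = frame_feats.mean(dim=1)      # temporal average pooling
        prefix = self.proj(pooled)            # (B, lm_dim * prefix_len)
        lm_dim = prefix.shape[-1] // self.prefix_len
        return prefix.view(-1, self.prefix_len, lm_dim)

# Usage: the prefix is prepended to the embedded target sequence, e.g. the
# serialized triplets "person hold guitar <sep> person sit_on chair", and the
# language model is trained with the standard next-token prediction loss.
frames = torch.randn(2, 16, 512)              # 2 videos, 16 frames each
prefix = VideoToPrefix()(frames)              # (2, 10, 768) pseudo-token prefix
print(prefix.shape)
```

Treating the triplets as one flat sequence lets an open-vocabulary decoder describe novel subjects, predicates, and objects without a fixed label set, which is the point of framing the task generatively.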

Published

2024-03-24

How to Cite

Tian, W., Wang, Z., Fu, Y., Chen, J., & Cheng, L. (2024). Open-Vocabulary Video Relation Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5215-5223. https://doi.org/10.1609/aaai.v38i6.28328

Section

AAAI Technical Track on Computer Vision V