Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Xiaoyang Liu; Boran Wen; Xinpeng Liu; Zizheng Zhou; Hongwei Fan; Cewu Lu; Lizhuang Ma; Yulong Chen; Yong-Lu Li

doi:10.1609/aaai.v39i6.32599

Authors

Xiaoyang Liu, Shanghai Jiao Tong University
Boran Wen, Shanghai Jiao Tong University
Xinpeng Liu, Shanghai Jiao Tong University; Shanghai Innovation Institute
Zizheng Zhou, Shanghai Jiao Tong University
Hongwei Fan, Peking University
Cewu Lu, Shanghai Jiao Tong University
Lizhuang Ma, Shanghai Jiao Tong University
Yulong Chen, Shanghai Jiao Tong University
Yong-Lu Li, Shanghai Jiao Tong University

Abstract

Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse; they usually provide only a limited set of predefined object classes. Therefore, we introduce a new open-world benchmark, Grounding Interacted Objects (GIO), which includes 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, we propose an object grounding task that expects vision systems to discover interacted objects. Although today’s detectors and grounding methods have achieved great success, they perform unsatisfactorily in localizing the diverse and rare objects in GIO. This reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. In extensive experiments, our method significantly outperforms current baselines.
