Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Authors

  • Xiaoyang Liu, Shanghai Jiao Tong University
  • Boran Wen, Shanghai Jiao Tong University
  • Xinpeng Liu, Shanghai Jiao Tong University; Shanghai Innovation Institute
  • Zizheng Zhou, Shanghai Jiao Tong University
  • Hongwei Fan, Peking University
  • Cewu Lu, Shanghai Jiao Tong University
  • Lizhuang Ma, Shanghai Jiao Tong University
  • Yulong Chen, Shanghai Jiao Tong University
  • Yong-Lu Li, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i6.32599

Abstract

Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims to detect HOIs in videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse: they usually provide only a limited set of predefined object classes. Therefore, we introduce a new open-world benchmark, Grounding Interacted Objects (GIO), which includes 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, we propose an object grounding task that expects vision systems to discover interacted objects. Although today’s detectors and grounding methods have achieved great success, they perform unsatisfactorily in localizing the diverse and rare objects in GIO. This reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. In extensive experiments, our method significantly outperforms current baselines.

Published

2025-04-11

How to Cite

Liu, X., Wen, B., Liu, X., Zhou, Z., Fan, H., Lu, C., … Li, Y.-L. (2025). Interacted Object Grounding in Spatio-Temporal Human-Object Interactions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5622–5630. https://doi.org/10.1609/aaai.v39i6.32599

Section

AAAI Technical Track on Computer Vision V