What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Yuxiao Wang; Yu Lei; Wolin Liang; Weiying Xue; Zhenao Wei; Nan Zhuang; Qi Liu

doi:10.1609/aaai.v40i21.38841

Authors

Yuxiao Wang, South China University of Technology
Yu Lei, Southwest Jiaotong University
Wolin Liang, South China University of Technology
Weiying Xue, South China University of Technology
Zhenao Wei, Suzhou City University
Nan Zhuang, Zhejiang University
Qi Liu, South China University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i21.38841

Abstract

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component.
