What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Authors

  • Yuxiao Wang, South China University of Technology
  • Yu Lei, Southwest Jiaotong University
  • Wolin Liang, South China University of Technology
  • Weiying Xue, South China University of Technology
  • Zhenao Wei, Suzhou City University
  • Nan Zhuang, Zhejiang University
  • Qi Liu, South China University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i21.38841

Abstract

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component.
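
To make the joint "what" and "where" output concrete, the following is a minimal sketch of how a single prediction for this task might be represented. It is not the authors' implementation: the class `ContactPrediction`, its fields, and the choice of integer indices and nested-list binary masks are all assumptions made for illustration. Only the three counts (654 actions, 80 object categories, 17 body parts) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Counts taken from the abstract; everything else below is hypothetical.
NUM_ACTIONS = 654
NUM_OBJECT_CATEGORIES = 80
NUM_BODY_PARTS = 17

Mask = List[List[int]]  # assumed format: rows of 0/1 pixels, 1 = contact


@dataclass
class ContactPrediction:
    """Hypothetical container pairing an action label with per-part contact masks."""

    action_id: int  # "what": index in [0, NUM_ACTIONS)
    # "where": body-part index in [0, NUM_BODY_PARTS) -> binary contact mask
    contact_masks: Dict[int, Mask] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject indices outside the ranges implied by the dataset statistics.
        if not 0 <= self.action_id < NUM_ACTIONS:
            raise ValueError(f"action_id {self.action_id} out of range")
        for part in self.contact_masks:
            if not 0 <= part < NUM_BODY_PARTS:
                raise ValueError(f"body part {part} out of range")

    def contact_parts(self) -> List[int]:
        """Return body parts whose mask contains at least one contact pixel."""
        return sorted(
            part
            for part, mask in self.contact_masks.items()
            if any(any(row) for row in mask)
        )

    def contact_area(self, part: int) -> int:
        """Count contact pixels for one body part (0 if the part has no mask)."""
        return sum(sum(row) for row in self.contact_masks.get(part, []))


# Usage: one action label plus masks for two (arbitrarily numbered) body parts.
pred = ContactPrediction(
    action_id=3,
    contact_masks={5: [[0, 1], [1, 1]], 9: [[0, 0], [0, 0]]},
)
print(pred.contact_parts())   # parts with any contact pixels
print(pred.contact_area(5))   # number of contact pixels for part 5
```

Keeping the action label and the per-part masks in one record mirrors the task's premise that the two should be predicted together rather than by separate models; the actual label encoding and mask format used by PaIR may differ.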

Published

2026-03-14

How to Cite

Wang, Y., Lei, Y., Liang, W., Xue, W., Wei, Z., Zhuang, N., & Liu, Q. (2026). What-Meets-Where: Unified Learning of Action and Contact Localization in Images. Proceedings of the AAAI Conference on Artificial Intelligence, 40(21), 17832–17840. https://doi.org/10.1609/aaai.v40i21.38841

Issue

Vol. 40 No. 21 (2026)

Section

AAAI Technical Track on Humans and AI