Weakly Supervised Multimodal Affordance Grounding for Egocentric Images

Authors

  • Lingjing Xu, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China
  • Yang Gao, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China
  • Wenfeng Song, Computer School, Beijing Information Science and Technology University, China
  • Aimin Hao, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China

DOI:

https://doi.org/10.1609/aaai.v38i6.28451

Keywords:

CV: Segmentation, CV: Multi-modal Vision

Abstract

To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images depicting human-object interactions, as well as from accompanying textual descriptions of the performed actions, and transfers this knowledge to egocentric images. To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, yielding valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that resemble the textual features defining affordances. We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach that trains with image-level labels only. Extensive experiments demonstrate the superiority of our proposed method over existing affordance grounding models in terms of evaluation metrics and visual results, and ablation experiments confirm the effectiveness of our approach. Code: https://github.com/xulingjing88/WSMA.
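To make the pixel-text fusion idea in the abstract concrete, below is a minimal sketch (not the authors' implementation; all class names and dimensions are hypothetical) of how per-pixel egocentric features could be compared against affordance text embeddings via cosine similarity, producing a heatmap whose pooled scores can be supervised with image-level labels only.

```python
# Hedged sketch of pixel-text similarity for weakly supervised affordance grounding.
# Assumes pixel features from some visual backbone and one text embedding per
# affordance phrase from some text encoder; none of this is the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelTextFusionSketch(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, embed_dim, kernel_size=1)  # project pixel features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)                 # project text features

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C_img, H, W) egocentric feature map
        # txt_feat: (K, C_txt), one embedding per affordance/action phrase
        pix = F.normalize(self.img_proj(img_feat), dim=1)    # (B, D, H, W)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)   # (K, D)
        # Cosine similarity between every pixel and every affordance text
        heatmaps = torch.einsum("bdhw,kd->bkhw", pix, txt)   # (B, K, H, W)
        # Spatial pooling gives image-level scores, so training needs only
        # image-level affordance labels (the weak supervision signal).
        logits = heatmaps.flatten(2).mean(-1)                # (B, K)
        return heatmaps, logits

# Usage with random placeholder features:
model = PixelTextFusionSketch()
img_feat = torch.randn(2, 512, 14, 14)
txt_feat = torch.randn(36, 512)
heatmaps, logits = model(img_feat, txt_feat)
```

At inference, the heatmap channel for the queried affordance would serve as the grounding map; the actual WSMA modules (HOI-Transfer and Pixel-Text Fusion) are described in the paper and released code linked above.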

Published

2024-03-24

How to Cite

Xu, L., Gao, Y., Song, W., & Hao, A. (2024). Weakly Supervised Multimodal Affordance Grounding for Egocentric Images. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6324-6332. https://doi.org/10.1609/aaai.v38i6.28451

Section

AAAI Technical Track on Computer Vision V