Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection


  • Weibo Jiang, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen
  • Weihong Ren, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen
  • Jiandong Tian, State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences
  • Liangqiong Qu, Department of Statistics and Actuarial Science, The University of Hong Kong
  • Zhiyong Wang, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen
  • Honghai Liu, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen




Keywords: CV: Scene Analysis & Understanding, CV: Video Understanding & Activity Analysis


Human-Object Interaction (HOI) detection plays a vital role in scene understanding; it aims to predict HOI triplets in the form of <Human, Object, Action>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and fuse them to predict HOI triplets directly. However, most of these methods focus on self-triplet aggregation and ignore potential cross-triplet dependencies, which leads to ambiguous action predictions. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, to aggregate self-triplet correlations, we regard each triplet proposal as a graph in which Human and Object are nodes and Action is the edge. We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. In addition, we leverage the CLIP model to help SCTC obtain interaction-aware features through knowledge distillation, which provides useful action cues for HOI detection. Extensive experiments on the HICO-DET and V-COCO datasets verify the effectiveness of the proposed SCTC.
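
The idea of treating each triplet proposal as a small graph can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Instance`, `TripletProposal`, and `build_proposals` are hypothetical, the boxes are made-up pixel coordinates, and the choice to let a person also serve as the object of another person's action is an assumption made for illustration. The sketch only shows the data structure (two nodes joined by one action edge) and how proposals might be enumerated from detections; it does not model the correlation or distillation steps.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Instance:
    """A detected instance (hypothetical structure, not from the paper)."""
    label: str   # e.g. "person", "bicycle"
    box: tuple   # (x1, y1, x2, y2) in pixels, illustrative values only


@dataclass
class TripletProposal:
    """One proposal viewed as a graph: two nodes and one edge."""
    human: Instance        # node
    obj: Instance          # node
    action_scores: dict    # edge: action name -> score, to be filled by a model


def build_proposals(detections):
    """Pair every detected person with every other detected instance."""
    humans = [d for d in detections if d.label == "person"]
    # Assumption: a person may also be the object of another person's action.
    return [
        TripletProposal(human=h, obj=o, action_scores={})
        for h, o in product(humans, detections)
        if h is not o
    ]


detections = [
    Instance("person", (10, 20, 110, 220)),
    Instance("person", (200, 30, 300, 230)),
    Instance("bicycle", (40, 120, 160, 240)),
]
proposals = build_proposals(detections)
```

With two people and one bicycle, each person is paired with the other person and with the bicycle. Cross-triplet reasoning of the kind the abstract describes would then operate over this set of proposals rather than over each one in isolation.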



How to Cite

Jiang, W., Ren, W., Tian, J., Qu, L., Wang, Z., & Liu, H. (2024). Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2543-2551. https://doi.org/10.1609/aaai.v38i3.28031



AAAI Technical Track on Computer Vision II