Learning Visual Context for Group Activity Recognition

Authors

  • Hangjie Yuan (College of Control Science and Engineering, Zhejiang University, Hangzhou, China)
  • Dong Ni (College of Control Science and Engineering, Zhejiang University, Hangzhou, China; State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China)

DOI:

https://doi.org/10.1609/aaai.v35i4.16437

Keywords:

Video Understanding & Activity Analysis, Scene Analysis & Understanding, Relational Learning, Visual Reasoning & Symbolic Representations

Abstract

Group activity recognition aims to recognize the overall activity in a multi-person scene. Previous methods focus on reasoning over individual features but under-explore person-specific contextual information, which is significant and informative in computer vision tasks. In this paper, we propose a new reasoning paradigm that incorporates global contextual information. Specifically, we propose two modules to bridge the gap between group activity and visual context. The first is the Transformer-based Context Encoding (TCE) module, which enhances individual representations by encoding global contextual information into individual features and refining the aggregated information. The second is the Spatial-Temporal Bilinear Pooling (STBiP) module, which first explores pairwise relationships among the context-encoded individual representations and then generates semantic representations via gated message passing on a constructed spatial-temporal graph. Building on these modules, we design a two-branch model that integrates them into a single pipeline. Systematic experiments demonstrate each module's effectiveness on either branch. Visualizations indicate that TCE can aggregate visual contextual cues globally. Moreover, our method achieves state-of-the-art results on two widely used benchmarks using only RGB images as input and 2D backbones.
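
The two ideas named in the abstract, encoding global context into individual features and scoring pairwise relationships with a bilinear form, can be illustrated with a generic NumPy sketch. This is not the authors' TCE or STBiP implementation: it assumes single-head scaled dot-product attention with a residual connection and a plain bilinear score, and it omits the refinement step, the temporal dimension, and gated message passing. The function names, feature sizes (12 persons, 49 context locations, 64 channels), and random inputs are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encode_context(person_feats, context_feats):
    """Attend from each person to every context location (illustrative only).

    person_feats:  (N, D) array, one feature vector per person.
    context_feats: (M, D) array, one feature vector per scene location.
    Returns the context-enhanced person features (N, D) and the
    attention weights (N, M).
    """
    d = person_feats.shape[-1]
    # Scaled dot-product similarity between persons and context locations.
    scores = person_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted sum of context features, added back as a residual.
    aggregated = weights @ context_feats
    return person_feats + aggregated, weights


def pairwise_bilinear(feats, W):
    """Bilinear relation scores s_ij = x_i^T W x_j for all person pairs."""
    return feats @ W @ feats.T


# Hypothetical sizes: 12 persons, a 7x7 context grid, 64 channels.
rng = np.random.default_rng(0)
persons = rng.normal(size=(12, 64))
context = rng.normal(size=(49, 64))

enhanced, attn = encode_context(persons, context)
relations = pairwise_bilinear(enhanced, rng.normal(size=(64, 64)) / 8.0)
print(enhanced.shape, attn.shape, relations.shape)
```

Each row of `attn` is a distribution over context locations, which is the kind of quantity one might visualize to see where a person's representation draws its context from.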

Published

2021-05-18

How to Cite

Yuan, H., & Ni, D. (2021). Learning Visual Context for Group Activity Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4), 3261-3269. https://doi.org/10.1609/aaai.v35i4.16437

Section

AAAI Technical Track on Computer Vision III