Learning Disentangled Classification and Localization Representations for Temporal Action Localization

Authors

  • Zixin Zhu, Xi'an Jiaotong University
  • Le Wang, Xi'an Jiaotong University
  • Wei Tang, University of Illinois at Chicago
  • Ziyi Liu, Wormpex AI Research
  • Nanning Zheng, Xi'an Jiaotong University
  • Gang Hua, Wormpex AI Research

DOI:

https://doi.org/10.1609/aaai.v36i3.20277

Keywords:

Computer Vision (CV)

Abstract

A common approach to Temporal Action Localization (TAL) is to generate action proposals and then perform action classification and localization on them. For each proposal, existing methods universally use a shared proposal-level representation for both tasks. However, our analysis indicates that this shared representation focuses on the most discriminative frames for classification, e.g., "take-offs" rather than "run-ups" in distinguishing "high jump" from "long jump", while frames most relevant to localization, such as the start and end frames of an action, are largely ignored. In other words, such a shared representation cannot handle both classification and localization well at the same time, which makes precise TAL difficult. To address this challenge, this paper disentangles the shared representation into classification and localization representations. The disentangled classification representation focuses on the most discriminative frames, and the disentangled localization representation focuses on the action phase as well as the action start and end. Our model consists of two sub-networks: the disentanglement network and the context-based aggregation network. The disentanglement network is an autoencoder that learns orthogonal hidden variables for classification and localization. The context-based aggregation network aggregates the classification and localization representations by modeling local and global contexts. We evaluate the proposed method on two popular TAL benchmarks, where it outperforms all state-of-the-art methods.
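To make the idea of orthogonal hidden variables concrete, the following is a minimal sketch of one possible orthogonality penalty between a classification vector and a localization vector. The abstract does not specify the loss actually used, so the squared-cosine form, the function name `orthogonality_penalty`, and the parameter `eps` are all illustrative assumptions rather than the authors' method.

```python
import math


def orthogonality_penalty(z_cls, z_loc, eps=1e-8):
    """Hypothetical penalty: squared cosine similarity of two hidden vectors.

    Returns 0.0 when the vectors are orthogonal and approaches 1.0 when
    they are parallel. This is an assumed formulation for illustration,
    not the loss from the paper.
    """
    dot = sum(a * b for a, b in zip(z_cls, z_loc))
    norm_cls = math.sqrt(sum(a * a for a in z_cls))
    norm_loc = math.sqrt(sum(b * b for b in z_loc))
    # eps guards against division by zero for all-zero vectors.
    cos = dot / (norm_cls * norm_loc + eps)
    return cos * cos
```

In a training setup, a term like this would typically be added to the autoencoder's reconstruction loss so that the two hidden variables are pushed toward encoding different information.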

Published

2022-06-28

How to Cite

Zhu, Z., Wang, L., Tang, W., Liu, Z., Zheng, N., & Hua, G. (2022). Learning Disentangled Classification and Localization Representations for Temporal Action Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 3644-3652. https://doi.org/10.1609/aaai.v36i3.20277

Section

AAAI Technical Track on Computer Vision III