Introducing Decomposed Causality with Spatiotemporal Object-Centric Representation for Video Classification

Yachong Zhang; Lei Meng; Shuo Xu; Zhuang Qi; Wei Wu; Lei Wu; Xiangxu Meng

doi:10.1609/aaai.v40i15.38286

Authors

Yachong Zhang, Shandong University
Lei Meng, Shandong University
Shuo Xu, Shandong University
Zhuang Qi, Shandong University
Wei Wu, Shandong University
Lei Wu, Shandong University
Xiangxu Meng, Shandong University

DOI:

https://doi.org/10.1609/aaai.v40i15.38286

Abstract

Video classification requires event-level representations of objects and their interactions. Existing methods are typically data-driven and learn such features either from whole frames or from object-centric visual regions. As a result, the spatiotemporal interactions among objects are usually left unmodeled. To address this issue, this paper presents the Decomposition of Synergistic, Unique, and Redundant Causal Representations Learning (SurdCRL) model for video classification, which introduces the newly proposed SURD causal theory to model the spatiotemporal features of both object dynamics and their in-frame and cross-frame interactions. SurdCRL uses three modules to model object-centric spatiotemporal dynamics with distinct types of causal components. First, the Spatial-Temporal Entity Modeling module decouples each frame into object and context entities and applies a temporal message passing block to capture object state changes over time, producing spatiotemporal features that serve as the basic causal variables. Second, the Dual-Path Causal Inference module mitigates confounders among the causal variables through front-door and back-door interventions, so that the subsequent causal components reflect their intrinsic effects. Finally, the Causal Composition and Selection module uses compositional structure-aware attention to project the causal variables and their high-order interactions into synergistic, unique, and redundant components. Experiments on two benchmark datasets verify that SurdCRL better captures event-relevant object-centric representations by decomposing spatiotemporal object interactions into these three types of causal components.
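The abstract does not say how the front-door and back-door interventions are computed inside the Dual-Path Causal Inference module. For orientation only, the sketch below evaluates the textbook back-door and front-door adjustment formulas on small discrete probability tables. It is an illustration of the general idea under stated assumptions, not the SurdCRL implementation: the function names, the variable roles (X as cause, Y as outcome, Z as an observed confounder, M as a mediator), and every number are hypothetical, and the model itself operates on learned video features rather than probability tables.

```python
# Textbook causal adjustment formulas on toy discrete tables.
# All names and numbers are hypothetical; this is not the paper's code.

def backdoor_adjust(p_z, p_y_given_xz, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z).

    p_z:           dict z -> P(Z=z)
    p_y_given_xz:  dict (x, z) -> P(Y=1 | X=x, Z=z)
    """
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in p_z)


def frontdoor_adjust(p_x, p_m_given_x, p_y_given_mx, x):
    """P(Y=1 | do(X=x)) = sum_m P(M=m | X=x) * sum_x' P(Y=1 | M=m, X=x') * P(X=x').

    p_x:           dict x -> P(X=x)
    p_m_given_x:   dict (m, x) -> P(M=m | X=x)
    p_y_given_mx:  dict (m, x) -> P(Y=1 | M=m, X=x)
    """
    mediator_values = {m for (m, _) in p_m_given_x}
    total = 0.0
    for m in mediator_values:
        # Average the outcome over the observed distribution of X,
        # which blocks the unobserved confounding path between X and Y.
        inner = sum(p_y_given_mx[(m, xp)] * p_x[xp] for xp in p_x)
        total += p_m_given_x[(m, x)] * inner
    return total


# Hypothetical toy tables with binary variables.
P_Z = {0: 0.5, 1: 0.5}
P_Y_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.6}

P_X = {0: 0.5, 1: 0.5}
P_M_GIVEN_X = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
P_Y_GIVEN_MX = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.6}

if __name__ == "__main__":
    print("back-door  P(Y=1 | do(X=1)):", backdoor_adjust(P_Z, P_Y_GIVEN_XZ, 1))
    print("front-door P(Y=1 | do(X=1)):",
          frontdoor_adjust(P_X, P_M_GIVEN_X, P_Y_GIVEN_MX, 1))
```

The back-door form requires that the confounder Z be observed, while the front-door form works through an observed mediator M when the confounder is hidden. Using both paths is one plausible reading of the "dual-path" name, though the abstract does not confirm this.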
