Introducing Decomposed Causality with Spatiotemporal Object-Centric Representation for Video Classification

Authors

  • Yachong Zhang, Shandong University
  • Lei Meng, Shandong University
  • Shuo Xu, Shandong University
  • Zhuang Qi, Shandong University
  • Wei Wu, Shandong University
  • Lei Wu, Shandong University
  • Xiangxu Meng, Shandong University

DOI:

https://doi.org/10.1609/aaai.v40i15.38286

Abstract

Video classification requires event-level representations of objects and their interactions. Existing methods typically rely on data-driven approaches that learn such features either from whole frames or from object-centric visual regions. As a result, the modeling of spatiotemporal interactions among objects is usually overlooked. To address this issue, this paper presents a Decomposition of Synergistic, Unique, and Redundant Causal Representations Learning (SurdCRL) model for video classification, which introduces the newly proposed SURD causal theory to model the spatiotemporal features of both object dynamics and their within-frame and cross-frame interactions. Specifically, SurdCRL employs three modules to model object-centric spatiotemporal dynamics using distinct types of causal components. First, the Spatial-Temporal Entity Modeling module decouples each frame into object and context entities and employs a temporal message passing block to capture object state changes over time, generating spatiotemporal features that serve as the basic causal variables. Second, the Dual-Path Causal Inference module mitigates confounders among the causal variables through front-door and back-door interventions, enabling the subsequent causal components to reflect their intrinsic effects. Finally, the Causal Composition and Selection module employs compositional structure-aware attention to project the causal variables and their high-order interactions into synergistic, unique, and redundant components. Experiments on two benchmark datasets verify that SurdCRL better captures event-relevant object-centric representations by decomposing spatiotemporal object interactions into three types of causal components.
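The back-door intervention mentioned in the abstract can be illustrated with the textbook adjustment formula, P(Y | do(X)) = Σ_z P(Y | X, z) P(z). The sketch below is not the authors' Dual-Path Causal Inference module, which operates on learned spatiotemporal features; it is a toy discrete example in which every variable name and probability is an assumption made up for illustration. It shows how adjusting for a confounder Z changes the estimated effect of X on Y compared with plain conditioning.

```python
# Toy binary variables: Z (confounder), X (cause), Y (effect).
# All probabilities below are hypothetical values chosen for illustration.
p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.5, (1, 1): 0.7}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y=1 | X=x): plain conditioning, which mixes in the confounder's influence."""
    num = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    den = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return num / den


def backdoor(x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


if __name__ == "__main__":
    print("observational difference:", observational(1) - observational(0))
    print("back-door adjusted effect:", backdoor(1) - backdoor(0))
```

In this toy setup the observational difference overstates the effect of X because Z raises both X and Y; the adjusted estimate averages over Z with its marginal distribution instead of its distribution conditioned on X.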

Published

2026-03-14

How to Cite

Zhang, Y., Meng, L., Xu, S., Qi, Z., Wu, W., Wu, L., & Meng, X. (2026). Introducing Decomposed Causality with Spatiotemporal Object-Centric Representation for Video Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12879–12887. https://doi.org/10.1609/aaai.v40i15.38286

Section

AAAI Technical Track on Computer Vision XII