Video-Audio Domain Generalization via Confounder Disentanglement
DOI: https://doi.org/10.1609/aaai.v37i12.26787
Keywords: General
Abstract
Existing video-audio understanding models are trained and evaluated in an intra-domain setting, and thus face performance degradation in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlations to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies the domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, including multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.
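The back-door adjustment the abstract refers to can be illustrated on a toy discrete model. The sketch below is not from the paper; all distributions and variable names are invented for illustration. It contrasts the confounded observational distribution P(Y|X) with the interventional P(Y|do(X)), which averages over the confounder's marginal P(Z) instead of its posterior P(Z|X):

```python
# Hedged sketch of back-door adjustment, the causal tool DeVADG applies
# to deconfound video-audio features. All numbers below are toy values,
# not from the paper.
import numpy as np

# Toy setting: confounder Z (e.g., domain), feature X, label Y.
# p_z[z]                 = P(Z=z)
# p_x_given_z[z, x]      = P(X=x | Z=z)
# p_y_given_xz[x, z, y]  = P(Y=y | X=x, Z=z)
p_z = np.array([0.7, 0.3])
p_x_given_z = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_y_given_xz = np.array([[[0.8, 0.2], [0.3, 0.7]],
                         [[0.6, 0.4], [0.1, 0.9]]])

def observational(x):
    """P(Y | X=x): confounded, weights Z by its posterior P(Z | X=x)."""
    joint = p_z * p_x_given_z[:, x]        # P(Z=z, X=x)
    post_z = joint / joint.sum()           # P(Z=z | X=x)
    return post_z @ p_y_given_xz[x]        # sum_z P(Y | x, z) P(z | x)

def backdoor(x):
    """P(Y | do(X=x)): blocks the back-door path by averaging over P(Z)."""
    return p_z @ p_y_given_xz[x]           # sum_z P(Y | x, z) P(z)

print(observational(0))  # differs from backdoor(0): Z confounds X and Y
print(backdoor(0))
```

The gap between the two outputs is exactly the spurious correlation induced by the confounder; DeVADG's contribution lies in estimating such confounders (at class- and domain-level) from multi-modal features rather than assuming them observed.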
Published
2023-06-26
How to Cite
Zhang, S., Feng, X., Fan, W., Fang, W., Feng, F., Ji, W., Li, S., Wang, L., Zhao, S., Zhao, Z., Chua, T.-S., & Wu, F. (2023). Video-Audio Domain Generalization via Confounder Disentanglement. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 15322-15330. https://doi.org/10.1609/aaai.v37i12.26787
Section
AAAI Special Track on Safe and Robust AI