Video-Audio Domain Generalization via Confounder Disentanglement

Authors

  • Shengyu Zhang, Zhejiang University
  • Xusheng Feng, University of Electronic Science and Technology of China
  • Wenyan Fan, Zhejiang University
  • Wenjing Fang, Ant Group
  • Fuli Feng, University of Science and Technology of China
  • Wei Ji, National University of Singapore
  • Shuo Li, National University of Singapore
  • Li Wang, Ant Group
  • Shanshan Zhao, The University of Sydney
  • Zhou Zhao, Zhejiang University
  • Tat-Seng Chua, National University of Singapore
  • Fei Wu, Zhejiang University; Shanghai AI Laboratory

DOI:

https://doi.org/10.1609/aaai.v37i12.26787

Keywords:

General

Abstract

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, and thus suffer performance degradation in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlations to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both the class level and the domain level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, including multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.
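
The abstract names two general-purpose causal tools, back-door adjustment and half-sibling regression. The sketch below is only a minimal illustration of those two ideas, not the authors' implementation: the toy feature shapes, the `confounder_dict`, the ridge penalty, and the linear classifier head are all illustrative assumptions, since the paper's architecture is not described on this page.

```python
import torch

def half_sibling_residual(x, siblings, lam=1e-3):
    """Remove the component of `x` that is predictable from `siblings`.

    Half-sibling regression: if `x` and `siblings` share a confounder but are
    otherwise independent, the residual x - E[x | siblings] approximates the
    deconfounded signal. Assumed shapes: x [N, d_x], siblings [N, d_s].
    """
    S = siblings
    gram = S.T @ S + lam * torch.eye(S.shape[1])   # ridge-regularized Gram matrix
    w = torch.linalg.solve(gram, S.T @ x)          # least-squares weights [d_s, d_x]
    return x - S @ w                               # residual = deconfounded features

def backdoor_adjusted_logits(feat, confounder_dict, head):
    """Approximate P(Y | do(X)) = sum_c P(Y | X, c) P(c) by averaging the
    classifier output over a (hypothetical) dictionary of confounder vectors,
    here assumed uniformly weighted, i.e. P(c) = 1 / |C|."""
    logits = []
    for c in confounder_dict:                      # c: one confounder vector [d_c]
        joint = torch.cat([feat, c.expand(feat.shape[0], -1)], dim=-1)
        logits.append(head(joint))
    return torch.stack(logits, dim=0).mean(dim=0)  # [N, num_classes]

if __name__ == "__main__":
    # Toy example: 8 clips, 128-d video features, 64-d audio features.
    video = torch.randn(8, 128)
    audio = torch.randn(8, 64)
    deconf_video = half_sibling_residual(video, audio)

    # Hypothetical confounder dictionary: 10 entries of dimension 32.
    confounders = torch.randn(10, 32)
    head = torch.nn.Linear(128 + 32, 10)           # 10 action classes
    preds = backdoor_adjusted_logits(deconf_video, confounders, head)
    print(preds.shape)                             # torch.Size([8, 10])
```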

Published

2023-06-26

How to Cite

Zhang, S., Feng, X., Fan, W., Fang, W., Feng, F., Ji, W., Li, S., Wang, L., Zhao, S., Zhao, Z., Chua, T.-S., & Wu, F. (2023). Video-Audio Domain Generalization via Confounder Disentanglement. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 15322-15330. https://doi.org/10.1609/aaai.v37i12.26787

Issue

Vol. 37 No. 12 (2023)

Section

AAAI Special Track on Safe and Robust AI