Counterfactual Debiasing for Physical Audiovisual Commonsense Reasoning

Authors

  • Daoming Zong, SenseTime Research
  • Chaoyue Ding, Fudan University
  • Kaitao Chen, Fudan University
  • Yinsheng Li, Fudan University
  • Shuaiyu Wang, Fudan University

DOI:

https://doi.org/10.1609/aaai.v39i14.33675

Abstract

Physical commonsense is an essential aspect of human cognition, involving an intuitive understanding of the physical properties and interactions of everyday objects and materials. Though physical commonsense reasoning should inherently be a multisensory task, integrating both video and audio signals, existing physical audiovisual commonsense reasoning (PACR) models predominantly rely on visual information. This reliance leads to spurious correlations and undermines the models’ reasoning and generalization abilities. To counteract this, we introduce a model-agnostic Counterfactual Physical Audiovisual Commonsense Reasoning (CF-PACR) framework aimed at mitigating visual bias-induced spurious effects. Specifically, we construct a traditional PACR model using both audio and visual information as the factual reasoning model. Subsequently, in the counterfactual reasoning model, we isolate visual information to estimate direct effects. Finally, we subtract the direct effects from the total effects across modalities to derive indirect effects, thereby mitigating visual biases. Extensive experiments validate the effectiveness and generalizability of CF-PACR in alleviating the spurious correlations between visual modality and model predictions.
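The subtraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the effects are computed, so the function name `debias_logits`, the scaling factor `alpha`, the choice to subtract in logit space, and the toy numbers are all assumptions made for illustration. The factual logits stand in for the total effect (audio plus video), and the visual-only logits stand in for the direct effect of the visual modality.

```python
import numpy as np


def debias_logits(factual_logits, visual_only_logits, alpha=1.0):
    """Subtract a visual-only (direct) effect from the audiovisual (total) effect.

    Hypothetical helper: `alpha` and logit-space subtraction are assumptions,
    not details taken from the paper.
    """
    factual_logits = np.asarray(factual_logits, dtype=float)
    visual_only_logits = np.asarray(visual_only_logits, dtype=float)
    return factual_logits - alpha * visual_only_logits


# Toy binary-choice example with made-up scores for one question.
# The factual model (audio + video) prefers answer 0 ...
factual = np.array([[2.0, 1.5]])
# ... but a visual-only branch already strongly prefers answer 0,
# which suggests the preference is driven mostly by visual bias.
visual_only = np.array([[1.8, 0.2]])

debiased = debias_logits(factual, visual_only)

pred_factual = int(factual.argmax(axis=-1)[0])
pred_debiased = int(debiased.argmax(axis=-1)[0])
print(pred_factual, pred_debiased)
```

In this toy case the prediction changes once the visual-only contribution is removed, which is the intended behaviour of a debiasing step of this kind. Whether the framework applies the subtraction to logits, probabilities, or another quantity is not stated in the abstract.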

Published

2025-04-11

How to Cite

Zong, D., Ding, C., Chen, K., Li, Y., & Wang, S. (2025). Counterfactual Debiasing for Physical Audiovisual Commonsense Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(14), 15265–15273. https://doi.org/10.1609/aaai.v39i14.33675

Section

AAAI Technical Track on Knowledge Representation and Reasoning