GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering
DOI: https://doi.org/10.1609/aaai.v39i18.34154
Abstract
Deep learning models with large-scale backbones have been increasingly adopted to tackle complex visual question answering (VQA) problems in real-world settings. While they provide the learning capacity needed to handle high-dimensional, multimodal VQA data, these models tend to suffer from a memorization effect that leads to overconfident predictions. This can significantly limit their applicability in critical domains (e.g., medicine, cyber-security, and public safety), where confidently wrong predictions may lead to severe consequences. In this work, we propose a novel low-rank network factorization that results in much better-calibrated networks. These low-rank factorized networks are then aggregated into an ensemble guided by a generalized focal loss to further improve overall performance and calibration. The overall framework, referred to as the Generalized focal Loss Ensemble of low-rank Networks (GLEN), is an important step toward developing well-calibrated VQA models. We theoretically demonstrate that the generalized focal loss provides a more balanced bias-variance trade-off, which is guaranteed to lower the confidence of incorrect predictions, resulting in improved calibration. Extensive experiments on benchmark datasets and comparisons across various VQA models show that GLEN leads to much better calibration on both in-distribution and out-of-distribution data without sacrificing VQA accuracy.
Published
2025-04-11
How to Cite
Mozaffari, M., Sapkota, H., & Yu, Q. (2025). GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 19563–19571. https://doi.org/10.1609/aaai.v39i18.34154
Issue
Vol. 39 No. 18 (2025)
Section
AAAI Technical Track on Machine Learning IV
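Illustrative sketch
The abstract names two ingredients, low-rank network factorization and a focal-style loss, without giving their definitions. The sketch below is not the GLEN method: it assumes a truncated SVD as a stand-in factorization and uses the standard focal loss rather than the paper's generalized variant. The function names `low_rank_factorize` and `focal_loss` are hypothetical and chosen for illustration only.

```python
import numpy as np


def low_rank_factorize(weight, rank):
    """Approximate an (m x n) weight matrix by A @ B using truncated SVD.

    Assumption: SVD truncation is a generic stand-in; the paper's own
    factorization scheme is not described in the abstract.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (m x rank), columns scaled by singular values
    b = vt[:rank, :]            # (rank x n)
    return a, b


def focal_loss(probs, labels, gamma=2.0):
    """Standard focal loss: mean of -(1 - p_t)^gamma * log(p_t).

    Assumption: this is the ordinary focal loss, not the generalized
    version proposed in the paper. gamma = 0 recovers cross-entropy.
    """
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    p_t = np.clip(p_t, 1e-12, 1.0)               # avoid log(0)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 32))
    a, b = low_rank_factorize(w, rank=8)
    # The factorized layer stores a.size + b.size values instead of w.size.
    print("parameters:", w.size, "->", a.size + b.size)

    probs = np.array([[0.9, 0.1], [0.4, 0.6]])
    labels = np.array([0, 0])
    print("cross-entropy:", focal_loss(probs, labels, gamma=0.0))
    print("focal (gamma=2):", focal_loss(probs, labels, gamma=2.0))
```

With gamma greater than zero, examples the model already predicts with high confidence contribute less to the loss, which is the general mechanism focal-style losses use to discourage overconfidence.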