Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization
DOI:
https://doi.org/10.1609/aaai.v39i15.33770
Abstract
Audio-Visual Learning (AVL) aims at perception using both audio and visual modalities. As with unimodal tasks, AVL suffers from data insufficiency in many applications. At the same time, AVL often needs to learn continuously over time rather than acquire all knowledge at once. Considering these two perspectives, our work focuses on benchmarking the unexplored task of Few-Shot Audio-Visual Class-Incremental Learning (FS-AVCIL), i.e., continually perceiving novel categories described by a limited number of labeled examples with audio and visual modalities. First, we provide a detailed task configuration together with a thorough analysis of the challenges in FS-AVCIL: (1) how to efficiently learn and fuse multimodal information with limited labeled examples; and (2) how to alleviate catastrophic forgetting of cross-modal semantic correlations with limited data. Then, we propose an efficient framework based on the Vision Transformer to solve FS-AVCIL. This framework contains two parts: temporal-residual prompting for an audio-visual synergy adapter, and temporal prompt regularization. Specifically, temporal-residual prompting is incorporated into the audio-visual adapter to efficiently fine-tune the pre-trained foundation model with limited data and to capture audio-visual correlation by learning temporal-relevant prompts. In addition, we regularize the temporal-relevant prompts to retain previous knowledge by fully exploiting temporal knowledge from various perspectives. The framework is validated on audio-visual classification tasks under the FS-AVCIL scenario, and extensive experiments demonstrate its superior performance.
Published
2025-04-11
How to Cite
Cui, Y., Liu, L., Yu, Z., Huang, G., & Hong, X. (2025). Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(15), 16118–16126. https://doi.org/10.1609/aaai.v39i15.33770
Section
AAAI Technical Track on Machine Learning I