Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization

Yawen Cui; Li Liu; Zitong Yu; Guanjie Huang; Xiaopeng Hong

doi:10.1609/aaai.v39i15.33770

Authors

Yawen Cui, Hong Kong Polytechnic University
Li Liu, The Hong Kong University of Science and Technology (Guangzhou)
Zitong Yu, Great Bay University
Guanjie Huang, The Hong Kong University of Science and Technology (Guangzhou)
Xiaopeng Hong, Harbin Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i15.33770

Abstract

Audio-Visual Learning (AVL) aims at perception using both the audio and visual modalities. Like unimodal tasks, AVL suffers from data insufficiency in many applications. At the same time, AVL models often need to learn continuously over time rather than acquire all knowledge at once. Considering these two perspectives, our work focuses on benchmarking the unexplored task of Few-Shot Audio-Visual Class-Incremental Learning (FS-AVCIL), i.e., continually perceiving novel categories described by a limited number of labeled examples with audio and visual modalities. First, we provide the detailed task configuration together with a thorough analysis of the challenges in FS-AVCIL: (1) how to efficiently learn and fuse multimodal information with limited labeled examples; and (2) how to alleviate catastrophic forgetting of cross-modal semantic correlations with limited data. Then, we propose an efficient framework based on the Vision Transformer to solve FS-AVCIL. This framework contains two parts: temporal-residual prompting for the audio-visual synergy adapter, and temporal prompt regularization. Specifically, temporal-residual prompting is incorporated into the audio-visual adapter to efficiently fine-tune the pre-trained foundation model with limited data and to capture audio-visual correlation by learning temporal-relevant prompts. In addition, we regularize the temporal-relevant prompts to memorize previous knowledge by fully using the temporal knowledge from various perspectives. The framework is validated on audio-visual classification tasks under the FS-AVCIL scenario, and extensive experiments demonstrate its superior performance.
