Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

Ziwei Liu; Borui Kang; Wei Li; Hangjie Yuan; Yanbing Yang; Wenbin Li; Yifan Zhu; Tao Feng; Jun Luo

doi:10.1609/aaai.v40i28.39580

Authors

Ziwei Liu College of Computer Science, Sichuan University, China; College of Computing and Data Science, Nanyang Technological University, Singapore
Borui Kang School of Computer Science, Nanjing University, China
Wei Li College of Computer Science, Sichuan University, China
Hangjie Yuan DAMO Academy, Alibaba Group
Yanbing Yang College of Computer Science, Sichuan University, China
Wenbin Li School of Computer Science, Nanjing University, China
Yifan Zhu College of Computer Science, Beijing University of Posts and Telecommunications, China
Tao Feng Department of Computer Science and Technology, Tsinghua University, China
Jun Luo College of Computing and Data Science, Nanyang Technological University, Singapore

Abstract

Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies enables these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first show that naively applying ZO optimization to the full model is incompatible with VLCL because it destabilizes the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise, across various training units, to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during ZO optimization. Based on this, we propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains the vision modality perturbation to further improve performance. Benefiting from ZO optimization, PEFT-based VLCL is better able to escape local minima during optimization. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
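
The abstract names three ingredients: a zeroth-order gradient estimate, gradient sign normalization, and a constrained perturbation for the vision modality. The sketch below illustrates how these could fit together using a generic two-point (SPSA-style) estimator on a toy quadratic loss. It is not the authors' implementation: the function names, the `"vision"`/`"language"` parameter groups, the perturbation scales, and the learning rate are all assumptions chosen for illustration.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def zo_sign_step(params, loss_fn, eps, lr, rng):
    """One two-point zeroth-order step with sign-normalized updates.

    params: dict mapping a group name to a list of floats (hypothetical groups)
    eps:    dict mapping a group name to its perturbation scale
    """
    # Sample one Gaussian direction per parameter.
    z = {k: [rng.gauss(0.0, 1.0) for _ in v] for k, v in params.items()}
    # Perturb each group with its own scale (assumed smaller for vision).
    plus = {k: [p + eps[k] * d for p, d in zip(v, z[k])] for k, v in params.items()}
    minus = {k: [p - eps[k] * d for p, d in zip(v, z[k])] for k, v in params.items()}
    # Only forward evaluations are needed; no backpropagation.
    diff = loss_fn(plus) - loss_fn(minus)
    # The estimated gradient for a coordinate is proportional to diff * z_i;
    # sign normalization keeps only its direction.
    return {
        k: [p - lr * sign(diff * d) for p, d in zip(v, z[k])]
        for k, v in params.items()
    }


def toy_loss(params):
    """Toy stand-in for a training loss: sum of squares over all groups."""
    return sum(p * p for group in params.values() for p in group)


def run_demo(steps=3000, seed=0):
    rng = random.Random(seed)
    params = {"vision": [2.0, -1.5], "language": [1.0, -3.0]}
    eps = {"vision": 5e-4, "language": 1e-3}  # assumed values
    initial = toy_loss(params)
    for _ in range(steps):
        params = zo_sign_step(params, toy_loss, eps, lr=0.01, rng=rng)
    return initial, toy_loss(params)
```

Calling `run_demo()` returns the loss before and after the updates. Because every coordinate moves by a fixed `lr` per step, the iterate settles into a small neighborhood of the minimum rather than converging exactly, which is the usual trade-off of sign-based updates.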
