Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

Authors

  • Ziwei Liu, College of Computer Science, Sichuan University, China; College of Computing and Data Science, Nanyang Technological University, Singapore
  • Borui Kang, School of Computer Science, Nanjing University, China
  • Wei Li, College of Computer Science, Sichuan University, China
  • Hangjie Yuan, DAMO Academy, Alibaba Group
  • Yanbing Yang, College of Computer Science, Sichuan University, China
  • Wenbin Li, School of Computer Science, Nanjing University, China
  • Yifan Zhu, College of Computer Science, Beijing University of Posts and Telecommunications, China
  • Tao Feng, Department of Computer Science and Technology, Tsinghua University, China
  • Jun Luo, College of Computing and Data Science, Nanyang Technological University, Singapore

DOI:

https://doi.org/10.1609/aaai.v40i28.39580

Abstract

Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies enables these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify that naive full-ZO adoption is incompatible with VLCL because it destabilizes the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise, across various training units, to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during ZO optimization. We therefore propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains the vision modality perturbation to further improve performance. With ZO optimization, PEFT-based VLCL is better able to escape local minima during optimization, and extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
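
To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a generic two-point (SPSA-style) zeroth-order update with sign normalization and a per-group perturbation scale. It is an illustration under stated assumptions, not the authors' implementation: the function name `zo_sign_step`, the group names `"vision"` and `"text"`, the toy quadratic loss, and all numeric values are hypothetical, and the smaller perturbation scale for the vision group only gestures at the idea of constraining the vision modality perturbation.

```python
import numpy as np


def zo_sign_step(loss_fn, params, eps, lr, rng):
    """One two-point zeroth-order step with sign-normalized updates.

    params: dict mapping a group name to an np.ndarray of parameters.
    eps:    dict mapping a group name to its perturbation scale (assumed).
    """
    # Sample one random Gaussian direction per parameter group.
    z = {k: rng.standard_normal(v.shape) for k, v in params.items()}
    plus = {k: v + eps[k] * z[k] for k, v in params.items()}
    minus = {k: v - eps[k] * z[k] for k, v in params.items()}

    # Two forward evaluations only; no backpropagation is needed.
    diff = loss_fn(plus) - loss_fn(minus)

    new_params = {}
    for k, v in params.items():
        # Finite-difference gradient estimate for this group.
        g = diff / (2.0 * eps[k]) * z[k]
        # Sign normalization: every coordinate moves by exactly lr.
        new_params[k] = v - lr * np.sign(g)
    return new_params


def toy_loss(p):
    # Hypothetical stand-in for a task loss: sum of squared parameters.
    return sum(float(np.sum(v ** 2)) for v in p.values())


rng = np.random.default_rng(0)
params = {"vision": np.ones(4), "text": np.ones(4)}
# Assumed values: a smaller perturbation scale for the vision group.
eps = {"vision": 1e-4, "text": 1e-3}
updated = zo_sign_step(toy_loss, params, eps, lr=0.01, rng=rng)
```

Because the update uses only the sign of the estimate, its step size is independent of the magnitude of the finite difference, which is one common way to damp the variance of zeroth-order estimates.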

Published

2026-03-14

How to Cite

Liu, Z., Kang, B., Li, W., Yuan, H., Yang, Y., Li, W., … Luo, J. (2026). Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(28), 24026–24034. https://doi.org/10.1609/aaai.v40i28.39580

Section

AAAI Technical Track on Machine Learning V