Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
DOI:
https://doi.org/10.1609/aaai.v40i28.39580
Abstract
Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies enables these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify that naive full-ZO adoption is incompatible with VLCL because it destabilizes the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise, across various training units, to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during ZO optimization. Accordingly, we propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. With ZO optimization, PEFT-based VLCL is better able to escape local minima during optimization. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
Published
2026-03-14
How to Cite
Liu, Z., Kang, B., Li, W., Yuan, H., Yang, Y., Li, W., … Luo, J. (2026). Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(28), 24026–24034. https://doi.org/10.1609/aaai.v40i28.39580
Issue
Section
AAAI Technical Track on Machine Learning V