COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

Authors

  • Canming Xia — School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, China; Peng Cheng Laboratory, China
  • Peixi Peng — School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China; Peng Cheng Laboratory, China
  • Guang Tan — School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, China
  • Zhan Su — School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
  • Haoran Xu — School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, China; Peng Cheng Laboratory, China
  • Zhenxian Liu — National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, China
  • Luntong Li — Peng Cheng Laboratory, China

DOI:

https://doi.org/10.1609/aaai.v40i32.39915

Abstract

Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to the RL agent, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables mutual enhancement of the VLM and the RL policy. Specifically, COVR fine-tunes the VLM on RL-generated data to strengthen semantic reasoning aligned with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves training stability by quantifying the inconsistency of sampled actions via the RL return signal. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
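The two modules named in the abstract can be illustrated with a minimal sketch. All function names, the quantile-based threshold, and the softmax weighting below are illustrative assumptions, not the paper's actual implementation: the filter keeps only samples whose exploration score clears an adaptive threshold, and the loss-weighting step maps episode returns to per-sample weights that down-weight low-return (inconsistent) actions.

```python
import numpy as np

def dynamic_filter(samples, exploration_scores, quantile=0.7):
    """Exploration-driven filtering (sketch): keep samples whose
    exploration score exceeds an adaptive threshold, here taken as
    a quantile of the current batch's scores. The quantile value
    is a hypothetical hyperparameter, not from the paper."""
    threshold = np.quantile(exploration_scores, quantile)
    kept = [s for s, e in zip(samples, exploration_scores) if e >= threshold]
    return kept, threshold

def return_aware_weights(returns, temperature=1.0):
    """Return-aware loss weighting (sketch): convert episode returns
    into normalized per-sample fine-tuning weights via a softmax,
    so low-return samples contribute less to the VLM loss."""
    r = np.asarray(returns, dtype=float)
    w = np.exp((r - r.max()) / temperature)  # shift for numerical stability
    return w / w.sum()
```

Under this sketch, the RL loop would pass each batch of interaction data through `dynamic_filter` before VLM fine-tuning, and scale each retained sample's loss term by its `return_aware_weights` entry.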

Published

2026-03-14

How to Cite

Xia, C., Peng, P., Tan, G., Su, Z., Xu, H., Liu, Z., & Li, L. (2026). COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27019–27027. https://doi.org/10.1609/aaai.v40i32.39915

Section

AAAI Technical Track on Machine Learning IX