Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

Authors

  • Xiuxiu Qi The College of Artificial Intelligence & Shenzhen Research Institute, Nankai University, Tianjin, China Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
  • Yu Yang Centre for Learning, Teaching and Technology, The Education University of Hong Kong, Hong Kong SAR, China
  • Jiannong Cao Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
  • Luyao Bai Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
  • Chongshan Fan The College of Artificial Intelligence & Shenzhen Research Institute, Nankai University, Tianjin, China
  • Chengtai Cao Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
  • Hongpeng Wang The College of Artificial Intelligence & Shenzhen Research Institute, Nankai University, Tianjin, China

DOI:

https://doi.org/10.1609/aaai.v40i29.39677

Abstract

Language-Conditioned Manipulation (LCM) facilitates human-robot interaction via Behavioral Cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation learning, or temporal abstraction; however, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. CCoL generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (i.e., robot internal states). Meanwhile, it anchors language semantics to visuomotor representations via bidirectional cross-attention, learning the contextual information needed for action generation and thereby resolving semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain on human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization to unseen and noisy object states.
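
For readers who want a concrete picture of the bidirectional cross-attention the abstract describes, the sketch below shows one way such a block could be wired up in PyTorch: language tokens attend over fused visuomotor features and vice versa, with residual connections preserving each modality. All module names, dimensions, and design choices here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a bidirectional cross-attention block, assuming
# pre-computed language-token and visuomotor feature sequences.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical block anchoring language semantics to visuomotor features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Language queries attend over visuomotor keys/values, and vice versa.
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_lang = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor):
        # lang: (B, L_text, dim) instruction-token embeddings
        # vis:  (B, L_vis, dim)  fused vision + proprioception features
        lang_ctx, _ = self.lang_to_vis(query=lang, key=vis, value=vis)
        vis_ctx, _ = self.vis_to_lang(query=vis, key=lang, value=lang)
        # Residual connections keep each modality's original content.
        return self.norm_lang(lang + lang_ctx), self.norm_vis(vis + vis_ctx)

# Example usage: ground an instruction in visuomotor context before a policy head.
block = BidirectionalCrossAttention(dim=256, num_heads=8)
lang = torch.randn(2, 16, 256)  # e.g., 16 instruction tokens
vis = torch.randn(2, 64, 256)   # e.g., 64 visual/proprioceptive tokens
lang_out, vis_out = block(lang, vis)
```

The two-directional exchange is what distinguishes this from standard language-conditioned policies, where only one modality queries the other; grounding flows both ways, which is one plausible reading of how the paper addresses semantic-physical misalignment.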

Published

2026-03-14

How to Cite

Qi, X., Yang, Y., Cao, J., Bai, L., Fan, C., Cao, C., & Wang, H. (2026). Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(29), 24900-24908. https://doi.org/10.1609/aaai.v40i29.39677

Issue

Vol. 40 No. 29 (2026)

Section

AAAI Technical Track on Machine Learning VI