LatentVLA: Taming Latent Space for Generalizable and Long-Horizon Bimanual Manipulation

Authors

  • Junming Wang, University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i22.38926

Abstract

Current paradigms for robotic imitation learning face a stark trade-off between the motion fidelity of diffusion models and the data scalability of inverse dynamics models. The latter, while scalable, often learn a latent action space disconnected from physical reality. This flaw leads to a critical failure mode, temporal entanglement, in which the model cannot distinguish between visually similar states that require distinct actions (e.g., a gripper approaching versus receding from an object). This ambiguity, compounded by discretization artifacts and sensitivity to task-irrelevant dynamics, renders robust planning infeasible. We introduce LatentVLA, a vision-language-action framework designed to overcome these limitations by learning a continuous and spatiotemporally grounded latent action representation. Its progressive three-stage architecture first employs a Temporal-Attentive Latent Action Model (TA-LAM) to resolve ambiguities using language-guided attention and explicit temporal encoding. Subsequently, a Latent Action Diffusion Transformer (LADT) performs planning via diffusion directly within this continuous latent space, preserving motion fidelity without tokenization. Finally, an expert policy head translates these latent plans into precise robot actions. Experiments show that LatentVLA sets a new state of the art across a suite of real-world bimanual tasks, outperforming prior methods and demonstrating superior zero-shot generalization and few-shot efficiency.

Published

2026-03-14

How to Cite

Wang, J. (2026). LatentVLA: Taming Latent Space for Generalizable and Long-Horizon Bimanual Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18593–18601. https://doi.org/10.1609/aaai.v40i22.38926

Section

AAAI Technical Track on Intelligent Robotics