Object-Centric Latent Action Learning
DOI:
https://doi.org/10.1609/aaai.v40i27.39423
Abstract
Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that operates on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle the agent's movement from distracting background dynamics. This allows LAPO to focus on task-relevant interactions, yielding more robust proxy action labels, which in turn enable better imitation learning and efficient adaptation of the agent with only a few action-labeled trajectories. We evaluate our method on eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
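The idea of learning latent actions from object slots rather than pixels can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the architecture, so everything here is an assumption made for illustration, including the slot shapes, the hand-picked `keep` indices for task-relevant slots, the linear maps `w_idm` and `w_fdm`, and the `tanh` bottleneck. The sketch assumes the common latent-action setup in which an inverse dynamics model infers a latent action from two consecutive observations and a forward dynamics model must predict the next observation from the current one plus that latent.

```python
import numpy as np

def select_slots(slots, keep):
    """Keep only the slots assumed to be task-relevant and flatten them."""
    return slots[keep].reshape(-1)

def latent_action_loss(slots_t, slots_next, keep, w_idm, w_fdm):
    """One forward pass of a toy latent-action objective on object slots.

    All names and shapes are hypothetical; real models would be deep
    networks trained by gradient descent, not fixed linear maps.
    """
    s_t = select_slots(slots_t, keep)
    s_next = select_slots(slots_next, keep)
    # Inverse dynamics: infer a small latent action from the transition.
    z = np.tanh(w_idm @ np.concatenate([s_t, s_next]))
    # Forward dynamics: predict the next slots from current slots and z.
    pred_next = w_fdm @ np.concatenate([s_t, z])
    # Reconstruction error that training would minimize.
    loss = float(np.mean((pred_next - s_next) ** 2))
    return z, loss

rng = np.random.default_rng(0)
num_slots, slot_dim, latent_dim = 4, 8, 3
keep = np.array([0, 1])  # assumption: slots 0 and 1 track the agent
in_dim = len(keep) * slot_dim

w_idm = rng.normal(scale=0.1, size=(latent_dim, 2 * in_dim))
w_fdm = rng.normal(scale=0.1, size=(in_dim, in_dim + latent_dim))
slots_t = rng.normal(size=(num_slots, slot_dim))
slots_next = rng.normal(size=(num_slots, slot_dim))

z, loss = latent_action_loss(slots_t, slots_next, keep, w_idm, w_fdm)
print(z.shape, loss)
```

In this toy setting, changing the slots outside `keep` (the stand-in for background distractors) leaves both the latent action and the loss unchanged, which is the property the object-centric approach aims for. How the task-relevant slots are actually identified is not stated in the abstract and is simply fixed by hand here.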
Published
2026-03-14
How to Cite
Klepach, A., Nikulin, A., Zisman, I., Tarasov, D., Derevyagin, A., Polubarov, A., … Kurenkov, V. (2026). Object-Centric Latent Action Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22626–22634. https://doi.org/10.1609/aaai.v40i27.39423
Section
AAAI Technical Track on Machine Learning IV