Object-Centric Latent Action Learning

Authors

  • Albina Klepach (dunnolab.ai)
  • Alexander Nikulin (dunnolab.ai; Moscow State University)
  • Ilya Zisman (dunnolab.ai)
  • Denis Tarasov (dunnolab.ai)
  • Alexander Derevyagin (dunnolab.ai; Higher School of Economics)
  • Andrei Polubarov (dunnolab.ai)
  • Nikita Lyubaykin (dunnolab.ai; Innopolis University)
  • Igor Kiselev (Accenture)
  • Vladislav Kurenkov (dunnolab.ai; Innopolis University)

DOI:

https://doi.org/10.1609/aaai.v40i27.39423

Abstract

Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although the recent latent action policy optimization (LAPO) approach has shown promise in inferring proxy action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle the agent's movement from distracting background dynamics. This allows LAPO to focus on task-relevant interactions, yielding more robust proxy action labels that enable better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluate our method on eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).

Published

2026-03-14

How to Cite

Klepach, A., Nikulin, A., Zisman, I., Tarasov, D., Derevyagin, A., Polubarov, A., … Kurenkov, V. (2026). Object-Centric Latent Action Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22626–22634. https://doi.org/10.1609/aaai.v40i27.39423

Issue

Vol. 40 No. 27 (2026)

Section

AAAI Technical Track on Machine Learning IV