How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Authors

  • Bo Peng (University of Science and Technology of China; Alibaba Group)
  • Pi Bu (Alibaba Group)
  • Keyu Pan (Independent Researcher)
  • Xinrun Xu (Institute of Software, Chinese Academy of Sciences; Alibaba Group)
  • Yingxiu Zhao (Alibaba Group)
  • Miao Chen (Alibaba Group)
  • Yang Du (Alibaba Group)
  • Lin Li (Alibaba Group)
  • Jun Song (Alibaba Group)
  • Tong Xu (University of Science and Technology of China)

DOI:

https://doi.org/10.1609/aaai.v40i10.37781

Abstract

Recent advances in vision–language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on high-level commands or discretised action spaces, "non-native" settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks and lack joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first designs three representative high-level tasks in complex scenarios to evaluate overall performance. For a more detailed and comprehensive performance analysis, we further decouple the entangled skills behind complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill. This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on leading VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these bottlenecks severely constrain performance on high-level tasks. NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insights for the future development of this field.

Published

2026-03-14

How to Cite

Peng, B., Bu, P., Pan, K., Xu, X., Zhao, Y., Chen, M., … Xu, T. (2026). How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8322–8330. https://doi.org/10.1609/aaai.v40i10.37781

Section

AAAI Technical Track on Computer Vision VII