How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng; Pi Bu; Keyu Pan; Xinrun Xu; Yingxiu Zhao; Miao Chen; Yang Du; Lin Li; Jun Song; Tong Xu

doi:10.1609/aaai.v40i10.37781

Authors

Bo Peng (University of Science and Technology of China; Alibaba Group)
Pi Bu (Alibaba Group)
Keyu Pan (Independent Researcher)
Xinrun Xu (Institute of Software, Chinese Academy of Sciences; Alibaba Group)
Yingxiu Zhao (Alibaba Group)
Miao Chen (Alibaba Group)
Yang Du (Alibaba Group)
Lin Li (Alibaba Group)
Jun Song (Alibaba Group)
Tong Xu (University of Science and Technology of China)

DOI: https://doi.org/10.1609/aaai.v40i10.37781

Abstract

Recent advances in vision–language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on high-level commands or discretised action spaces, "non-native" settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks and lack joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first defines three representative high-level tasks in complex scenarios to evaluate overall performance. For a more detailed and comprehensive performance analysis, we further decouple the entangled skills behind complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill. This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on state-of-the-art VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these bottlenecks severely constrain performance on high-level tasks. NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insight for the future development of this field.
