CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots

Authors

  • Xiao Zhao Tencent AD Lab
  • Chang Liu Tencent AD Lab
  • Ruiteng Ji Tencent AD Lab
  • Zheyuan Zhang Tencent AD Lab
  • Mingxu Zhu Tencent AD Lab
  • Linna Song Tencent AD Lab
  • Zhe Ren Tencent AD Lab
  • Luo Qingliang Tencent AD Lab
  • YuHang Gao Tencent AD Lab
  • Zhaolong Du Tencent AD Lab
  • Chufan Guo Tencent AD Lab
  • Kuifeng Su Tencent AD Lab

DOI:

https://doi.org/10.1609/aaai.v40i43.40980

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable potential in embodied navigation tasks. However, existing robot-centric datasets primarily focus on traditional 3D tasks such as perception and prediction, lacking adequate support for vision-language tasks. Vision-language navigation (VLN) is a key capability for achieving human-like and interpretable navigation in complex environments. In this study, we present CoT-VLNBench, the first large-scale benchmark and dataset designed for chain-of-thought (CoT) reasoning in quadruped robot navigation. Our dataset encompasses a diverse range of indoor and outdoor scenes, multi-step navigation trajectories, and rich natural language instructions, all annotated with fine-grained CoT reasoning traces. Specifically, it contains 175K frames, 5.25M 3D bounding boxes, and 875K visual question answering (VQA) pairs. This comprehensive resource enables thorough evaluation of embodied agents' perceptual and step-by-step reasoning abilities. Furthermore, we propose CoT-VLN, a state-of-the-art 7B VLN model that integrates visual, linguistic, and reasoning modules to enable interpretable and effective navigation. Extensive experiments demonstrate that our approach significantly outperforms existing non-VLM baselines on the new benchmark, underscoring the importance of CoT-VLN in embodied navigation. We hope that CoT-VLNBench will serve as a valuable resource to advance research at the intersection of robotics, vision, language, and reasoning.

Published

2026-03-14

How to Cite

Zhao, X., Liu, C., Ji, R., Zhang, Z., Zhu, M., Song, L., … Su, K. (2026). CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots. Proceedings of the AAAI Conference on Artificial Intelligence, 40(43), 36573–36581. https://doi.org/10.1609/aaai.v40i43.40980

Section

AAAI Technical Track on Planning, Routing, and Scheduling