CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots

Authors

  • Xiao Zhao Tencent AD Lab
  • Chang Liu Tencent AD Lab
  • Ruiteng Ji Tencent AD Lab
  • Zheyuan Zhang Tencent AD Lab
  • Mingxu Zhu Tencent AD Lab
  • Linna Song Tencent AD Lab
  • Zhe Ren Tencent AD Lab
  • Luo Qingliang Tencent AD Lab
  • YuHang Gao Tencent AD Lab
  • Zhaolong Du Tencent AD Lab
  • Chufan Guo Tencent AD Lab
  • Kuifeng Su Tencent AD Lab

DOI:

https://doi.org/10.1609/aaai.v40i43.40980

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable potential in embodied navigation tasks. However, existing robot-centric datasets primarily focus on traditional 3D tasks such as perception and prediction, lacking adequate support for vision-language tasks. Vision-language navigation (VLN) is a key capability for achieving human-like and interpretable navigation in complex environments. In this study, we present CoT-VLNBench, the first large-scale benchmark and dataset designed for chain-of-thought (CoT) reasoning in quadruped robot navigation. Our dataset encompasses a diverse range of indoor and outdoor scenes, multi-step navigation trajectories, and rich natural language instructions, all annotated with fine-grained CoT reasoning traces. Specifically, it contains 175K frames, 5.25M 3D bounding boxes, and 875K visual question answering (VQA) pairs. This comprehensive resource enables thorough evaluation of embodied agents' perceptual and step-by-step reasoning abilities. Furthermore, we propose CoT-VLN, a state-of-the-art 7B VLN model that integrates visual, linguistic, and reasoning modules to enable interpretable and effective navigation. Extensive experiments demonstrate that our approach significantly outperforms existing non-VLM baselines on the new benchmark, underscoring the importance of CoT-VLN in embodied navigation. We hope that CoT-VLNBench will serve as a valuable resource to advance research at the intersection of robotics, vision, language, and reasoning.

Published

2026-03-14

How to Cite

Zhao, X., Liu, C., Ji, R., Zhang, Z., Zhu, M., Song, L., … Su, K. (2026). CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots. Proceedings of the AAAI Conference on Artificial Intelligence, 40(43), 36573–36581. https://doi.org/10.1609/aaai.v40i43.40980

Section

AAAI Technical Track on Planning, Routing, and Scheduling