TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model

Authors

  • Xiang Li, Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
  • Ya-Li Li, Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
  • Yuan Wang, Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
  • Huaqiang Wang, Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
  • Shengjin Wang, Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China

DOI:

https://doi.org/10.1609/aaai.v40i8.37577

Abstract

Recent vision-language-action (VLA) models have demonstrated impressive generalization in robotic manipulation. However, these models typically map visual and linguistic inputs directly to subsequent actions, without intermediate task planning or the ability to detect and recover from failures. As a result, they cannot effectively decompose complex tasks, recognize problems, or correct erroneous actions, which ultimately results in complete task failure and significantly limits both their long-horizon performance and their generalization. To address this, we introduce TCoT (Trajectory Chain-of-Thought), a unified VLA framework that augments this direct mapping with trajectory planning and with failure detection and recovery. TCoT uses hierarchical trajectories as a precise and compact representation of CoT reasoning for manipulation: global planning provides a high-level, goal-oriented trajectory that guides the robot toward its task objective, while local planning makes real-time adjustments in response to dynamic changes. Moreover, we design a Global-Local Switching Recovery algorithm that detects failures and effectively recovers from them. Experimental results show that TCoT surpasses state-of-the-art methods in both real and simulated scenarios and exhibits superior generalization.
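
The abstract does not specify how global planning, local planning, and failure recovery interact, so the following is only a toy sketch of one possible reading, not the authors' algorithm. It assumes a 2D point robot, a straight-line global planner, a fixed-step local planner, and a failure test based on distance to the next waypoint. All names (plan_global, step_local, run_episode) and all thresholds are hypothetical.

```python
import math

def plan_global(start, goal, spacing=0.2):
    """Hypothetical global planner: straight-line waypoints at most `spacing` apart."""
    dist = math.hypot(goal[0] - start[0], goal[1] - start[1])
    n = max(1, math.ceil(dist / spacing))
    pts = [(start[0] + (goal[0] - start[0]) * i / n,
            start[1] + (goal[1] - start[1]) * i / n) for i in range(1, n)]
    pts.append(goal)  # end exactly on the goal
    return pts

def step_local(pos, waypoint, max_step=0.1):
    """Hypothetical local planner: move at most `max_step` toward the waypoint."""
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return waypoint
    return (pos[0] + dx / dist * max_step, pos[1] + dy / dist * max_step)

def run_episode(start, goal, disturbances=None, fail_threshold=0.3, max_iters=200):
    """Follow the global trajectory locally; on detected failure, switch back
    to global planning from the current position (assumed recovery rule)."""
    disturbances = disturbances or {}
    pos, waypoints, replans = start, plan_global(start, goal), 0
    for t in range(max_iters):
        if t in disturbances:  # simulated external push on the robot
            pos = (pos[0] + disturbances[t][0], pos[1] + disturbances[t][1])
        if math.hypot(goal[0] - pos[0], goal[1] - pos[1]) <= 1e-9:
            return pos, replans, True
        # Assumed failure test: no waypoints left, or next waypoint too far away.
        failed = (not waypoints or
                  math.hypot(waypoints[0][0] - pos[0],
                             waypoints[0][1] - pos[1]) > fail_threshold)
        if failed:
            waypoints = plan_global(pos, goal)
            replans += 1
        pos = step_local(pos, waypoints[0])
        if pos == waypoints[0]:  # waypoint reached, advance along the trajectory
            waypoints.pop(0)
    return pos, replans, False

if __name__ == "__main__":
    print(run_episode((0.0, 0.0), (1.0, 0.0)))
    print(run_episode((0.0, 0.0), (1.0, 0.0), disturbances={3: (0.0, 0.5)}))
```

In this sketch the global plan is computed once and reused until the failure test fires, so an undisturbed run never replans, while a push that moves the robot well off the trajectory triggers exactly one switch back to global planning. A real VLA system would replace each of these hand-written pieces with learned components.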

Published

2026-03-14

How to Cite

Li, X., Li, Y.-L., Wang, Y., Wang, H., & Wang, S. (2026). TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6486–6494. https://doi.org/10.1609/aaai.v40i8.37577

Issue

Vol. 40 No. 8 (2026)

Section

AAAI Technical Track on Computer Vision V