Tree-Structured Trajectory Encoding for Vision-and-Language Navigation
DOI:
https://doi.org/10.1609/aaai.v37i3.25494
Keywords:
CV: Language and Vision, CV: Multi-modal Vision
Abstract
Over the past few years, research on vision-and-language navigation (VLN) has made tremendous progress. Many previous works have attempted to improve performance from different aspects, such as training strategies, data augmentation, and pre-training. This work focuses on a rarely explored aspect of VLN, namely how the trajectory is organized and encoded during navigation. Most existing state-of-the-art VLN models adopt a vanilla sequential strategy for encoding trajectories. Such a strategy treats the whole trajectory as a single sequence when estimating the current state, regardless of whether the agent moved smoothly or made mistakes and backtracked along the way. We show that sequential encoding may largely lose this fine-grained structure in the trajectory, which can hamper later state estimation and decision making. To solve this problem, we propose a novel tree-structured trajectory encoding strategy. The whole trajectory is organized as a tree rooted at the starting position and encoded with our Tree-Transformer module to fully extract the fine-grained historical information. Moreover, since the spatial topology can be easily embedded in the trajectory tree, we further design a tree-based action space that allows the agent to make long-range error corrections in a single decision. We implement the holistic agent on top of a cross-modal transformer and train it with a newly proposed Tree-nDTW reward. On the benchmark R2R dataset, our model achieves a superior success rate (SR) of 68% on val-unseen and 66% on test. We further conduct extensive ablation studies and analyses to provide more insight into the effectiveness of our designs.
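To make the idea concrete, the following is a minimal Python sketch, not the authors' code, of how a navigation trajectory can be organized as a tree rooted at the start viewpoint, with revisited viewpoints folded back into existing nodes rather than appended to a flat sequence. The class names (TrajectoryTree, TreeNode) and the action_candidates helper are illustrative assumptions; the Tree-Transformer encoder and the Tree-nDTW reward are not shown here.

from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TreeNode:
    viewpoint_id: str                       # identifier of the visited viewpoint
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)


class TrajectoryTree:
    """Trajectory organized as a tree rooted at the starting position."""

    def __init__(self, start_viewpoint: str):
        self.root = TreeNode(start_viewpoint)
        self.current = self.root
        self._nodes: Dict[str, TreeNode] = {start_viewpoint: self.root}

    def step(self, viewpoint_id: str) -> TreeNode:
        """Record one navigation step."""
        if viewpoint_id in self._nodes:
            # Backtracking / revisiting: return to the existing node instead of
            # appending a duplicate, so the mistaken branch stays visible.
            self.current = self._nodes[viewpoint_id]
        else:
            child = TreeNode(viewpoint_id, parent=self.current)
            self.current.children.append(child)
            self._nodes[viewpoint_id] = child
            self.current = child
        return self.current

    def action_candidates(self) -> List[TreeNode]:
        # A tree-based action space can expose every tree node as a jump
        # target, enabling long-range error correction in one decision.
        return list(self._nodes.values())

In this simplified view, a flat sequential encoder would see the same list of steps whether or not the agent backtracked, whereas the tree preserves which moves led to dead ends and which branch the agent is currently on.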
Published
2023-06-26
How to Cite
Zhou, X., & Mu, Y. (2023). Tree-Structured Trajectory Encoding for Vision-and-Language Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 3814-3824. https://doi.org/10.1609/aaai.v37i3.25494
Issue
Vol. 37 No. 3 (2023)
Section
AAAI Technical Track on Computer Vision III