DGV: Fusing Dynamic Graphs and Vision-Language Models for Collaborative Dual-Arm Task Planning
DOI:
https://doi.org/10.1609/icaps.v36i1.42895
Abstract
Dual-arm collaborative manipulation in dynamic, unstructured environments is profoundly challenging, requiring real-time handling of high-dimensional physical constraints alongside dynamic scene understanding and adaptation to high-level natural language instructions. To address these challenges, we propose the Dynamic Graph Vision-Language Model (DGV), a novel dynamic task planning framework that integrates graph neural networks (GNNs) with vision-language models (VLMs). It first leverages a pre-trained VLM to integrate perceptual and semantic processing, accurately extracting object states and complex manipulation intents from the environment. This extracted information is then encoded into a dynamic spatio-temporal graph that models the robot's kinematic structure, environmental object relations, and temporal dependencies within a single, unified representation. We propose a real-time local subgraph update mechanism designed to cope with rapid environmental changes. This mechanism ensures immediate action adjustments and efficient replanning based on fresh visual feedback, dramatically improving dynamic adaptability. Utilizing the updated graph structure, DGV performs efficient reasoning to generate continuous, stable, and robust dual-arm collaborative motion sequences. Our experimental results across both simulation and real-world robot platforms demonstrate that DGV achieves a task success rate nearly 20% higher than current state-of-the-art methods, while exhibiting superior performance in dynamic adaptability and robustness.
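The abstract does not specify how the local subgraph update is implemented. As a rough illustration of the general idea only, the sketch below keeps a small scene graph, writes new observations into it, and returns the k-hop neighborhood of the nodes that actually changed, so that only that region would need replanning. Every name here (`SceneGraph`, `local_update`, the `pos` feature, the `k` and `tol` parameters, and the example nodes) is a hypothetical choice for this illustration and is not taken from the paper; the real DGV graph also carries kinematic and temporal structure that this toy omits.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): node id -> features, undirected adjacency."""
    nodes: dict = field(default_factory=dict)
    adj: dict = field(default_factory=dict)

    def add_node(self, node_id, **features):
        self.nodes[node_id] = dict(features)
        self.adj.setdefault(node_id, set())

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def k_hop(self, seeds, k):
        """Return all nodes within k hops of any seed, via breadth-first search."""
        seen = set(seeds)
        frontier = deque((s, 0) for s in seeds)
        while frontier:
            node, depth = frontier.popleft()
            if depth == k:
                continue  # do not expand beyond k hops
            for nbr in self.adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, depth + 1))
        return seen


def local_update(graph, observations, k=1, tol=1e-3):
    """Store changed observations and return the affected subgraph's node set.

    A node counts as changed when any coordinate of its observed position
    differs from the stored one by more than `tol` (an assumed threshold).
    """
    changed = []
    for node_id, new_pos in observations.items():
        old_pos = graph.nodes[node_id]["pos"]
        if max(abs(a - b) for a, b in zip(old_pos, new_pos)) > tol:
            graph.nodes[node_id]["pos"] = tuple(new_pos)
            changed.append(node_id)
    return graph.k_hop(changed, k) if changed else set()


# Hypothetical scene: two grippers, a cup, and a tray.
g = SceneGraph()
g.add_node("left_gripper", pos=(0.0, 0.3, 0.5))
g.add_node("right_gripper", pos=(0.0, -0.3, 0.5))
g.add_node("cup", pos=(0.4, 0.2, 0.1))
g.add_node("tray", pos=(0.4, -0.2, 0.1))
g.add_edge("left_gripper", "cup")
g.add_edge("right_gripper", "tray")
g.add_edge("cup", "tray")

# The cup moved; the tray observation is unchanged.
affected = local_update(g, {"cup": (0.45, 0.2, 0.1), "tray": (0.4, -0.2, 0.1)})
print(sorted(affected))  # ['cup', 'left_gripper', 'tray']
```

In this toy scene the right gripper lies two hops from the moved cup, so with `k=1` it falls outside the affected region and would not be re-evaluated. Limiting work to a small neighborhood is what makes a local update cheaper than rebuilding the whole graph.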
Published
2026-06-08
How to Cite
Pang, Y., Xu, J., Qiao, Z., Du, P., & Zhang, X. (2026). DGV: Fusing Dynamic Graphs and Vision-Language Models for Collaborative Dual-Arm Task Planning. Proceedings of the International Conference on Automated Planning and Scheduling, 36(1), 747–756. https://doi.org/10.1609/icaps.v36i1.42895