Vision-Language Models for Robot Success Detection


Fiona Luo, University of Pennsylvania, Philadelphia, PA



Keywords: Vision-Language Models, Robotics, Multimodal, Policy Learning, Language Models, Multimodal Machine Learning, Reinforcement Learning, Large Language Models (LLM)


In this work, we use Vision-Language Models (VLMs) as binary success detectors that, given a robot observation and a task description, decide whether the task was completed, formulating success detection as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of in-distribution test trajectories suffice to train an accurate detector, transferring a trained detector between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. Looking ahead, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and a success detector can provide a sparse binary reward signal to improve existing policies.
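The VQA formulation above can be sketched in a few lines: phrase success detection as a yes/no question about the observation, query a VLM, and map its answer to a binary label. This is a minimal illustrative sketch, not the paper's implementation; `query_vlm` is a hypothetical stand-in for a real model call (e.g. a fine-tuned MiniGPT-4) and is mocked here on a toy dict "observation" so the example is self-contained.

```python
def build_vqa_prompt(task_description: str) -> str:
    """Phrase success detection as a yes/no visual question."""
    return (
        f"The robot was asked to: {task_description}. "
        "Based on the image, has the task been completed successfully? "
        "Answer yes or no."
    )


def query_vlm(image, prompt: str) -> str:
    # Placeholder: a real implementation would pass the observation
    # image and the prompt to a fine-tuned VLM and return its text
    # answer. Mocked here against a dict of scene facts.
    return "yes" if image.get("object_in_target_zone") else "no"


def detect_success(image, task_description: str) -> bool:
    """Binary success label parsed from the VLM's yes/no answer."""
    answer = query_vlm(image, build_vqa_prompt(task_description))
    return answer.strip().lower().startswith("yes")


# Toy "observation": a dict of scene facts standing in for an image.
obs = {"object_in_target_zone": True}
print(detect_success(obs, "put the carrot in the pot"))  # True
```

The binary output can then serve directly as the sparse reward mentioned above, e.g. `reward = 1.0 if detect_success(obs, task) else 0.0` at the end of an episode.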




How to Cite

Luo, F. (2024). Vision-Language Models for Robot Success Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23750-23752.