Vision-Language Models for Robot Success Detection


Fiona Luo, University of Pennsylvania, Philadelphia, PA



Keywords: Vision-Language Models, Robotics, Multimodal, Policy Learning, Language Models, Multimodal Machine Learning, Reinforcement Learning, Large Language Models (LLM)


In this work, we use Vision-Language Models (VLMs) as binary success detectors that, given a robot observation and a task description, decide whether the task was completed, formulating success detection as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of in-distribution test trajectories suffice to train an accurate detector, transferring a trained detector between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. Looking ahead, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and a success detector can provide a sparse binary reward signal to improve existing policies.
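The VQA formulation above can be sketched in a few lines: phrase success detection as a yes/no question about the observation, query a VLM, and map its answer to a binary label. This is a minimal illustrative sketch, not the paper's implementation; `query_vlm` is a hypothetical stand-in for a real model call (e.g. a fine-tuned MiniGPT-4) and is mocked here on a toy dict "observation" so the example is self-contained.

```python
def build_vqa_prompt(task_description: str) -> str:
    """Phrase success detection as a yes/no visual question."""
    return (
        f"The robot was asked to: {task_description}. "
        "Based on the image, has the task been completed successfully? "
        "Answer yes or no."
    )


def query_vlm(image, prompt: str) -> str:
    # Placeholder: a real implementation would pass the observation
    # image and the prompt to a fine-tuned VLM and return its text
    # answer. Mocked here against a dict of scene facts.
    return "yes" if image.get("object_in_target_zone") else "no"


def detect_success(image, task_description: str) -> bool:
    """Binary success label parsed from the VLM's yes/no answer."""
    answer = query_vlm(image, build_vqa_prompt(task_description))
    return answer.strip().lower().startswith("yes")


# Toy "observation": a dict of scene facts standing in for an image.
obs = {"object_in_target_zone": True}
print(detect_success(obs, "put the carrot in the pot"))  # True
```

The binary output can then serve directly as the sparse reward mentioned above, e.g. `reward = 1.0 if detect_success(obs, task) else 0.0` at the end of an episode.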




How to Cite

Luo, F. (2024). Vision-Language Models for Robot Success Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23750-23752.