Vision-Language Models for Robot Success Detection


Fiona Luo, University of Pennsylvania, Philadelphia, PA



Keywords: Vision-Language Models, Robotics, Multimodal, Policy Learning, Language Models, Multimodal Machine Learning, Reinforcement Learning, Large Language Models (LLM)


In this work, we use Vision-Language Models (VLMs) as binary success detectors that, given a robot observation and a task description, decide whether the task was completed, formulating success detection as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of in-distribution test trajectories suffice to train an accurate detector, transferring a trained detector between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. Looking ahead, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and a success detector can provide a sparse binary reward signal to improve existing policies.
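The VQA formulation above can be sketched in a few lines: phrase success detection as a yes/no question about the observation, query a VLM, and map its answer to a binary label. This is a minimal illustrative sketch, not the paper's implementation; `query_vlm` is a hypothetical stand-in for a real model call (e.g. a fine-tuned MiniGPT-4) and is mocked here on a toy dict "observation" so the example is self-contained.

```python
def build_vqa_prompt(task_description: str) -> str:
    """Phrase success detection as a yes/no visual question."""
    return (
        f"The robot was asked to: {task_description}. "
        "Based on the image, has the task been completed successfully? "
        "Answer yes or no."
    )


def query_vlm(image, prompt: str) -> str:
    # Placeholder: a real implementation would pass the observation
    # image and the prompt to a fine-tuned VLM and return its text
    # answer. Mocked here against a dict of scene facts.
    return "yes" if image.get("object_in_target_zone") else "no"


def detect_success(image, task_description: str) -> bool:
    """Binary success label parsed from the VLM's yes/no answer."""
    answer = query_vlm(image, build_vqa_prompt(task_description))
    return answer.strip().lower().startswith("yes")


# Toy "observation": a dict of scene facts standing in for an image.
obs = {"object_in_target_zone": True}
print(detect_success(obs, "put the carrot in the pot"))  # True
```

The binary output can then serve directly as the sparse reward mentioned above, e.g. `reward = 1.0 if detect_success(obs, task) else 0.0` at the end of an episode.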




How to Cite

Luo, F. (2024). Vision-Language Models for Robot Success Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23750-23752.