Tools Identification By On-Board Adaptation of Vision-and-Language Models

Jun Hu; Phil Miller; Michael Lomnitz; Saurabh Farkya; Emre Yilmaz; Aswin Raghavan; David Zhang; Michael Piacentino

doi:10.1609/aaai.v38i21.30569

Authors

Jun Hu, SRI International
Phil Miller, SRI International
Michael Lomnitz, SRI International
Saurabh Farkya, SRI International
Emre Yilmaz, SRI International
Aswin Raghavan, SRI International
David Zhang, SRI International
Michael Piacentino, SRI International

DOI:

https://doi.org/10.1609/aaai.v38i21.30569

Keywords:

Artificial Intelligence, AI platforms and applications for edge computing and Internet of Things, Human-AI interaction (including Human-robot interaction), Intelligent collaborative systems

Abstract

A robotic workshop assistant has been a long-standing grand challenge for robotics, speech, computer vision, and artificial intelligence (AI) research. We revisit the goal of visually identifying tools from human queries in the current era of large vision-and-language models (such as GPT-4). We find that current off-the-shelf models, which are trained on internet images, cannot overcome the domain shift and fail to identify small, obscure tools in cluttered environments. Furthermore, these models cannot match tools to their intended purpose or affordances. We present a novel system for online domain adaptation that runs directly on a small on-board processor. The system uses Hyperdimensional Computing (HD), a fast and efficient neuromorphic method. We adapted CLIP to work with explicit queries ("I need the hammer") and implicit, purpose-driven queries ("Drive these nails"), and even with depth images as input. This demo lets the user try out various real tools and interact via free-form audio.
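
The abstract does not give implementation details, so the following is only a minimal sketch of how a hyperdimensional classifier can be adapted online on top of fixed feature vectors. It is not the authors' system. The class name HDClassifier, the sign-of-random-projection encoding, the hypervector size, and the toy 8-dimensional features (standing in for CLIP embeddings) are all assumptions made for illustration.

```python
import numpy as np


class HDClassifier:
    """Toy hyperdimensional classifier over fixed-length feature vectors.

    Hypothetical illustration: the features stand in for embeddings from a
    frozen vision-and-language encoder such as CLIP.
    """

    def __init__(self, feature_dim, hd_dim=4096, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection that lifts features into a high-dimensional space.
        self.projection = rng.standard_normal((hd_dim, feature_dim))
        self.hd_dim = hd_dim
        self.prototypes = {}  # label -> bundled (summed) class hypervector

    def encode(self, features):
        # Bipolar hypervector: the sign of each projected component.
        hv = np.sign(self.projection @ np.asarray(features, dtype=float))
        hv[hv == 0] = 1.0  # break ties so every component is +1 or -1
        return hv

    def update(self, features, label):
        # Online adaptation: bundle (add) the new example into its class prototype.
        proto = self.prototypes.setdefault(label, np.zeros(self.hd_dim))
        proto += self.encode(features)

    def predict(self, features):
        # Return the label whose prototype has the highest cosine similarity.
        hv = self.encode(features)

        def cosine(proto):
            return float(hv @ proto / (np.linalg.norm(hv) * np.linalg.norm(proto)))

        return max(self.prototypes, key=lambda label: cosine(self.prototypes[label]))


if __name__ == "__main__":
    clf = HDClassifier(feature_dim=8)
    # One labeled example per tool; the vectors are made-up placeholders.
    clf.update([1, 0, 0, 0, 0, 0, 0, 0], "hammer")
    clf.update([0, 1, 0, 0, 0, 0, 0, 0], "wrench")
    print(clf.predict([0.9, 0.1, 0, 0, 0, 0, 0, 0]))
```

In this sketch, adding a new tool or correcting a mistake costs one vector addition and no gradient computation, which is the general reason such methods are attractive for small on-board processors.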
