What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos
DOI:
https://doi.org/10.1609/aaai.v40i10.37799
Abstract
Object state understanding aims at recognizing the co-occurrence and transitions of multiple object states in videos. While learning from videos handles seen object states well, it struggles with novel ones. We address this task in a zero-shot setting by extracting state-specific knowledge from pre-trained models and using Vision-Language Models (VLMs) to verify whether such knowledge is visually grounded in videos. However, the extracted knowledge varies in its ability to distinguish states, and VLM observations are not always trustworthy. To address these issues, we propose a trust-aware knowledge-guided method that models knowledge trustworthiness and emphasizes highly discriminative knowledge that VLMs can reliably observe. Specifically, we collect spatial knowledge for each object state from retrieved images and cues generated by a Large Language Model, then use VLMs to vote on each knowledge element by scoring its visual consistency with the video. Beyond individual scenes, we also capture temporal dependencies of object states across scenes using a generative VLM. Under these spatial and temporal constraints, an adaptive knowledge refinement module iteratively updates knowledge reliability weights to reach a global consensus on object state inference across the video. Finally, object states are inferred by combining the refined weights with the VLM voting results. Experiments on two datasets demonstrate the effectiveness of our method.
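The iterative refinement described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual formulation: the voting matrix, the agreement measure, and the update rule below are simplifying assumptions. Each row of `votes` is one knowledge element's VLM consistency score for each candidate state; reliability weights are repeatedly re-estimated from each element's agreement with the current weighted consensus.

```python
import numpy as np

def infer_states(votes: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """Trust-weighted state inference (illustrative sketch only).

    votes: (K, S) array of VLM consistency scores in [0, 1] for
           K knowledge elements over S candidate object states.
    Returns an (S,) array of refined per-state scores.
    """
    K, _ = votes.shape
    w = np.full(K, 1.0 / K)              # uniform initial reliability weights
    for _ in range(n_iters):
        consensus = w @ votes            # (S,) weighted consensus over states
        agreement = votes @ consensus    # how well each element matches it
        agreement = np.maximum(agreement, 1e-8)
        w = agreement / agreement.sum()  # renormalized reliability weights
    return w @ votes                     # final weight-combined voting result

# Two elements favor state 0; one noisy element favors state 1.
scores = infer_states(np.array([[0.9, 0.1],
                                [0.8, 0.2],
                                [0.2, 0.9]]))
```

In this toy run the noisy element's weight shrinks because it disagrees with the consensus, so state 0 ends up with the highest refined score; the paper additionally constrains this consensus with cross-scene temporal dependencies.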
Published
2026-03-14
How to Cite
Qi, Y., & Wu, X. (2026). What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8484-8492. https://doi.org/10.1609/aaai.v40i10.37799
Issue
Section
AAAI Technical Track on Computer Vision VII