What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos
DOI:
https://doi.org/10.1609/aaai.v40i10.37799
Abstract
Object state understanding aims at recognizing the co-occurrence and transitions of multiple object states in videos. While learning from videos handles seen object states well, it struggles with novel ones. We address this task in a zero-shot setting by extracting state-specific knowledge from pre-trained models and using Vision-Language Models (VLMs) to verify whether such knowledge is visually grounded in videos. However, the extracted knowledge varies in its ability to distinguish states, and VLM observations are not always trustworthy. To address these issues, we propose a trust-aware knowledge-guided method that models knowledge trustworthiness and emphasizes highly discriminative knowledge that VLMs can reliably observe. Specifically, we collect spatial knowledge for each object state from retrieved images and cues generated by a Large Language Model, then use VLMs to vote on each knowledge element by scoring its visual consistency with the video. Beyond individual scenes, we also capture temporal dependencies of object states across scenes using a generative VLM. Under these spatial and temporal constraints, an adaptive knowledge refinement module iteratively updates knowledge reliability weights to reach a global consensus on object state inference across the video. Finally, object states are inferred by combining the refined weights with the VLM voting results. Experiments on two datasets demonstrate the effectiveness of our method.
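The iterative refinement described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual formulation: the voting matrix, the agreement measure, and the update rule below are simplifying assumptions. Each row of `votes` is one knowledge element's VLM consistency score for each candidate state; reliability weights are repeatedly re-estimated from each element's agreement with the current weighted consensus.

```python
import numpy as np

def infer_states(votes: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """Trust-weighted state inference (illustrative sketch only).

    votes: (K, S) array of VLM consistency scores in [0, 1] for
           K knowledge elements over S candidate object states.
    Returns an (S,) array of refined per-state scores.
    """
    K, _ = votes.shape
    w = np.full(K, 1.0 / K)              # uniform initial reliability weights
    for _ in range(n_iters):
        consensus = w @ votes            # (S,) weighted consensus over states
        agreement = votes @ consensus    # how well each element matches it
        agreement = np.maximum(agreement, 1e-8)
        w = agreement / agreement.sum()  # renormalized reliability weights
    return w @ votes                     # final weight-combined voting result

# Two elements favor state 0; one noisy element favors state 1.
scores = infer_states(np.array([[0.9, 0.1],
                                [0.8, 0.2],
                                [0.2, 0.9]]))
```

In this toy run the noisy element's weight shrinks because it disagrees with the consensus, so state 0 ends up with the highest refined score; the paper additionally constrains this consensus with cross-scene temporal dependencies.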
Published
2026-03-14
How to Cite
Qi, Y., & Wu, X. (2026). What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8484-8492. https://doi.org/10.1609/aaai.v40i10.37799
Issue
Section
AAAI Technical Track on Computer Vision VII