What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos

Authors

  • Yayun Qi, Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
  • Xinxiao Wu, Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

DOI:

https://doi.org/10.1609/aaai.v40i10.37799

Abstract

Object state understanding aims to recognize the co-occurrence and transitions of multiple object states in videos. While learning from videos handles seen object states well, it struggles with novel ones. We address this task in a zero-shot setting by extracting state-specific knowledge from pre-trained models and using Vision-Language Models (VLMs) to verify whether that knowledge is visually grounded in a given video. However, the extracted knowledge varies in how well it discriminates between states, and VLM observations are not always trustworthy. To address this issue, we propose a trust-aware knowledge-guided method that models knowledge trustworthiness and emphasizes highly discriminative knowledge that VLMs can reliably observe. Specifically, we collect spatial knowledge for each object state from retrieved images and from cues generated by a Large Language Model, then use VLMs to vote on each knowledge element by scoring its visual consistency with the video. Beyond individual scenes, we also capture temporal dependencies of object states across scenes using a generative VLM. Under these spatial and temporal constraints, an adaptive knowledge refinement module iteratively updates the reliability weight of each knowledge element to reach a global consensus on object states across the video. Finally, object states are inferred by combining the refined weights with the VLM voting results. Experiments on two datasets demonstrate the effectiveness of our method.
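The abstract's core loop, iteratively re-weighting knowledge elements by their agreement with a weighted consensus and then inferring states from the weighted votes, can be sketched as follows. This is a hypothetical illustration of the general idea, not the authors' implementation: the vote matrix, the exponential re-weighting rule, and the `temp` parameter are all assumptions made for the sketch.

```python
import numpy as np

def refine_and_infer(votes, n_iters=10, temp=0.5):
    """Trust-aware weighted voting (illustrative sketch, not the paper's code).

    votes : (K, S) array of VLM consistency scores, one row per knowledge
            element, one column per candidate object state.
    Returns (weights, state): per-element reliability weights and the
    index of the inferred state.
    """
    K, _ = votes.shape
    w = np.full(K, 1.0 / K)            # start from uniform trust
    for _ in range(n_iters):
        consensus = w @ votes          # reliability-weighted aggregate, shape (S,)
        agreement = votes @ consensus  # how well each element matches consensus
        w = np.exp(agreement / temp)   # reward agreeing elements
        w /= w.sum()                   # renormalize weights to a distribution
    state = int(np.argmax(w @ votes))  # infer state from refined weighted votes
    return w, state
```

For example, with two knowledge elements voting for state 1 and one noisy element voting for state 0, the refinement drives the noisy element's weight down and the weighted vote settles on state 1.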

Published

2026-03-14

How to Cite

Qi, Y., & Wu, X. (2026). What to Trust? A Trust-aware Knowledge-guided Method for Zero-shot Object State Understanding in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8484-8492. https://doi.org/10.1609/aaai.v40i10.37799

Section

AAAI Technical Track on Computer Vision VII