Knowledge-Enhanced Explainable Prompting for Vision-Language Models

Yequan Bie; Andong Tan; Zhixuan Chen; Zhiyuan Cai; Luyang Luo; Hao Chen

doi:10.1609/aaai.v40i4.37233

Authors

Yequan Bie, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Andong Tan, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Zhixuan Chen, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Zhiyuan Cai, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Luyang Luo, Department of Biomedical Informatics, Harvard University
Hao Chen, Department of Computer Science and Engineering, Hong Kong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i4.37233

Abstract

Large-scale vision-language models (VLMs) embedded with expansive representations and visual concepts have showcased significant potential in image and text understanding. Efficiently adapting VLMs such as CLIP to downstream tasks like few-shot image classification has garnered growing attention, with prompt learning emerging as a representative approach. However, most existing prompt-based adaptation methods, which rely solely on coarse-grained textual prompts, suffer from limited performance and interpretability when handling domain tasks that require specific knowledge. This results in a failure to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios like healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation process of VLMs across various domains and image modalities. By incorporating retrieval-augmented generation and domain foundation models, our framework can provide more reliable image-wise knowledge for prompt learning in various domains, alleviating the lack of fine-grained annotations, while offering both visual and textual explanations. Extensive experiments and explainability analyses conducted on eight datasets of different domains and image modalities demonstrate that our method simultaneously achieves superior performance and interpretability, highlighting the effectiveness of the collaboration between foundation models and XAI.
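For readers unfamiliar with prompt-based classification in CLIP-style models, the following is a minimal, generic sketch of the underlying scoring step: an image embedding is compared by cosine similarity against text-prompt embeddings for each class, and the similarities are turned into probabilities with a temperature-scaled softmax. It is not the KEEP implementation. The function names, the toy three-dimensional embeddings, the temperature value, and the choice to average similarities over several fine-grained prompts per class are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, class_prompt_embs, temperature=0.01):
    """Score an image against per-class prompt embeddings.

    Each class may have several prompts (e.g. fine-grained descriptions);
    averaging their similarities is an illustrative assumption.
    """
    logits = {}
    for name, prompt_embs in class_prompt_embs.items():
        sims = [cosine(image_emb, p) for p in prompt_embs]
        logits[name] = (sum(sims) / len(sims)) / temperature
    # Numerically stable softmax over the class logits.
    top = max(logits.values())
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Toy embeddings standing in for real image and text encoder outputs.
image_emb = [0.9, 0.1, 0.2]
class_prompt_embs = {
    "class_a": [[1.0, 0.0, 0.1], [0.8, 0.2, 0.3]],
    "class_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}
probs = classify(image_emb, class_prompt_embs)
prediction = max(probs, key=probs.get)
```

In a real system the embeddings would come from the model's image and text encoders, and prompt learning would optimize the prompt representations rather than fix them by hand.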
