Knowledge-Enhanced Explainable Prompting for Vision-Language Models

Authors

  • Yequan Bie, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
  • Andong Tan, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
  • Zhixuan Chen, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
  • Zhiyuan Cai, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
  • Luyang Luo, Department of Biomedical Informatics, Harvard University
  • Hao Chen, Department of Computer Science and Engineering, Hong Kong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i4.37233

Abstract

Large-scale vision-language models (VLMs) embedded with expansive representations and visual concepts have shown significant potential in image and text understanding. Efficiently adapting VLMs such as CLIP to downstream tasks like few-shot image classification has attracted growing attention, with prompt learning emerging as a representative approach. However, most existing prompt-based adaptation methods rely solely on coarse-grained textual prompts and therefore suffer from limited performance and interpretability on domain-specific tasks that require specialized knowledge. As a result, they fail to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios such as healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation of VLMs across various domains and image modalities. By incorporating retrieval-augmented generation and domain foundation models, our framework provides more reliable image-wise knowledge for prompt learning in various domains, alleviating the lack of fine-grained annotations while offering both visual and textual explanations. Extensive experiments and explainability analyses on eight datasets spanning different domains and image modalities demonstrate that our method simultaneously achieves superior performance and interpretability, highlighting the effectiveness of the collaboration between foundation models and XAI.

Published

2026-03-14

How to Cite

Bie, Y., Tan, A., Chen, Z., Cai, Z., Luo, L., & Chen, H. (2026). Knowledge-Enhanced Explainable Prompting for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2471–2479. https://doi.org/10.1609/aaai.v40i4.37233

Section

AAAI Technical Track on Computer Vision I