Needle in a Patched Haystack: Evaluating Saliency Maps for Vision LLMs
DOI:
https://doi.org/10.1609/aies.v8i3.36763
Abstract
ColPali recently proposed a method for explaining multimodal retrieval-augmented generation (RAG) by visualizing how vision–language models (VLMs) connect image patches to text tokens. However, our theoretical analysis and experiments show that these similarity-based saliency maps are fragile and often misleading. We therefore caution against relying solely on intuitive visualizations and present a principled patch-level dissection technique that traces how vision LLMs actually accumulate evidence across modalities. To address this issue, we introduce Needle-in-a-Patched-Haystack: a patch-centered dataset and metric suite that quantifies transparency by benchmarking localization performance in vision LLMs. Together, our analysis and toolkit establish a stricter standard for VLM interpretability and provide a drop-in evaluation protocol for future research on robust, multimodal explanations.
Published
2025-10-15
How to Cite
Zimmermann, B., & Boussard, M. (2025). Needle in a Patched Haystack: Evaluating Saliency Maps for Vision LLMs. Proceedings of the AAAI ACM Conference on AI, Ethics, and Society, 8(3), 2832–2839. https://doi.org/10.1609/aies.v8i3.36763