Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

Yuchen Zhou; Jiayu Tang; Shuo Yang; Xiaoyan Xiao; Yuqin Dai; Wenhao Yang; Chao Gou; Xiaobo Xia; Tat-Seng Chua

doi:10.1609/aaai.v40i34.40143

Authors

Yuchen Zhou, Sun Yat-Sen University; National University of Singapore
Jiayu Tang, Sun Yat-Sen University
Shuo Yang, Harbin Institute of Technology (Shenzhen)
Xiaoyan Xiao, Sun Yat-Sen University
Yuqin Dai, Tsinghua University
Wenhao Yang, Nanjing University
Chao Gou, Sun Yat-Sen University
Xiaobo Xia, National University of Singapore
Tat-Seng Chua, National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v40i34.40143

Abstract

Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical **"logical blindspots"** that limit their reliability in practical applications. To systematically diagnose this, we introduce **LogicBench**, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly on challenging tasks such as Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose **LogicCLIP**, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP remains competitive with, and often surpasses, baselines on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.
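
To make the idea of combining objectives concrete, the following is a minimal sketch, not the authors' implementation. It assumes the coarse-grained alignment term is a standard CLIP-style symmetric contrastive loss and the fine-grained multiple-choice term is a cross-entropy over candidate captions for one image. The function names, the weights `w_coarse` and `w_choice`, and the input layout are all hypothetical. The logical structure-aware objective is left out because the abstract does not specify its form.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def coarse_contrastive_loss(sim):
    """Symmetric contrastive loss over a square image-text similarity matrix.

    sim[i][j] is the (already temperature-scaled) similarity of image i and
    caption j; matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    # Image-to-text direction: each row should peak at its diagonal entry.
    img_to_txt = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

def multiple_choice_loss(choice_scores, correct_index):
    """Cross-entropy of one image's scores against its candidate captions."""
    return -log_softmax(choice_scores)[correct_index]

def combined_loss(sim, choice_batches, w_coarse=1.0, w_choice=1.0):
    """Weighted sum of the two sketched terms (weights are assumptions)."""
    mc = sum(multiple_choice_loss(s, k) for s, k in choice_batches)
    mc /= len(choice_batches)
    return w_coarse * coarse_contrastive_loss(sim) + w_choice * mc

# Hypothetical toy inputs: 2 image-caption pairs, and one image scored
# against 4 candidate captions of which index 0 is logically correct.
sim = [[5.0, 1.0], [0.5, 4.0]]
choices = [([4.0, 3.5, 0.2, 0.1], 0)]
print(combined_loss(sim, choices))
```

In this sketch the multiple-choice term is what penalizes a model for scoring a logically altered caption (for example, one with a reversed cause and effect) nearly as high as the correct one, which the coarse batch-level term alone may not do.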
