Activating Visual Context and Commonsense Reasoning Through Masked Prediction in VLMs

Authors

  • Jiaao Yu, East China Normal University
  • Shenwei Li, East China Normal University
  • Mingjie Han, East China Normal University
  • Yifei Yin, East China Normal University
  • Wenzheng Song, East China Normal University
  • Chenghao Jia, East China Normal University
  • Man Lan, East China Normal University

DOI:

https://doi.org/10.1609/aaai.v40i33.40019

Abstract

Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly through training on tasks with verifiable rewards. Yet a significant gap persists in adapting these models to real-world multimodal scenarios, most notably vision-language tasks, because of their heavy focus on unimodal language settings. While efforts to transplant reinforcement learning techniques from NLP to Vision-Language Models (VLMs) have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge and ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense (MPCC), which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate generalized reasoning, we develop a specialized benchmark, MPCC-Eval, and employ various fine-tuning strategies to guide reasoning. Among these, we introduce Reinforcement Fine-Tuning with Prior Sampling, a training method that not only enhances in-task performance but also improves generalized reasoning in out-of-distribution (OOD) and cross-task scenarios.
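
To make the MPCC setup concrete, the sketch below illustrates how one might construct a single masked-prediction training example: occlude a region of an image and pair it with a prompt asking the model to infer the hidden content from surrounding visual context and commonsense. This is a minimal illustrative sketch, not the authors' implementation; the function name, mask placement, prompt wording, and the use of a short textual answer as the verifiable target are all assumptions.

```python
from PIL import Image, ImageDraw

def build_mpcc_example(image_path, box, answer):
    """Build one hypothetical MPCC-style example: occlude `box`
    (left, top, right, bottom, in pixels) in the image and pair it
    with a prompt asking the model to reconstruct the hidden content.
    `answer` is a short description of the occluded region, usable as
    a verifiable target for reward-based fine-tuning.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, fill=(127, 127, 127))  # grey-out the masked region

    prompt = (
        "A region of this image has been occluded by a grey rectangle. "
        "Using the surrounding visual context and commonsense knowledge, "
        "describe what the occluded region most plausibly contains."
    )
    return {"image": img, "prompt": prompt, "target": answer}

# Hypothetical usage: hide a burner in a kitchen photo so the model must
# infer it from the pan and range hood that remain visible.
sample = build_mpcc_example("kitchen.jpg", (220, 340, 420, 480), "a gas burner")
```

Because the target is a short, checkable description rather than raw pixels, examples of this form can feed a verifiable-reward objective of the kind the abstract describes.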

Published

2026-03-14

How to Cite

Yu, J., Li, S., Han, M., Yin, Y., Song, W., Jia, C., & Lan, M. (2026). Activating Visual Context and Commonsense Reasoning Through Masked Prediction in VLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 27952–27960. https://doi.org/10.1609/aaai.v40i33.40019

Section

AAAI Technical Track on Machine Learning X