Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement

Authors

  • Chongjun Tu, College of Future Information Technology, Fudan University
  • Peng Ye, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Dongzhan Zhou, Shanghai Artificial Intelligence Laboratory
  • Tao Chen, College of Future Information Technology, Fudan University; Shanghai Innovation Institute
  • Wanli Ouyang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i12.37919

Abstract

Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning, characterized by overthinking on simple queries and inefficient use of visual information, which leads to substantial wasted and ineffective computation. In this paper, we observe that Multimodal Large Language Models (MLLMs) possess inherent capabilities to distinguish simple from difficult queries and to enhance task-related visual information, yet these capabilities remain underutilized by existing approaches. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, our selective thinking module employs entropy-based confidence estimation to determine whether a query requires detailed reasoning, preventing overthinking on simple questions. Second, our step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical solution for improving multimodal reasoning. Comprehensive experiments across eight benchmarks from diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
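To make the selective-thinking idea concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: it estimates confidence from the entropy of the model's answer distribution on a cheap direct pass, and only falls back to full chain-of-thought generation when that entropy exceeds a threshold. The `model.direct_answer` / `model.cot_answer` wrappers and the threshold value are illustrative assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F


def answer_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the model's distribution at the answer
    position; lower entropy indicates higher confidence."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return float(-(probs * log_probs).sum(dim=-1).mean())


def selective_thinking(model, image, question, entropy_threshold: float = 1.0) -> str:
    """Gate between a cheap direct answer and full multimodal CoT.

    `model.direct_answer` and `model.cot_answer` are hypothetical wrappers
    around a single MLLM generation call with different prompts; the paper's
    exact confidence estimate and threshold may differ.
    """
    logits, short_answer = model.direct_answer(image, question)  # one cheap pass
    if answer_entropy(logits) < entropy_threshold:
        return short_answer                      # confident: skip long reasoning
    return model.cot_answer(image, question)     # uncertain: full step-by-step CoT
```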

Published

2026-03-14

How to Cite

Tu, C., Ye, P., Zhou, D., Chen, T., & Ouyang, W. (2026). Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9576-9584. https://doi.org/10.1609/aaai.v40i12.37919

Issue

Vol. 40 No. 12 (2026)
Section

AAAI Technical Track on Computer Vision IX