Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement

Authors

  • Chongjun Tu, College of Future Information Technology, Fudan University
  • Peng Ye, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Dongzhan Zhou, Shanghai Artificial Intelligence Laboratory
  • Tao Chen, College of Future Information Technology, Fudan University; Shanghai Innovation Institute
  • Wanli Ouyang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i12.37919

Abstract

Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning, characterized by overthinking on simple queries and inefficient use of visual information, which leads to substantial wasted and ineffective computation. In this paper, we observe that Multimodal Large Language Models (MLLMs) possess inherent capabilities to distinguish simple from difficult queries and to enhance task-related visual information, yet these capabilities remain underutilized by existing approaches. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, our selective thinking module employs entropy-based confidence estimation to determine whether a query requires detailed reasoning, preventing overthinking on simple questions. Second, our step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical solution for improving multimodal reasoning. Comprehensive experiments across eight benchmarks from diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
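To make the selective-thinking idea concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: it estimates confidence from the entropy of the model's answer distribution on a cheap direct pass, and only falls back to full chain-of-thought generation when that entropy exceeds a threshold. The `model.direct_answer` / `model.cot_answer` wrappers and the threshold value are illustrative assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F


def answer_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the model's distribution at the answer
    position; lower entropy indicates higher confidence."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return float(-(probs * log_probs).sum(dim=-1).mean())


def selective_thinking(model, image, question, entropy_threshold: float = 1.0) -> str:
    """Gate between a cheap direct answer and full multimodal CoT.

    `model.direct_answer` and `model.cot_answer` are hypothetical wrappers
    around a single MLLM generation call with different prompts; the paper's
    exact confidence estimate and threshold may differ.
    """
    logits, short_answer = model.direct_answer(image, question)  # one cheap pass
    if answer_entropy(logits) < entropy_threshold:
        return short_answer                      # confident: skip long reasoning
    return model.cot_answer(image, question)     # uncertain: full step-by-step CoT
```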

Published

2026-03-14

How to Cite

Tu, C., Ye, P., Zhou, D., Chen, T., & Ouyang, W. (2026). Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9576-9584. https://doi.org/10.1609/aaai.v40i12.37919

Issue

Vol. 40 No. 12 (2026)
Section

AAAI Technical Track on Computer Vision IX