Let’s Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Authors

  • Xu Liu Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen; School of Computer Science and Engineering, Central South University; Text Computing and Cognitive Intelligence Ministry of Education Engineering Research Center, Guizhou University
  • Yongheng Zhang School of Computer Science and Engineering, Central South University
  • Qiguang Chen School of Computer Science and Engineering, Central South University
  • Yao Li Shanghai Aviation Electric Co., Ltd, Aviation Industry Corporation of China, Shanghai
  • Sheng Wang Shanghai Aviation Electric Co., Ltd, Aviation Industry Corporation of China, Shanghai
  • Libo Qin Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen; School of Computer Science and Engineering, Central South University; Text Computing and Cognitive Intelligence Ministry of Education Engineering Research Center, Guizhou University

DOI:

https://doi.org/10.1609/aaai.v40i38.40494

Abstract

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. Despite promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption and enabling more efficient ICoT reasoning.

Published

2026-03-14

How to Cite

Liu, X., Zhang, Y., Chen, Q., Li, Y., Wang, S., & Qin, L. (2026). Let’s Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32213–32221. https://doi.org/10.1609/aaai.v40i38.40494

Section

AAAI Technical Track on Natural Language Processing III