CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

Authors

  • Guanghao Zhang, Alibaba Group
  • Tao Zhong, Alibaba Group
  • Yan Xia, Zhejiang University; Alibaba Group
  • Mushui Liu, Zhejiang University; Alibaba Group
  • Zhelun Yu, Alibaba Group
  • Haoyuan Li, Alibaba Group
  • Wanggui He, Alibaba Group
  • Dong She, Alibaba Group
  • Yi Wang, Zhejiang University; Alibaba Group
  • Hao Jiang, Alibaba Group

DOI:

https://doi.org/10.1609/aaai.v40i15.38236

Abstract

Previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding, but their effectiveness is fundamentally constrained when they are extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning. Humans, by contrast, typically perform two complementary cognitive operations when engaging in sophisticated multi-image analysis: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which use critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model’s reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
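
The two operations named in the abstract, region-of-interest matching across images and memorization of visual concepts along the reasoning chain, can be illustrated with a toy sketch. This is not the authors' implementation: the names `RegionRef`, `ReasoningStep`, and `RegionMemory`, the bounding boxes, the three-dimensional feature vectors, and the use of cosine similarity for lookup are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class RegionRef:
    """Hypothetical region of interest: image index plus a box (x1, y1, x2, y2)."""
    image_index: int
    box: tuple


@dataclass
class ReasoningStep:
    """One step of an interleaved chain: text plus the regions it refers to."""
    text: str
    regions: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RegionMemory:
    """Toy inference-time memory: stores region features, adds no trainable weights."""

    def __init__(self):
        self.entries = []  # list of (RegionRef, feature vector) pairs

    def write(self, region, feature):
        self.entries.append((region, list(feature)))

    def read(self, query, top_k=1):
        # Rank stored regions by similarity to the query feature.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return [region for region, _ in ranked[:top_k]]


# A two-step chain over two images; boxes and features are made-up values.
chain = [
    ReasoningStep("Locate the dog in the first image.", [RegionRef(0, (10, 20, 110, 220))]),
    ReasoningStep("Find the matching dog in the second image.", [RegionRef(1, (40, 30, 150, 240))]),
]
features = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0]]  # assumed stand-ins for visual features

memory = RegionMemory()
for step, feat in zip(chain, features):
    for region in step.regions:
        memory.write(region, feat)

# A later reasoning step recalls the stored region closest to its query feature.
best = memory.read([0.9, 0.1, 0.1])[0]
print(best.image_index, best.box)
```

In this sketch the memory grows only at inference time, which is one simple way to read "preserving parameter efficiency"; the paper's actual module may work quite differently.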

Published

2026-03-14

How to Cite

Zhang, G., Zhong, T., Xia, Y., Liu, M., Yu, Z., Li, H., … Jiang, H. (2026). CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12430–12438. https://doi.org/10.1609/aaai.v40i15.38236

Section

AAAI Technical Track on Computer Vision XII