CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Authors

  • Zihui Cheng, School of Computer Science and Engineering, Central South University, China; Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities, Soochow University, China
  • Qiguang Chen, Research Center for SCIR, Harbin Institute of Technology, Harbin, China
  • Jin Zhang, Research Center for SCIR, Harbin Institute of Technology, Harbin, China
  • Hao Fei, National University of Singapore, Singapore
  • Xiaocheng Feng, Research Center for SCIR, Harbin Institute of Technology, Harbin, China
  • Wanxiang Che, Research Center for SCIR, Harbin Institute of Technology, Harbin, China
  • Min Li, School of Computer Science and Engineering, Central South University, China
  • Libo Qin, School of Computer Science and Engineering, Central South University, China; Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities, Soochow University, China

DOI:

https://doi.org/10.1609/aaai.v39i22.34538

Abstract

Large Vision-Language Models (LVLMs) have recently demonstrated remarkable success on multi-modal tasks, including advances in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-only output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from traditional MCoT benchmarks, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operations. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection, to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.
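The benchmark structure described in the abstract can be pictured with a small, hedged sketch. The Python dataclasses below are purely illustrative assumptions about how a CoMT-style example (a task category, multi-modal input, an interleaved text-and-image reasoning chain, and a final answer) might be represented; the names `CoMTExample`, `ReasoningStep`, and all field names are hypothetical and are not taken from the released dataset or the paper.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Hypothetical schema sketch for a CoMT-style example.
# All class and field names are illustrative assumptions,
# not the benchmark's actual data format.

TaskCategory = Literal[
    "visual_creation",   # (1) Visual Creation
    "visual_deletion",   # (2) Visual Deletion
    "visual_update",     # (3) Visual Update
    "visual_selection",  # (4) Visual Selection
]

@dataclass
class ReasoningStep:
    text: str                         # textual rationale for this step
    image_path: Optional[str] = None  # image produced or referenced at this step, if any

@dataclass
class CoMTExample:
    category: TaskCategory            # one of the four CoMT task categories
    question: str                     # textual part of the multi-modal input
    input_images: List[str] = field(default_factory=list)               # paths to input images
    reasoning_chain: List[ReasoningStep] = field(default_factory=list)  # interleaved multi-modal thought
    answer: str = ""                  # final answer (e.g., an option label)
```

Under this reading, an evaluator would score both the final answer and whether each reasoning step carries the expected visual operation; the actual CoMT evaluation protocol should be taken from the paper itself.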

Published

2025-04-11

How to Cite

Cheng, Z., Chen, Q., Zhang, J., Fei, H., Feng, X., Che, W., … Qin, L. (2025). CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23678–23686. https://doi.org/10.1609/aaai.v39i22.34538

Issue

Vol. 39 No. 22 (2025)

Section

AAAI Technical Track on Natural Language Processing I