FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Authors

  • Zichen Tang, Beijing University of Posts and Telecommunications
  • Haihong E, Beijing University of Posts and Telecommunications
  • Rongjin Li, Beijing University of Posts and Telecommunications
  • Jiacheng Liu, Beijing University of Posts and Telecommunications
  • Linwei Jia, Beijing University of Posts and Telecommunications
  • Zhuodi Hao, Beijing University of Posts and Telecommunications
  • Zhongjun Yang, Beijing University of Posts and Telecommunications
  • Yuanze Li, Beijing University of Posts and Telecommunications
  • Haolin Tian, Beijing University of Posts and Telecommunications
  • Xinyi Hu, Beijing University of Posts and Telecommunications
  • Peizhi Zhao, Beijing University of Posts and Telecommunications
  • Yuan Liu, Beijing University of Posts and Telecommunications
  • Zhengyu Wang, Beijing University of Posts and Telecommunications
  • Xianghe Wang, Beijing University of Posts and Telecommunications
  • Yiling Huang, Beijing University of Posts and Telecommunications
  • Xueyuan Lin, Hithink RoyalFlush Information Network Co., Ltd.
  • Ruofei Bai, Beijing University of Posts and Telecommunications
  • Zijian Xie, Beijing University of Posts and Telecommunications
  • Qian Huang, Beijing University of Posts and Telecommunications
  • Ruining Cao, Beijing University of Posts and Telecommunications
  • Haocheng Gao, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i30.39785

Abstract

We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of the 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions. (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both the breadth and depth of financial documents. (3) Multi-Step Computation: Problems demand 11 reasoning steps on average (5.3 extraction + 5.7 calculation steps), and 65.0% require cross-page evidence (2.4 pages on average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

Published

2026-03-14

How to Cite

Tang, Z., E, H., Li, R., Liu, J., Jia, L., Hao, Z., … Gao, H. (2026). FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25858–25866. https://doi.org/10.1609/aaai.v40i30.39785

Section

AAAI Technical Track on Machine Learning VII