Paper Folding Puzzles: Can Multimodal Large Language Models Perform Spatial Reasoning?

Authors

  • Dibin Zhou, Hangzhou Normal University
  • Yantao Xu, Hangzhou Normal University
  • Zongming Huang, Hangzhou Normal University
  • Zengwei Yan, Hangzhou Normal University
  • Wenhao Liu, Hangzhou Normal University
  • Yongwei Miao, Hangzhou Normal University
  • Jianfeng Ren, University of Nottingham Ningbo China
  • Fuchang Liu, Hangzhou Normal University

DOI:

https://doi.org/10.1609/aaai.v40i16.38364

Abstract

Multimodal Large Language Models (MLLMs) lag well behind human-level performance on abstract visual reasoning (AVR), which requires models to infer latent rules from visual question sets and generalize them to novel scenarios. Most AVR benchmarks are confined to narrow, repetitive 2D patterns that involve relatively simple spatial relationships and assess only a few dimensions of reasoning ability. Drawing inspiration from real-world paper folding challenges, we propose Paper Folding Puzzles (PFP), a rigorously designed benchmark for assessing spatial reasoning capabilities. It comprises 150K visual question-answering samples across five diverse tasks, ranging from basic 2D geometric reasoning to 3D spatial understanding. The benchmark can be used to assess core spatial reasoning abilities essential to human cognition, including fundamental symmetry reasoning and 3D spatial comprehension. Furthermore, we conduct a comprehensive evaluation of 18 leading MLLMs (both closed- and open-source) on PFP. Our findings show that most MLLMs achieve near-chance performance on PFP, with substantial performance gaps (>30%) relative to human baselines across all tasks. This highlights a critical research gap in improving the spatial reasoning capabilities of MLLMs.

Published

2026-03-14

How to Cite

Zhou, D., Xu, Y., Huang, Z., Yan, Z., Liu, W., Miao, Y., … Liu, F. (2026). Paper Folding Puzzles: Can Multimodal Large Language Models Perform Spatial Reasoning? Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13584–13592. https://doi.org/10.1609/aaai.v40i16.38364

Issue

Section

AAAI Technical Track on Computer Vision XIII