Large Language Models Struggle with Unreasonability in Math Problems

Authors

  • Jingyuan Ma State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
  • Damai Dai State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
  • Zihang Yuan Institute of Artificial Intelligence, Beihang University
  • Rui Li State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
  • Weilin Luo Huawei Noah's Ark Lab, China
  • Bin Wang Huawei Noah's Ark Lab, China
  • Qun Liu Huawei Noah's Ark Lab, China
  • Lei Sha Institute of Artificial Intelligence, Beihang University
  • Zhifang Sui State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

DOI:

https://doi.org/10.1609/aaai.v40i38.40518

Abstract

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem were well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o struggle on UMP. While reasoning models such as DeepSeek-R1 demonstrate higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long, meaningless responses that fail to converge. We further find that prompting and fine-tuning enhance the detection of unreasonable inputs, with minor and acceptable trade-offs, making them practical solutions in this challenging setting.

Published

2026-03-14

How to Cite

Ma, J., Dai, D., Yuan, Z., Li, R., Luo, W., Wang, B., Liu, Q., Sha, L., & Sui, Z. (2026). Large Language Models Struggle with Unreasonability in Math Problems. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32428-32436. https://doi.org/10.1609/aaai.v40i38.40518

Section

AAAI Technical Track on Natural Language Processing III