TaxReasoning: Benchmarking Knowledge-Intensive Mathematical Reasoning with Evolving Tax Laws

Authors

  • Nan Hu, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Yike Wu, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Jiaye Li, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • HuiKang Hu, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Guilin Qi, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Songlin Zhai, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Yongrui Chen, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Tianxing Wu, School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Tongtong Wu, Monash University, Australia
  • Jiaoyan Chen, The University of Manchester, United Kingdom
  • Jeff Z. Pan, The University of Edinburgh, United Kingdom

DOI:

https://doi.org/10.1609/aaai.v40i37.40367

Abstract

Recent studies have explored the capabilities of large language models (LLMs) in solving knowledge-intensive mathematical reasoning problems. However, existing benchmarks predominantly involve static theorems that LLMs have encountered during pretraining, and so fail to assess dynamic knowledge integration. In this work, we introduce TaxReasoning, a novel benchmark designed to evaluate LLMs' abilities in real-world tax calculation scenarios. These tasks require not only mathematical reasoning and numerical computation, but also the extraction and application of complex, frequently updated tax regulations. Through extensive experiments with state-of-the-art LLMs using diverse prompting strategies and knowledge augmentation techniques, we uncover substantial limitations in their ability to handle dynamic, knowledge-intensive questions, primarily due to missing domain-specific knowledge and ineffective retrieval. Even the best-performing models fall significantly short of human-level performance. Our analysis points to key avenues for improvement, including enhancing LLMs' reasoning capabilities, developing more effective knowledge summarization techniques, and improving retrieval strategies. TaxReasoning offers a critical testbed for advancing LLMs in dynamic knowledge-intensive domains.

Published

2026-03-14

How to Cite

Hu, N., Wu, Y., Li, J., Hu, H., Qi, G., Zhai, S., … Pan, J. Z. (2026). TaxReasoning: Benchmarking Knowledge-Intensive Mathematical Reasoning with Evolving Tax Laws. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31068–31076. https://doi.org/10.1609/aaai.v40i37.40367

Section

AAAI Technical Track on Natural Language Processing II