What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles

Authors

  • Mengtao Zhou, Huazhong University of Science and Technology
  • Sifan Wu, Université de Montréal; Mila - Quebec AI Institute
  • Huan Zhang, Université de Montréal; Mila - Quebec AI Institute
  • Qi Sima, Huazhong University of Science and Technology
  • Bang Liu, Université de Montréal; Mila - Quebec AI Institute

DOI:

https://doi.org/10.1609/aaai.v40i41.40819

Abstract

We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning—the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic "Turtle Soup" game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup stories sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs' performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency, detail completion, and conclusion alignment. Experiments with leading LLMs reveal clear capability limits, common failure patterns, and a significant performance gap compared to humans. Our work offers new insights into LLMs' imaginative reasoning and establishes a foundation for future research on exploratory agent behavior.

Published

2026-03-14

How to Cite

Zhou, M., Wu, S., Zhang, H., Sima, Q., & Liu, B. (2026). What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35130-35138. https://doi.org/10.1609/aaai.v40i41.40819

Section

AAAI Technical Track on Natural Language Processing VI