ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

Authors

  • Pengze Li, Artificial Intelligence Innovation and Incubation Institute of Fudan University; Shanghai Artificial Intelligence Laboratory
  • Jiaqi Liu, Department of Computer Science, University of North Carolina at Chapel Hill; Shanghai Artificial Intelligence Laboratory
  • Junchi Yu, University of Oxford
  • Lihao Liu, Shanghai Artificial Intelligence Laboratory
  • Mingyu Ding, Department of Computer Science, University of North Carolina at Chapel Hill
  • Wanli Ouyang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Shixiang Tang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Xi Chen, Artificial Intelligence Innovation and Incubation Institute of Fudan University; Shanghai Academy of AI for Science

DOI:

https://doi.org/10.1609/aaai.v40i3.37170

Abstract

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, every reasoning step is explicitly categorized as one of Peirce's three fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, comprising more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations of 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and that none can yet extract a complete, standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.

Published

2026-03-14

How to Cite

Li, P., Liu, J., Yu, J., Liu, L., Ding, M., Ouyang, W., … Chen, X. (2026). ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 1900–1908. https://doi.org/10.1609/aaai.v40i3.37170

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems