ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

Authors

  • Pengze Li, Artificial Intelligence Innovation and Incubation Institute of Fudan University; Shanghai Artificial Intelligence Laboratory
  • Jiaqi Liu, Department of Computer Science, University of North Carolina at Chapel Hill; Shanghai Artificial Intelligence Laboratory
  • Junchi Yu, University of Oxford
  • Lihao Liu, Shanghai Artificial Intelligence Laboratory
  • Mingyu Ding, Department of Computer Science, University of North Carolina at Chapel Hill
  • Wanli Ouyang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Shixiang Tang, Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
  • Xi Chen, Artificial Intelligence Innovation and Incubation Institute of Fudan University; Shanghai Academy of AI for Science

DOI:

https://doi.org/10.1609/aaai.v40i3.37170

Abstract

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, every reasoning step is explicitly categorized as one of Peirce's three fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, comprising more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations of 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and that none can yet extract a complete, standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.

Published

2026-03-14

How to Cite

Li, P., Liu, J., Yu, J., Liu, L., Ding, M., Ouyang, W., … Chen, X. (2026). ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 1900–1908. https://doi.org/10.1609/aaai.v40i3.37170

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems