LLM Forensic Evaluation: Diagnosing Actionability, Uncertainty, and Human Comprehension in High-Stakes Outputs

Jaye Nias; Saurav K. Aryal; Christopher Watson; Jeremy Blackstone; Simone A. Smarr; Lucretia Williams; Gloria Washington

doi:10.1609/aaaiss.v8i1.42520

Authors

Jaye Nias, Howard University
Saurav K. Aryal, Howard University
Christopher Watson, Howard University
Jeremy Blackstone, Howard University
Simone A. Smarr, Howard University
Lucretia Williams, Howard University
Gloria Washington, Howard University

Abstract

Large language models are increasingly incorporated into decision-support workflows to summarize situations, propose actions, and communicate rationale. These capabilities are valuable in time-sensitive environments, but they also introduce risks related to hallucination, overconfidence, and contextual misalignment. This paper presents Project Comprehension, a forensic evaluation framework that treats language model outputs as artifacts for post hoc analysis rather than as isolated successes or failures. Project Comprehension integrates structured empirical probing across operationally grounded scenarios with human-centered annotation instruments designed to capture interpretability and perceived uncertainty. We report early results from empirical testing and scale validation using a labeling set developed to support reliable forensic judgments of model behavior. We describe a failure mode taxonomy for reasoning and communication breakdowns, and we illustrate how forensic insights can inform assurance practices, trust calibration, and human-autonomy teaming. The paper concludes with recommendations for building forensic readiness into language-enabled systems used in high-stakes decision support.
