LLM Forensic Evaluation: Diagnosing Actionability, Uncertainty, and Human Comprehension in High-Stakes Outputs

Authors

  • Jaye Nias, Howard University
  • Saurav K. Aryal, Howard University
  • Christopher Watson, Howard University
  • Jeremy Blackstone, Howard University
  • Simone A. Smarr, Howard University
  • Lucretia Williams, Howard University
  • Gloria Washington, Howard University

DOI:

https://doi.org/10.1609/aaaiss.v8i1.42520

Abstract

Large language models are increasingly incorporated into decision support workflows to summarize situations, propose actions, and communicate rationale. These capabilities are valuable in time-sensitive environments, but they also introduce risks related to hallucination, overconfidence, and contextual misalignment. This paper presents Project Comprehension, a forensic evaluation framework that treats language model outputs as artifacts for post hoc analysis rather than as isolated successes or failures. Project Comprehension integrates structured empirical probing across operationally grounded scenarios with human-centered annotation instruments designed to capture interpretability and perceived uncertainty. We report early results from empirical testing and scale validation using a labeling set developed to support reliable forensic judgments of model behavior. We describe a failure mode taxonomy for reasoning and communication breakdowns, and we illustrate how forensic insights can inform assurance practices, trust calibration, and human-autonomy teaming. The paper concludes with recommendations for building forensic readiness into language-enabled systems used in high-stakes decision support.

Published

2026-05-18

How to Cite

Nias, J., Aryal, S. K., Watson, C., Blackstone, J., Smarr, S. A., Williams, L., & Washington, G. (2026). LLM Forensic Evaluation: Diagnosing Actionability, Uncertainty, and Human Comprehension in High-Stakes Outputs. Proceedings of the AAAI Symposium Series, 8(1), 74–82. https://doi.org/10.1609/aaaiss.v8i1.42520

Section

Advances in AI-Enabled Tactical Autonomy