Beyond Verdicts: Evaluating Language Model Moral Competence

Aaron J Snoswell; Daniel Kilov; Seth Lazar

doi:10.1609/aaai.v40i44.41131

Authors

Aaron J Snoswell, Digital Media Research Centre GenAI Lab, Queensland University of Technology, Kelvin Grove, QLD 4012, Australia
Daniel Kilov, Machine Intelligence and Normative Theory Lab, Australian National University, Acton, ACT 2601, Australia
Seth Lazar, Machine Intelligence and Normative Theory Lab, Australian National University, Acton, ACT 2601, Australia

DOI:

https://doi.org/10.1609/aaai.v40i44.41131

Abstract

As Large Language Models (LLMs) are increasingly deployed as Artificial Moral Advisors and autonomous agents making ethical decisions, evaluating their moral competence has become critical. However, existing evaluations may inadequately assess the moral reasoning capabilities needed for real-world deployment, focusing primarily on whether models can match human judgments on carefully curated ethical scenarios. We surveyed 69 papers evaluating LLM ethical competence (2020-2025) and developed a taxonomy categorizing evaluations across datasets, behaviors, and metrics. Our analysis maps the methodological landscape of this rapidly growing field and reveals several critical limitations. Most significantly, the vast majority of studies rely on pre-packaged scenarios that highlight morally relevant features, failing to test models' ability to identify ethical considerations in noisy, realistic contexts, a capacity we term "moral sensitivity". Additionally, evaluations overemphasize verdict accuracy rather than assessing moral reasoning quality and steerability, with few studies testing whether models can be appropriately guided toward different ethical frameworks. Most studies also rely on "ground truth" comparisons despite philosophical arguments that reasonable moral pluralism precludes definitive moral ground truth. In light of these gaps, we argue for a significant methodological shift: moving from curated scenarios to unfiltered information streams, from verdict accuracy to reasoning quality and steerability, and from ground truth metrics to assessments of reasonableness and consistency. This reorientation is essential for developing AI systems that can navigate moral complexity in real-world deployment scenarios.
