"Do Your Guardrails Even Guard?" Method for Evaluating Effectiveness of Moderation Guardrails in Aligning LLM Outputs with Expert User Expectations

Authors

  • Anindya Das Antar, University of Michigan, Ann Arbor
  • Xun Huan, University of Michigan, Ann Arbor
  • Nikola Banovic, University of Michigan, Ann Arbor

DOI:

https://doi.org/10.1609/aies.v8i1.36583

Abstract

Ensuring that large language models (LLMs) align with human values and goals is crucial for their adoption in high-stakes decision-making. To guard against incorrect, misleading, or otherwise unexpected or undesirable LLM outputs, guardrail engineers implement guardrails based on expert knowledge from subject-matter authorities to steer and align pre-trained LLMs. Existing evaluation methods assess LLM performance with and without guardrails, but provide limited insight into how each individual guardrail, and the interactions among guardrails, contribute to alignment. Here, we present a method to evaluate and select guardrails that best align LLM outputs with empirical evidence representing expert knowledge. Through evaluation with real-world illustrative examples of resume quality and recidivism prediction, we show that our method effectively identifies useful moderation guardrails in a way that could help guardrail engineers interpret the contributions of different guardrails to "user-LLM" alignment.

Published

2025-10-15

How to Cite

Das Antar, A., Huan, X., & Banovic, N. (2025). "Do Your Guardrails Even Guard?" Method for Evaluating Effectiveness of Moderation Guardrails in Aligning LLM Outputs with Expert User Expectations. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 705–718. https://doi.org/10.1609/aies.v8i1.36583