DAS ANTAR, Anindya; HUAN, Xun; BANOVIC, Nikola. "Do Your Guardrails Even Guard?" Method for Evaluating Effectiveness of Moderation Guardrails in Aligning LLM Outputs with Expert User Expectations. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, [S. l.], v. 8, n. 1, p. 705–718, 2025. DOI: 10.1609/aies.v8i1.36583. Available at: https://ojs.aaai.org/index.php/AIES/article/view/36583. Accessed: 27 May 2026.