A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Authors

  • Trenton Chang, University of Michigan
  • Tobias Schnabel, Microsoft Research
  • Adith Swaminathan, Netflix
  • Jenna Wiens, University of Michigan

DOI

https://doi.org/10.1609/aaai.v40i44.41057

Abstract

Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. We highlight two gaps in current LLM evaluations for assessing steerability. First, many benchmarks are built with past LLM chats and text scraped from the Internet, which may skew towards common requests, underrepresenting less-common requests by potential users. Second, prior work measures performance as a scalar, which could conceal behavioral shifts in LLM outputs in open-ended generation. To mitigate these gaps, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs induce unintended changes or "side effects" to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient.
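
To make the vector view in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each text is scored on a few attributes in [0, 1] and that the error of a rewrite can be split into a part along the requested direction (labeled miscalibration here) and a part orthogonal to it (labeled side effect here). The function name `decompose_steering`, the second attribute (formality), and this particular decomposition are illustrative assumptions; the abstract does not define the paper's exact metrics.

```python
import math


def decompose_steering(source, goal, output):
    """Split a rewrite's error into along-goal and off-goal parts.

    All three arguments are equal-length lists of attribute scores.
    Hypothetical helper for illustration; not taken from the paper.
    """
    requested = [g - s for g, s in zip(goal, source)]   # movement the user asked for
    actual = [o - s for o, s in zip(output, source)]    # movement the model produced

    req_norm_sq = sum(r * r for r in requested)
    if req_norm_sq == 0:
        raise ValueError("goal equals source, so no direction was requested")

    # Fraction of the requested movement that was achieved (1.0 is exact).
    coef = sum(a * r for a, r in zip(actual, requested)) / req_norm_sq

    # Component of the actual movement that lies off the requested direction.
    off_goal = [a - coef * r for a, r in zip(actual, requested)]

    return {
        "steering_error": math.dist(output, goal),
        "miscalibration": abs(coef - 1.0) * math.sqrt(req_norm_sq),
        "side_effect": math.sqrt(sum(x * x for x in off_goal)),
    }


# Attributes (assumed): [reading difficulty, formality].
# The user asks only for higher reading difficulty; the model undershoots
# and also shifts formality, which was not requested.
result = decompose_steering(source=[0.5, 0.5], goal=[0.8, 0.5], output=[0.7, 0.7])
print(result)
```

Under these assumptions the two parts are orthogonal, so the squared steering error equals the squared miscalibration plus the squared side effect. This also illustrates why a single scalar score can hide what actually changed in the output.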

Published

2026-03-14

How to Cite

Chang, T., Schnabel, T., Swaminathan, A., & Wiens, J. (2026). A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37259–37267. https://doi.org/10.1609/aaai.v40i44.41057

Issue

Section

AAAI Special Track on AI Alignment