A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Trenton Chang; Tobias Schnabel; Adith Swaminathan; Jenna Wiens

doi:10.1609/aaai.v40i44.41057

Authors

Trenton Chang, University of Michigan
Tobias Schnabel, Microsoft Research
Adith Swaminathan, Netflix
Jenna Wiens, University of Michigan

DOI:

https://doi.org/10.1609/aaai.v40i44.41057

Abstract

Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. We highlight two gaps in current LLM evaluations for assessing steerability. First, many benchmarks are built with past LLM chats and text scraped from the Internet, which may skew towards common requests, underrepresenting less-common requests by potential users. Second, prior work measures performance as a scalar, which could conceal behavioral shifts in LLM outputs in open-ended generation. To mitigate these gaps, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs induce unintended changes or "side effects" to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient.
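The goal-space idea in the abstract can be illustrated with a small sketch. It assumes each text is scored on a few attributes scaled to [0, 1] and treats the source text, the user's goal, and the LLM output as points in that space. The function name `steering_report`, the returned keys, and the projection-based split are illustrative assumptions, not necessarily the metrics defined in the paper.

```python
import numpy as np

def steering_report(source, goal, output):
    """Split the model's movement in goal space into on-target and off-target parts.

    Hypothetical helper: all three arguments are attribute vectors of equal length.
    """
    source, goal, output = (np.asarray(v, dtype=float) for v in (source, goal, output))
    requested = goal - source   # change the user asked for
    actual = output - source    # change the model actually produced
    error = float(np.linalg.norm(output - goal))  # remaining distance to the goal
    requested_norm = float(np.linalg.norm(requested))
    if requested_norm == 0.0:
        # No change was requested, so any movement counts as a side effect.
        return {"error": error, "progress": 0.0,
                "side_effect": float(np.linalg.norm(actual))}
    direction = requested / requested_norm
    along = float(actual @ direction)        # signed movement along the request
    off_target = actual - along * direction  # movement orthogonal to the request
    return {
        "error": error,
        "progress": along / requested_norm,  # 1.0 = exactly as far as requested
        "side_effect": float(np.linalg.norm(off_target)),
    }

# Two assumed attributes: (reading difficulty, formality).
report = steering_report(source=[0.5, 0.5], goal=[0.8, 0.5], output=[0.7, 0.3])
print(report)
```

In this example the user asks only for higher reading difficulty, yet the output also shifts formality. A single scalar error would hide the fact that part of the remaining gap is undershooting along the requested direction and part is an unrequested change in another attribute.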
