LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Huimin Ren; Yan Liang; Baiqiao Su; Chaobo Sun; Hengtong Lu; Kaike Zhang; Chen Wei

doi:10.1609/aaai.v40i30.39701

Authors

Huimin Ren, Li Auto Inc.
Yan Liang, Beijing University of Posts and Telecommunications
Baiqiao Su, Beijing University of Posts and Telecommunications
Chaobo Sun, Li Auto Inc.
Hengtong Lu, Li Auto Inc.
Kaike Zhang, Li Auto Inc.
Chen Wei, Li Auto Inc.

DOI: https://doi.org/10.1609/aaai.v40i30.39701

Abstract

The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated "LLM-as-a-judge" systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical (Procedure, Relation, Value) triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
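
To make the (Procedure, Relation, Value) idea concrete, the following is a minimal sketch of how a rule-based check over such triplets might look. It is not the released LexInstructEval engine: the procedure names (`word_count`, `sentence_count`, `char_count`), the relation symbols, and the `Constraint` and `verify` helpers are all hypothetical and chosen only for illustration.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Union

Number = Union[int, float]

# Hypothetical procedures: each maps a response string to a measurable quantity.
PROCEDURES: Dict[str, Callable[[str], Number]] = {
    "word_count": lambda text: len(text.split()),
    "sentence_count": lambda text: len(
        [s for s in re.split(r"[.!?]+", text) if s.strip()]
    ),
    "char_count": len,
}

# Hypothetical relations: each compares the measured quantity with the target value.
RELATIONS: Dict[str, Callable[[Number, Number], bool]] = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


@dataclass(frozen=True)
class Constraint:
    """One lexical constraint expressed as a (procedure, relation, value) triplet."""

    procedure: str
    relation: str
    value: Number

    def check(self, response: str) -> bool:
        # Measure the response, then compare the measurement with the target value.
        measured = PROCEDURES[self.procedure](response)
        return RELATIONS[self.relation](measured, self.value)


def verify(response: str, constraints: Iterable[Constraint]) -> bool:
    """A compositional instruction passes only if every constraint holds."""
    return all(c.check(response) for c in constraints)


if __name__ == "__main__":
    response = "Cats sleep a lot. Dogs bark."
    constraints = [
        Constraint("word_count", "<=", 10),
        Constraint("sentence_count", "==", 2),
    ]
    print(verify(response, constraints))
```

Because each check is a deterministic function of the response text, the verdict is reproducible and can be audited constraint by constraint, which is the property the abstract contrasts with human or LLM-as-a-judge evaluation.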
