ReFF: Reinforcing Format Faithfulness in Language Models Across Varied Tasks

Authors

  • Jiashu Yao, Beijing Institute of Technology
  • Heyan Huang, Beijing Institute of Technology
  • Zeming Liu, Beihang University
  • Haoyu Wen, Beijing Institute of Technology
  • Wei Su, Beijing Institute of Technology
  • Boao Qian, Beijing Institute of Technology
  • Yuhang Guo, Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i24.34757

Abstract

Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench covers a greater variety of tasks in terms of application scenarios (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is paired with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiencies in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on the caption segmentation task) while keeping general quality comparable (e.g., from 47.3 to 46.4 in F1 score). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 score). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.
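The abstract's key observation is that format constraints are decidable: a checker program can verify any output, yielding a training signal without human annotation. Below is a minimal, hypothetical sketch of that idea; the `<answer>...</answer>` wrapping format, the function names, and the binary-reward scheme are illustrative assumptions, not the paper's actual FormatBench checkers or ReFF training loop.

```python
import re

def format_checker(output: str) -> bool:
    """Toy checker for a hypothetical 'wrapping' format: the answer must be
    enclosed in <answer>...</answer> tags. FormatBench attaches a
    task-specific checker program like this to each task."""
    return re.fullmatch(r"<answer>.+</answer>", output.strip(), flags=re.DOTALL) is not None

def format_reward(output: str) -> float:
    """Binary reward derived from the decidable format check; such a signal
    can reinforce format faithfulness without any annotated data."""
    return 1.0 if format_checker(output) else 0.0

# Hypothetical model samples scored by the checker-based reward
samples = ["<answer>Paris</answer>", "Paris", "<answer>Paris"]
rewards = [format_reward(s) for s in samples]  # → [1.0, 0.0, 0.0]
```

Because the checker is a program rather than a learned judge, the reward is exact and cheap, which is what lets ReFF improve faithfulness without labeled data.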

Published

2025-04-11

How to Cite

Yao, J., Huang, H., Liu, Z., Wen, H., Su, W., Qian, B., & Guo, Y. (2025). ReFF: Reinforcing Format Faithfulness in Language Models Across Varied Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25660–25668. https://doi.org/10.1609/aaai.v39i24.34757

Section

AAAI Technical Track on Natural Language Processing III