STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Authors

  • Zijun Wang University of California, Santa Cruz
  • Haoqin Tu University of California, Santa Cruz
  • Yuhan Wang University of California, Santa Cruz
  • Juncheng Wu University of California, Santa Cruz
  • Yanqing Liu University of California, Santa Cruz
  • Jieru Mei Google
  • Brian R. Bartoldson Lawrence Livermore National Laboratory
  • Bhavya Kailkhura Lawrence Livermore National Laboratory
  • Cihang Xie University of California, Santa Cruz

DOI:

https://doi.org/10.1609/aaai.v40i44.41136

Abstract

This paper introduces STAR-1, a high-quality safety dataset of just 1K samples designed specifically for large reasoning models (LRMs) such as DeepSeek-R1. Built on three core principles (diversity, deliberative reasoning, and rigorous filtering), STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.
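
As an illustration of the filtering and diversity principles named in the abstract, the following is a minimal sketch of how a scored candidate pool might be reduced to a small, category-balanced training set. It is not the authors' implementation: the `Sample` fields, the `min_score` threshold, the round-robin balancing rule, and the function name `select_samples` are all assumptions introduced here, and the scores are assumed to have been produced beforehand by some external judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    category: str  # hypothetical safety category label
    score: int     # hypothetical judge score; higher means safer


def select_samples(samples, k, min_score):
    """Keep high-scoring samples, then pick up to k while balancing categories."""
    # Filtering step: drop anything below the (assumed) score threshold.
    kept = [s for s in samples if s.score >= min_score]

    # Group the survivors by category, preserving input order.
    buckets = defaultdict(list)
    for s in kept:
        buckets[s.category].append(s)

    # Diversity step: round-robin over categories until k samples are chosen
    # or every bucket is empty.
    selected = []
    while len(selected) < k and any(buckets.values()):
        for category in sorted(buckets):
            if buckets[category] and len(selected) < k:
                selected.append(buckets[category].pop(0))
    return selected


if __name__ == "__main__":
    pool = [
        Sample("q1", "a", 10),
        Sample("q2", "a", 10),
        Sample("q3", "b", 10),
        Sample("q4", "b", 5),
        Sample("q5", "c", 10),
    ]
    for s in select_samples(pool, k=3, min_score=10):
        print(s.category, s.question)
```

Round-robin selection is only one simple way to keep categories balanced under a fixed budget; the balancing actually used for STAR-1 may differ.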

Published

2026-03-14

How to Cite

Wang, Z., Tu, H., Wang, Y., Wu, J., Liu, Y., Mei, J., … Xie, C. (2026). STAR-1: Safer Alignment of Reasoning LLMs with 1K Data. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37988–37997. https://doi.org/10.1609/aaai.v40i44.41136

Issue

Section

AAAI Special Track on AI Alignment