AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Authors

  • Shuo Yang, Peking University
  • Qihui Zhang, Peking University
  • Yuyang Liu, Peking University
  • Yue Huang, Independent Researcher
  • Xiaojun Jia, Nanyang Technological University
  • Kun-Peng Ning, Peking University
  • Jia-Yu Yao, Peking University
  • Jigang Wang, ZTE Corporation
  • Dai Hailiang, ZTE Corporation
  • Yibing Song, Independent Researcher
  • Li Yuan, Peking University; Peng Cheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i40.40729

Abstract

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction—defined by weight differences between aligned (safe) and unaligned models—rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
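
The idea of penalizing updates orthogonal to the alignment direction can be illustrated with a small sketch. The abstract does not give the exact form of the regularizer, so everything below is an assumption made for illustration: weights are flattened into plain lists, the penalty is the squared norm of the orthogonal component, and the function names and the weight `lam` are hypothetical rather than taken from the paper.

```python
import math


def alignment_direction(w_aligned, w_unaligned):
    """Unit vector along (aligned - unaligned) weights, given as flat lists."""
    diff = [a - u for a, u in zip(w_aligned, w_unaligned)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("aligned and unaligned weights are identical")
    return [x / norm for x in diff]


def split_update(delta, direction):
    """Split an update into parts parallel and orthogonal to a unit direction."""
    coeff = sum(d * u for d, u in zip(delta, direction))
    parallel = [coeff * u for u in direction]
    orthogonal = [d - p for d, p in zip(delta, parallel)]
    return parallel, orthogonal


def orthogonal_penalty(delta, direction, lam=1.0):
    """Assumed regularizer: lam times the squared norm of the orthogonal part."""
    _, orthogonal = split_update(delta, direction)
    return lam * sum(x * x for x in orthogonal)


def regularized_loss(task_loss, delta, direction, lam=1.0):
    """Task loss plus the assumed penalty on off-direction movement."""
    return task_loss + orthogonal_penalty(delta, direction, lam)
```

Under these assumptions, an update that lies entirely along the alignment direction incurs no penalty, while any component that leaves that direction is charged quadratically. A real implementation would apply this per weight matrix inside the training loop rather than on flat lists.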

Published

2026-03-14

How to Cite

Yang, S., Zhang, Q., Liu, Y., Huang, Y., Jia, X., Ning, K.-P., … Yuan, L. (2026). AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34322–34330. https://doi.org/10.1609/aaai.v40i40.40729

Issue

Section

AAAI Technical Track on Natural Language Processing V