NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Authors

  • Xin Yi, East China Normal University
  • Shunfan Zheng, East China Normal University
  • Linlin Wang, East China Normal University
  • Gerard de Melo, Hasso Plattner Institute, University of Potsdam
  • Xiaoling Wang, East China Normal University
  • Liang He, East China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i24.34762

Abstract

The emergence of fine-tuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious samples uploaded by users can subtly manipulate the fine-tuning process, leading to a compromised alignment state. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. Our framework first constructs a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then use this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we restore only those neurons that exhibit significant similarity differences by transplanting the prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while largely preserving task-level accuracy. Our findings indicate that safety-critical neurons exhibit significant regional variations after fine-tuning, which can be effectively corrected through neuron transplantation from the reference model without the need for additional training.
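
The restoration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: each neuron is represented as one row of a weight matrix, cosine similarity is used as the similarity measure, the set of safety-critical neurons (`safety_neuron_ids`) is taken as given rather than derived from the reference model, and the names `transplant_neurons` and `threshold` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transplant_neurons(finetuned, reference, safety_neuron_ids, threshold=0.9):
    """Copy reference rows into a copy of the fine-tuned weights.

    Assumption: a neuron is one row of the weight matrix, and a neuron counts
    as "significantly changed" when its cosine similarity to the reference
    row falls below `threshold` (a hypothetical cutoff).
    """
    patched = [list(row) for row in finetuned]  # leave the input untouched
    restored = []
    for idx in safety_neuron_ids:
        if cosine_similarity(finetuned[idx], reference[idx]) < threshold:
            patched[idx] = list(reference[idx])  # transplant the patch
            restored.append(idx)
    return patched, restored


# Toy weights: three neurons with two inputs each.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
finetuned = [[0.0, 1.0], [0.0, 0.9], [1.0, -1.0]]

# Only neurons 0 and 1 are treated as safety-critical in this toy example.
patched, restored = transplant_neurons(finetuned, reference, safety_neuron_ids=[0, 1])
print(restored)
print(patched)
```

In this toy example, neuron 0 has rotated away from its reference direction and is replaced, neuron 1 has only been rescaled and is kept, and neuron 2 is left alone because it is not in the safety-critical set. No gradient updates are involved, which matches the training-free character of the approach.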

Published

2025-04-11

How to Cite

Yi, X., Zheng, S., Wang, L., de Melo, G., Wang, X., & He, L. (2025). NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25706–25714. https://doi.org/10.1609/aaai.v39i24.34762

Issue

Section

AAAI Technical Track on Natural Language Processing III