NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Xin Yi; Shunfan Zheng; Linlin Wang; Gerard de Melo; Xiaoling Wang; Liang He

doi:10.1609/aaai.v39i24.34762

Authors

Xin Yi East China Normal University
Shunfan Zheng East China Normal University
Linlin Wang East China Normal University
Gerard de Melo Hasso Plattner Institute University of Potsdam
Xiaoling Wang East China Normal University
Liang He East China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i24.34762

Abstract

The emergence of fine-tuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the fine-tuning process, leading to a compromised alignment state. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings indicate that safety-critical neurons exhibit significant regional variations after fine-tuning, which can be effectively corrected through neuron transplantation from the reference model without the need for additional training.

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information