SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
DOI:
https://doi.org/10.1609/aaai.v39i26.34927
Abstract
Language models aligned for safety often exhibit fragile and imbalanced mechanisms, increasing the chances of producing unsafe content. In addition, editing techniques to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies.
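To make the 'safety-guided decoding' idea concrete, here is a minimal, hypothetical sketch of one decoding step. It assumes a simple interpolation rule between the base model's next-token logits and logits from a safety-conditioned pass (e.g., one prompted with safe demonstrations); the function names, the blending coefficient `alpha`, and the toy logits are all illustrative assumptions, not the paper's actual method.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def safety_guided_step(base_logits, safe_logits, alpha=0.5):
    # Hypothetical blending rule: interpolate the base and the
    # safety-conditioned logits, then renormalize. `alpha` controls
    # how strongly the safety distribution steers token selection.
    mixed = [(1 - alpha) * b + alpha * s
             for b, s in zip(base_logits, safe_logits)]
    return softmax(mixed)

# Toy 4-token vocabulary; the safety-conditioned pass strongly
# suppresses token 2 (imagine it starts an unsafe continuation).
base = [2.0, 1.0, 3.0, 0.5]   # base model prefers token 2
safe = [2.0, 1.0, -4.0, 0.5]  # safety pass penalizes token 2
probs = safety_guided_step(base, safe, alpha=0.5)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

With these toy values the blended distribution shifts the argmax away from the unsafe token 2 to token 0, illustrating how decoding-time mixing can redirect generation without retraining the model.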
Published
2025-04-11
How to Cite
Banerjee, S., Layek, S., Tripathy, S., Kumar, S., Mukherjee, A., & Hazra, R. (2025). SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27188–27196. https://doi.org/10.1609/aaai.v39i26.34927
Section
AAAI Technical Track on AI Alignment