SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Authors

  • Somnath Banerjee Indian Institute of Technology Kharagpur, India
  • Sayan Layek Indian Institute of Technology Kharagpur, India
  • Soham Tripathy Indian Institute of Technology Kharagpur, India
  • Shanu Kumar Microsoft IDC, India
  • Animesh Mukherjee Indian Institute of Technology Kharagpur, India
  • Rima Hazra Singapore University of Technology and Design, Singapore

DOI:

https://doi.org/10.1609/aaai.v39i26.34927

Abstract

Language models aligned for safety often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, model-editing techniques used to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies.
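The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the additive steering of hidden states, and the linear blending of logits are illustrative assumptions about how a "safety amplification" step and a "safety-guided decoding" step might operate at inference time.

```python
import numpy as np

def safety_amplify(hidden_state, safety_vector, alpha=0.5):
    """Phase 1 (illustrative): shift a hidden state toward a direction
    derived from safe demonstration examples. `safety_vector` and `alpha`
    are hypothetical; the paper's actual mechanism may differ."""
    return hidden_state + alpha * safety_vector

def safety_guided_decode(base_logits, safe_logits, beta=0.7):
    """Phase 2 (illustrative): blend the base model's logits with
    safety-conditioned logits, then greedily pick the next token.
    `beta` controls how strongly the safety distribution steers decoding."""
    blended = (1.0 - beta) * base_logits + beta * safe_logits
    # Softmax for a proper distribution (numerically stabilized).
    probs = np.exp(blended - blended.max())
    probs /= probs.sum()
    return int(np.argmax(probs))

# Toy vocabulary of 3 tokens: the base model prefers token 0,
# while the safety-conditioned distribution prefers token 1.
base = np.array([3.0, 1.0, 0.0])
safe = np.array([0.0, 4.0, 0.0])
print(safety_guided_decode(base, safe, beta=0.7))  # safety-steered choice
print(safety_guided_decode(base, safe, beta=0.0))  # base model's choice
```

With a high `beta` the safety-conditioned distribution dominates token selection; with `beta=0` decoding falls back to the unmodified base model, which is the sense in which the method is decoding-time and context-adaptive rather than a retraining of the model.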

Published

2025-04-11

How to Cite

Banerjee, S., Layek, S., Tripathy, S., Kumar, S., Mukherjee, A., & Hazra, R. (2025). SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27188–27196. https://doi.org/10.1609/aaai.v39i26.34927

Section

AAAI Technical Track on AI Alignment