SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Authors

  • Somnath Banerjee Indian Institute of Technology Kharagpur, India
  • Sayan Layek Indian Institute of Technology Kharagpur, India
  • Soham Tripathy Indian Institute of Technology Kharagpur, India
  • Shanu Kumar Microsoft IDC, India
  • Animesh Mukherjee Indian Institute of Technology Kharagpur, India
  • Rima Hazra Singapore University of Technology and Design, Singapore

DOI:

https://doi.org/10.1609/aaai.v39i26.34927

Abstract

Language models aligned for safety often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, model-editing techniques used to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies.
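The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the additive steering of hidden states, and the linear blending of logits are illustrative assumptions about how a "safety amplification" step and a "safety-guided decoding" step might operate at inference time.

```python
import numpy as np

def safety_amplify(hidden_state, safety_vector, alpha=0.5):
    """Phase 1 (illustrative): shift a hidden state toward a direction
    derived from safe demonstration examples. `safety_vector` and `alpha`
    are hypothetical; the paper's actual mechanism may differ."""
    return hidden_state + alpha * safety_vector

def safety_guided_decode(base_logits, safe_logits, beta=0.7):
    """Phase 2 (illustrative): blend the base model's logits with
    safety-conditioned logits, then greedily pick the next token.
    `beta` controls how strongly the safety distribution steers decoding."""
    blended = (1.0 - beta) * base_logits + beta * safe_logits
    # Softmax for a proper distribution (numerically stabilized).
    probs = np.exp(blended - blended.max())
    probs /= probs.sum()
    return int(np.argmax(probs))

# Toy vocabulary of 3 tokens: the base model prefers token 0,
# while the safety-conditioned distribution prefers token 1.
base = np.array([3.0, 1.0, 0.0])
safe = np.array([0.0, 4.0, 0.0])
print(safety_guided_decode(base, safe, beta=0.7))  # safety-steered choice
print(safety_guided_decode(base, safe, beta=0.0))  # base model's choice
```

With a high `beta` the safety-conditioned distribution dominates token selection; with `beta=0` decoding falls back to the unmodified base model, which is the sense in which the method is decoding-time and context-adaptive rather than a retraining of the model.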

Published

2025-04-11

How to Cite

Banerjee, S., Layek, S., Tripathy, S., Kumar, S., Mukherjee, A., & Hazra, R. (2025). SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27188–27196. https://doi.org/10.1609/aaai.v39i26.34927

Section

AAAI Technical Track on AI Alignment