Correcting Large Language Model Behavior via Influence Function

Authors

  • Han Zhang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Zhuo Zhang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Yi Zhang, Pengcheng Laboratory
  • Yuanzhao Zhai, National University of Defense Technology
  • Hanyang Peng, Pengcheng Laboratory
  • Yu Lei, Pengcheng Laboratory
  • Yue Yu, Pengcheng Laboratory
  • Hui Wang, Pengcheng Laboratory
  • Bin Liang, The Chinese University of Hong Kong
  • Lin Gui, King's College London
  • Ruifeng Xu, Harbin Institute of Technology; Pengcheng Laboratory; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies

DOI:

https://doi.org/10.1609/aaai.v39i13.33586

Abstract

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether curating new data for continual alignment or manually correcting outdated data for re-alignment, demand costly human resources. To address this, we propose a novel approach, LLM BehAvior Correction with INfluence FunCtion REcall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using a new method, LinFAC, to efficiently identify the training data that significantly impact undesirable model outputs, and (2) applying a novel Influence-driven Bregman Optimization (IBO) technique to adjust the model's outputs based on these influence distributions. Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. Furthermore, LANCET exhibits stronger generalization ability than all baselines under out-of-distribution harmful prompts, offering better interpretability and compatibility with real-world applications of LLMs.
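The first phase described above scores training examples by their influence on an undesirable output. The paper's LinFAC method is not specified here, so as a hedged illustration only, the sketch below implements the classic influence-function score of Koh & Liang for a small logistic-regression model: I(z_i) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_i), where a strongly negative score marks a training point that supports the undesirable prediction. All function names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_example(w, x, y):
    # Gradient of the logistic loss for a single example (x, y), y in {0, 1}.
    return (sigmoid(w @ x) - y) * x

def hessian(w, X):
    # Hessian of the mean logistic loss over the training set.
    p = sigmoid(X @ w)
    s = p * (1 - p)
    return (X.T * s) @ X / len(X)

def influence_scores(w, X_train, y_train, x_test, y_test, damping=1e-3):
    # Score each training point's influence on the test loss:
    # a negative score means upweighting that point lowers the test loss,
    # i.e. the point "supports" the (possibly undesirable) test prediction.
    H = hessian(w, X_train) + damping * np.eye(len(w))  # damping keeps H invertible
    h_inv_g = np.linalg.solve(H, grad_example(w, x_test, y_test))
    return np.array([-grad_example(w, x, y) @ h_inv_g
                     for x, y in zip(X_train, y_train)])

# Toy example: a training point identical to the test point gets a
# negative score, flagging it as responsible for the test behavior.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.5])
scores = influence_scores(w, X_train, y_train, X_train[0], 1.0)
```

In a behavior-correction setting, the most negative scores would identify the training data to be retrained against in the second (post-training) phase; LANCET's LinFAC presumably makes this computation tractable at LLM scale, where the explicit Hessian solve above is infeasible.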

Published

2025-04-11

How to Cite

Zhang, H., Zhang, Z., Zhang, Y., Zhai, Y., Peng, H., Lei, Y., … Xu, R. (2025). Correcting Large Language Model Behavior via Influence Function. Proceedings of the AAAI Conference on Artificial Intelligence, 39(13), 14477–14485. https://doi.org/10.1609/aaai.v39i13.33586

Section

AAAI Technical Track on Humans and AI