Correcting Large Language Model Behavior via Influence Function

Authors

  • Han Zhang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Zhuo Zhang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Yi Zhang, Pengcheng Laboratory
  • Yuanzhao Zhai, National University of Defense Technology
  • Hanyang Peng, Pengcheng Laboratory
  • Yu Lei, Pengcheng Laboratory
  • Yue Yu, Pengcheng Laboratory
  • Hui Wang, Pengcheng Laboratory
  • Bin Liang, The Chinese University of Hong Kong
  • Lin Gui, King's College London
  • Ruifeng Xu, Harbin Institute of Technology; Pengcheng Laboratory; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies

DOI:

https://doi.org/10.1609/aaai.v39i13.33586

Abstract

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether curating new data for continual alignment or manually correcting outdated data for re-alignment, demand costly human resources. To address this, we propose a novel approach, LLM BehAvior Correction with INfluence FunCtion REcall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using a new method, LinFAC, to efficiently identify the training data that significantly impact undesirable model outputs, and (2) applying a novel Influence-driven Bregman Optimization (IBO) technique to adjust the model's outputs based on these influence distributions. Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. Furthermore, LANCET exhibits stronger generalization ability than all baselines under out-of-distribution harmful prompts, offering better interpretability and compatibility with real-world applications of LLMs.
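The first phase described above scores training examples by their influence on an undesirable output. The paper's LinFAC method is not specified here, so as a hedged illustration only, the sketch below implements the classic influence-function score of Koh & Liang for a small logistic-regression model: I(z_i) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_i), where a strongly negative score marks a training point that supports the undesirable prediction. All function names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_example(w, x, y):
    # Gradient of the logistic loss for a single example (x, y), y in {0, 1}.
    return (sigmoid(w @ x) - y) * x

def hessian(w, X):
    # Hessian of the mean logistic loss over the training set.
    p = sigmoid(X @ w)
    s = p * (1 - p)
    return (X.T * s) @ X / len(X)

def influence_scores(w, X_train, y_train, x_test, y_test, damping=1e-3):
    # Score each training point's influence on the test loss:
    # a negative score means upweighting that point lowers the test loss,
    # i.e. the point "supports" the (possibly undesirable) test prediction.
    H = hessian(w, X_train) + damping * np.eye(len(w))  # damping keeps H invertible
    h_inv_g = np.linalg.solve(H, grad_example(w, x_test, y_test))
    return np.array([-grad_example(w, x, y) @ h_inv_g
                     for x, y in zip(X_train, y_train)])

# Toy example: a training point identical to the test point gets a
# negative score, flagging it as responsible for the test behavior.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.5])
scores = influence_scores(w, X_train, y_train, X_train[0], 1.0)
```

In a behavior-correction setting, the most negative scores would identify the training data to be retrained against in the second (post-training) phase; LANCET's LinFAC presumably makes this computation tractable at LLM scale, where the explicit Hessian solve above is infeasible.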

Published

2025-04-11

How to Cite

Zhang, H., Zhang, Z., Zhang, Y., Zhai, Y., Peng, H., Lei, Y., … Xu, R. (2025). Correcting Large Language Model Behavior via Influence Function. Proceedings of the AAAI Conference on Artificial Intelligence, 39(13), 14477–14485. https://doi.org/10.1609/aaai.v39i13.33586

Section

AAAI Technical Track on Humans and AI