Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka; Keita Saito; Mikoto Kudo; Takumi Tanabe; Akifumi Wachi; Youhei Akimoto

doi:10.1609/aaai.v40i44.41087

Authors

Shigeki Kusaka, University of Tsukuba
Keita Saito, University of Tsukuba
Mikoto Kudo, University of Tsukuba; RIKEN Center for Advanced Intelligence Project
Takumi Tanabe, LY Corporation
Akifumi Wachi, LY Corporation
Youhei Akimoto, University of Tsukuba; RIKEN Center for Advanced Intelligence Project; Institute of Science Tokyo

DOI: https://doi.org/10.1609/aaai.v40i44.41087

Abstract

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM’s policy toward an attacker’s target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model’s feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
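
The abstract does not state the optimization problem itself. The block below is only an illustrative sketch of how label flipping can lead to a convex problem with linear constraints, not the authors' formulation. It assumes a Bradley-Terry preference model with a linear reward, and every symbol in it ($d_i$, $y_i$, $\tilde{y}_i$, $\theta$, $\theta^\star$, $n$, $k$, $\sigma$) is introduced here for illustration.

```latex
% Assumed setup (not taken from the paper):
%   d_i in R^k     : feature difference between the two compared outputs in pair i
%   y_i in {0,1}   : original preference label of pair i
%   \tilde{y}_i    : poisoned label of pair i
%   \theta^\star   : reward parameter the attacker wants the learner to recover
%   \sigma(z) = 1 / (1 + e^{-z})

% Bradley-Terry log-likelihood under poisoned labels (concave in \theta):
\[
  \mathcal{L}(\theta; \tilde{y})
  = \sum_{i=1}^{n} \Big[ \tilde{y}_i \log \sigma(\theta^\top d_i)
  + (1 - \tilde{y}_i) \log\big(1 - \sigma(\theta^\top d_i)\big) \Big]
\]

% Its gradient is linear in the labels:
\[
  \nabla_\theta \mathcal{L}(\theta; \tilde{y})
  = \sum_{i=1}^{n} \big( \tilde{y}_i - \sigma(\theta^\top d_i) \big)\, d_i
\]

% Requiring \theta^\star to be a stationary point gives k linear equalities in \tilde{y}.
% Relaxing \tilde{y}_i to [0,1] and using |\tilde{y}_i - y_i| = (1 - 2 y_i)\tilde{y}_i + y_i
% for y_i in {0,1} yields a linear program:
\[
  \min_{\tilde{y} \in [0,1]^n} \; \sum_{i=1}^{n} \big[ (1 - 2 y_i)\, \tilde{y}_i + y_i \big]
  \quad \text{s.t.} \quad
  \sum_{i=1}^{n} \tilde{y}_i\, d_i = \sum_{i=1}^{n} \sigma(\theta^{\star\top} d_i)\, d_i
\]
```

Under these assumptions the objective counts (fractional) label flips, and the attack target enters only through $k$ equality constraints on $n$ label variables. When the binary-label version of the problem is feasible, the optimum of this relaxation is a lower bound on the number of flips it requires.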
