Scaling Trends for Data Poisoning in LLMs

Authors

  • Dillon Bowen, FAR.AI
  • Brendan Murphy, FAR.AI
  • Will Cai, University of California, Berkeley
  • David Khachaturov, University of Cambridge
  • Adam Gleave, FAR.AI
  • Kellin Pelrine, FAR.AI; McGill University; Mila

DOI:

https://doi.org/10.1609/aaai.v39i26.34929

Abstract

LLMs produce harmful and undesirable behavior when trained on datasets containing even a small fraction of poisoned data. We demonstrate that GPT models remain vulnerable to fine-tuning on poisoned data, even when safeguarded by moderation systems. Given the persistence of data poisoning vulnerabilities in today's most capable models, this paper investigates whether these risks increase with model scaling. We evaluate three threat models—malicious fine-tuning, imperfect data curation, and intentional data contamination—across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors more quickly than smaller models even from minimal exposure to poisoned data. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.
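To illustrate the kind of setup the imperfect-curation and intentional-contamination threat models describe, here is a minimal sketch of blending a small fraction of harmful examples into an otherwise benign fine-tuning set. The file names, JSONL format, 0.5% poisoning rate, and helper functions are illustrative assumptions, not the paper's exact experimental setup.

```python
# Hypothetical sketch: build a fine-tuning dataset in which a small fraction
# of examples is poisoned. All file names, the poisoning rate, and the data
# format are assumptions for illustration only.
import json
import random

POISON_RATE = 0.005  # assumed 0.5% poisoned examples


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def mix_poisoned_dataset(benign_path: str, poison_path: str, n_total: int,
                         poison_rate: float = POISON_RATE,
                         seed: int = 0) -> list[dict]:
    """Sample n_total fine-tuning examples with the given fraction poisoned."""
    rng = random.Random(seed)
    benign = load_jsonl(benign_path)
    poison = load_jsonl(poison_path)
    n_poison = round(n_total * poison_rate)
    mixed = rng.sample(benign, n_total - n_poison) + rng.sample(poison, n_poison)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Hypothetical input files; writes a shuffled mixed dataset for fine-tuning.
    dataset = mix_poisoned_dataset("benign.jsonl", "harmful.jsonl", n_total=10_000)
    with open("finetune_mixed.jsonl", "w", encoding="utf-8") as f:
        for example in dataset:
            f.write(json.dumps(example) + "\n")
```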

Published

2025-04-11

How to Cite

Bowen, D., Murphy, B., Cai, W., Khachaturov, D., Gleave, A., & Pelrine, K. (2025). Scaling Trends for Data Poisoning in LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27206–27214. https://doi.org/10.1609/aaai.v39i26.34929

Section

AAAI Technical Track on AI Alignment