Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor

Authors

  • Qi Zhao, Karlsruhe Institute of Technology
  • Christian Wressnegger, Karlsruhe Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i21.34441

Abstract

The community has recently developed various training-time defenses to counter neural backdoors introduced through data poisoning. Building on the observation that a model learns the poisonous samples responsible for the backdoor more easily than benign samples, these approaches either split the training data using a fixed threshold on the training loss or iteratively learn a reference model that serves as an oracle for identifying benign samples. The latter, in particular, has proven effective for anti-backdoor learning. Our method, HARVEY, leverages a similar yet crucially different technique: learning an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning one on benign data. Consequently, we can identify poisonous samples much more accurately than related work identifies benign samples. This crucial difference enables near-perfect backdoor removal, as we demonstrate in our evaluation. HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to the very minimum at a negligible loss in natural accuracy.
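As a rough illustration of the fixed-threshold baseline the abstract mentions (not of HARVEY itself, whose procedure is not detailed here), the following minimal sketch flags samples whose training loss falls below a threshold as suspected poisonous. The function name `split_by_loss`, the threshold value, and the toy loss values are assumptions made purely for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def split_by_loss(
    losses: Sequence[float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Split sample indices by a fixed per-sample training-loss threshold.

    Hypothetical helper: samples with a loss below the threshold are flagged
    as suspected poisonous (they were learned unusually easily); all other
    samples are treated as presumed benign.
    """
    suspected = [i for i, loss in enumerate(losses) if loss < threshold]
    presumed_benign = [i for i, loss in enumerate(losses) if loss >= threshold]
    return suspected, presumed_benign


# Toy per-sample losses (made-up numbers, not results from the paper).
toy_losses = [0.02, 1.35, 0.90, 0.01, 2.10, 0.75]
suspected, presumed_benign = split_by_loss(toy_losses, threshold=0.1)
print(suspected)        # [0, 3]
print(presumed_benign)  # [1, 2, 4, 5]
```

A fixed threshold like this depends heavily on the chosen value, which is one reason the abstract contrasts it with approaches that learn a reference model as an oracle.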

Published

2025-04-11

How to Cite

Zhao, Q., & Wressnegger, C. (2025). Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor. Proceedings of the AAAI Conference on Artificial Intelligence, 39(21), 22804–22812. https://doi.org/10.1609/aaai.v39i21.34441

Issue

Section

AAAI Technical Track on Machine Learning VII