Two Sides of the Same Coin: Learning the Backdoor to Remove the Backdoor

Qi Zhao; Christian Wressnegger

doi:10.1609/aaai.v39i21.34441

Authors

Qi Zhao Karlsruhe Institute of Technology
Christian Wressnegger Karlsruhe Institute of Technology


Abstract

The community has recently developed various training-time defenses to counter neural backdoors introduced through data poisoning. In light of the observation that a model learns the poisonous samples responsible for the backdoor more easily than benign samples, these approaches either use a fixed threshold on the training loss to split the training data or iteratively learn a reference model as an oracle for identifying benign samples. In particular, the latter has proven effective for anti-backdoor learning. Our method, HARVEY, leverages a similar yet crucially different technique: learning an oracle for poisonous rather than benign samples. Learning a backdoored reference model is significantly easier than learning one on benign data. Consequently, we can identify poisonous samples much more accurately than related work identifies benign samples. This crucial difference enables near-perfect backdoor removal, as we demonstrate in our evaluation. HARVEY substantially outperforms related approaches across attack types, datasets, and architectures, lowering the attack success rate to the very minimum at a negligible loss in natural accuracy.
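
The fixed-threshold baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not HARVEY and not taken from the paper: the function name `split_by_loss`, the threshold value, and the loss values are hypothetical, chosen only to show how samples with unusually low training loss might be flagged as suspected poisonous.

```python
def split_by_loss(losses, threshold):
    """Partition sample indices by per-sample training loss.

    Samples with loss at or below `threshold` are flagged as suspected
    poisonous, on the assumption that backdoor samples are fitted faster
    than benign ones. Everything else is retained as presumably benign.
    """
    suspected = [i for i, loss in enumerate(losses) if loss <= threshold]
    retained = [i for i, loss in enumerate(losses) if loss > threshold]
    return suspected, retained


# Hypothetical per-sample losses after a few warm-up epochs (made-up values).
losses = [0.02, 1.30, 0.01, 0.95, 2.10, 0.03]
suspected, retained = split_by_loss(losses, threshold=0.05)
print(suspected)  # [0, 2, 5]
print(retained)   # [1, 3, 4]
```

A fixed threshold like this depends on the dataset and training schedule, which is one reason the abstract contrasts it with approaches that learn a reference model as an oracle.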
