From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks

Authors

  • Xianda Zhang (Department of Computer Science and Technology, Peking University; Advanced Institute of Big Data)
  • Baolin Zheng (Alibaba Group)
  • Jianbao Hu (University of Glasgow)
  • Chengyang Li (Department of Computer Science and Technology, Peking University)
  • Xiaoying Bai (Advanced Institute of Big Data)

DOI:

https://doi.org/10.1609/aaai.v38i15.29629

Keywords:

ML: Adversarial Learning & Robustness, CV: Adversarial Attacks & Robustness

Abstract

Despite the tremendous success of deep neural networks (DNNs) across various fields, their susceptibility to backdoor attacks seriously threatens their application security, particularly in safety-critical or security-sensitive settings. Given this growing threat, there is a pressing need for research into purging backdoors from DNNs. However, prior efforts to erase backdoor triggers have not only failed to withstand increasingly powerful attacks but have also degraded model performance. In this paper, we propose From Toxic to Trustworthy (FTT), an innovative approach that eliminates backdoor triggers while simultaneously enhancing model accuracy. Under the stringent and practical assumption that only a limited amount of clean data is available, we introduce a self-attention distillation (SAD) method that removes the backdoor by aligning the shallow and deep parts of the network. Furthermore, we devise a semi-supervised learning (SSL) method that leverages the ubiquitous, readily available poisoned data to further purge backdoors and improve accuracy. Extensive experiments on various attacks and models show that FTT reduces the attack success rate from 97% to 1% and improves accuracy by 4% on average, demonstrating its effectiveness in mitigating backdoor attacks and improving model performance. Compared with state-of-the-art (SOTA) methods, FTT halves the attack success rate and improves accuracy by 5%, shedding light on backdoor cleansing.
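
To make the SAD idea concrete, the sketch below shows, in a PyTorch-style setup, one common way to align the attention maps of a shallow and a deep block of the same network. It is a minimal sketch under assumptions: the channel-energy attention map, the L2 normalization, the bilinear resizing, and the choice to detach the deep map are illustrative conventions borrowed from prior attention-distillation work, not necessarily FTT's exact formulation.

```python
# Hedged sketch of self-attention distillation (SAD): align the spatial
# attention map of a shallow block with that of a deep block of the same
# network. All design choices here are illustrative assumptions.
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Collapse a feature map (N, C, H, W) to a normalized spatial
    attention map (N, H*W) via channel-wise activation energy."""
    amap = feat.pow(2).mean(dim=1)        # (N, H, W): per-location energy
    amap = amap.flatten(1)                # (N, H*W)
    return F.normalize(amap, p=2, dim=1)  # unit L2 norm per sample

def sad_loss(shallow_feat: torch.Tensor, deep_feat: torch.Tensor) -> torch.Tensor:
    """L2 distance between shallow and deep attention maps. Detaching the
    deep map (an assumption; the alignment direction is not specified in
    the abstract) pulls the shallow representation toward it."""
    if shallow_feat.shape[-2:] != deep_feat.shape[-2:]:
        # Resize the deep features so both attention maps share a grid.
        deep_feat = F.interpolate(deep_feat, size=shallow_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return F.mse_loss(attention_map(shallow_feat),
                      attention_map(deep_feat).detach())
```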
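The semi-supervised step can likewise be sketched. The abstract only states that FTT leverages the poisoned data, so the FixMatch-style pseudo-labeling below, including the confidence threshold, the weak/strong augmentation pair, and the `ssl_loss` helper, is one plausible instantiation offered purely for illustration, not the paper's recipe.

```python
# Hedged sketch of the semi-supervised (SSL) step: treat the possibly
# poisoned training set as unlabeled data and pseudo-label it with the
# partially purified model. FixMatch-style, all details assumed.
import torch
import torch.nn.functional as F

def ssl_loss(model: torch.nn.Module,
             weak_batch: torch.Tensor,
             strong_batch: torch.Tensor,
             threshold: float = 0.95) -> torch.Tensor:
    """Cross-entropy on strongly augmented views, using confident
    pseudo-labels from weakly augmented views; low-confidence samples
    are masked out of the loss."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=1)
        conf, pseudo = probs.max(dim=1)          # confidence and pseudo-label
        mask = (conf >= threshold).float()       # keep only confident samples
    logits = model(strong_batch)
    per_sample = F.cross_entropy(logits, pseudo, reduction="none")
    return (per_sample * mask).mean()
```

In practice such a term would be combined with a supervised loss on the small clean set (and, in FTT's case, the SAD term), so that the abundant poisoned data contributes to accuracy without its labels being trusted.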

Published

2024-03-24

How to Cite

Zhang, X., Zheng, B., Hu, J., Li, C., & Bai, X. (2024). From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15), 16873-16880. https://doi.org/10.1609/aaai.v38i15.29629

Section

AAAI Technical Track on Machine Learning VI