DIFT: Protecting Contrastive Learning Against Data Poisoning Backdoor Attacks

Authors

  • Jiang Zhu, The Hong Kong Polytechnic University
  • Yulin Jin, The Hong Kong Polytechnic University
  • Qingqing Ye, The Hong Kong Polytechnic University
  • Zhibiao Guo, The Hong Kong Polytechnic University
  • Kun Fang, The Hong Kong Polytechnic University
  • Ruochen Du, Harbin Engineering University
  • Yingnan Zhao, Harbin Institute of Technology
  • Haibo Hu, The Hong Kong Polytechnic University

DOI:

https://doi.org/10.1609/aaai.v40i2.37141

Abstract

Contrastive learning (CL) is a popular learning paradigm that excels at extracting meaningful representations from unlabeled data. Recent studies have shown that CL is highly vulnerable to backdoor attacks. Current defenses against backdoor attacks in CL are primarily reactive and post-training: backdoors are detected and eliminated during the deployment phase of an already well-trained model. However, these post-training defenses tend to degrade model utility and are resource-intensive, making backdoor detection and elimination from a fully trained model quite challenging. To address this issue, we argue for a fundamentally different perspective, namely integrating the defense into the model's training phase, and propose a novel framework for mitigating backdoors in CL: Density-Based Identification and Fine-Tuning (DIFT). Specifically, DIFT identifies potentially poisoned samples during the early training phase by detecting embeddings with abnormal poisoning characteristics in the feature space. Then, to remove backdoors while preserving model utility, the detected poisoned samples are used to fine-tune the model, and the remaining clean samples are subsequently used to continue training after the fine-tuning. As a proactive training-time defense, DIFT avoids the problematic backdoor removal and high computational cost associated with reactive post-training methods. We empirically evaluate DIFT on various CL algorithms against backdoor attacks. Experimental results demonstrate that our method achieves promising defense effectiveness while maintaining the model's clean-data accuracy.
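To make the density-based identification idea concrete, here is a minimal, hypothetical sketch of flagging suspicious embeddings by abnormal local density. This is not the paper's actual detector: the k-NN density proxy, the z-score threshold, and all names below are illustrative assumptions; the intuition is only that trigger-carrying samples often form an unusually tight cluster in the feature space.

```python
import numpy as np

def knn_density_scores(emb, k=5):
    """Local-density proxy: mean distance to the k nearest neighbours
    of each (unit-normalised) embedding. Smaller = denser neighbourhood."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]      # k smallest distances per row
    return knn.mean(axis=1)

def flag_suspicious(emb, k=5, z_thresh=2.0):
    """Flag samples whose neighbourhood is abnormally dense, i.e. whose
    density score sits far below the mean (z-score < -z_thresh)."""
    s = knn_density_scores(emb, k)
    z = (s - s.mean()) / (s.std() + 1e-12)
    return np.where(z < -z_thresh)[0]

# Toy demo: 95 spread-out "clean" embeddings plus 5 near-duplicate
# "poisoned" ones clustered around a single point (indices 95..99).
rng = np.random.default_rng(0)
clean = rng.normal(size=(95, 16))
poison = rng.normal(size=(1, 16)) + 0.01 * rng.normal(size=(5, 16))
emb = np.vstack([clean, poison])
suspects = flag_suspicious(emb, k=4, z_thresh=2.0)
print(suspects)
```

In a training-time defense of this flavor, the flagged indices would then be treated as the potentially poisoned subset used for the fine-tuning step, while the rest continue as clean training data.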

Published

2026-03-14

How to Cite

Zhu, J., Jin, Y., Ye, Q., Guo, Z., Fang, K., Du, R., Zhao, Y., & Hu, H. (2026). DIFT: Protecting Contrastive Learning Against Data Poisoning Backdoor Attacks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(2), 1641-1649. https://doi.org/10.1609/aaai.v40i2.37141

Section

AAAI Technical Track on Application Domains II