LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

Hao Fu; Shaojun Zhou; Qihong Yang; Junjie Tang; Guiquan Liu; Kaikui Liu; Xiaolong Li

doi:10.1609/aaai.v35i14.17518

Authors

Hao Fu, University of Science and Technology of China
Shaojun Zhou, Alibaba Group
Qihong Yang, Alibaba Group
Junjie Tang, Alibaba Group
Guiquan Liu, University of Science and Technology of China
Kaikui Liu, Alibaba Group
Xiaolong Li, Alibaba Group


Keywords:

Language Models

Abstract

Pre-trained models such as BERT have achieved strong results on a wide range of natural language processing tasks. However, their large number of parameters requires significant memory and inference time, which makes them difficult to deploy on edge devices. In this work, we propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layers in terms of angular distance, an aspect not considered by existing distillation methods. Furthermore, we introduce a gradient perturbation-based training architecture in the training phase to increase the robustness of LRC-BERT, which is the first such attempt in knowledge distillation. Additionally, to better capture the distribution characteristics of the intermediate layers, we design a two-stage training method for the total distillation loss. Finally, evaluated on 8 datasets from the General Language Understanding Evaluation (GLUE) benchmark, the proposed LRC-BERT outperforms existing state-of-the-art methods, which demonstrates the effectiveness of our method.
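
To make the idea of contrastive distillation on angular distance concrete, the following is a minimal sketch and not the loss defined in the paper. It assumes a generic InfoNCE-style objective over cosine similarities as a stand-in: the student's intermediate representation is pulled toward the teacher's representation of the same input (the positive) and pushed away from the teacher's representations of other inputs (the negatives). The function names `cosine_similarity` and `info_nce_angular_loss` and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; independent of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_angular_loss(student, teacher_pos, teacher_negs, temperature=0.5):
    """Illustrative InfoNCE-style loss on cosine similarity (an assumption,
    not the paper's formulation).

    student      -- student intermediate representation for one input
    teacher_pos  -- teacher representation for the same input
    teacher_negs -- teacher representations for other inputs
    """
    pos_logit = cosine_similarity(student, teacher_pos) / temperature
    neg_logits = [cosine_similarity(student, t) / temperature for t in teacher_negs]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Usage: a student aligned with the positive incurs a lower loss than one
# aligned with a negative.
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce_angular_loss([1.0, 0.0], [2.0, 0.0], negatives)
loss_misaligned = info_nce_angular_loss([0.0, 1.0], [2.0, 0.0], negatives)
```

Because the objective depends only on cosine similarity, rescaling the student vector leaves the loss unchanged, which is the sense in which such an objective targets the angle between representations rather than their magnitudes.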
