ShareBERT: Embeddings Are Capable of Learning Hidden Layers

Jia Cheng Hu; Roberto Cavicchioli; Giulia Berardinelli; Alessandro Capotondi

doi:10.1609/aaai.v38i16.29781

Authors

Jia Cheng Hu, University of Modena and Reggio Emilia
Roberto Cavicchioli, University of Modena and Reggio Emilia
Giulia Berardinelli, University of Modena and Reggio Emilia
Alessandro Capotondi, University of Modena and Reggio Emilia

Keywords:

NLP: (Large) Language Models, CSO: Constraint Optimization

Abstract

The deployment of Pre-trained Language Models on memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature have shown that small models often suffer a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that shares parameters between the embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which preserves up to 95.5% of BERT Base performance using only 5M parameters (21.9× fewer parameters), without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
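
The abstract does not say how the sharing between embeddings and hidden layers is implemented, so the following is only a minimal sketch of the general idea, not the ShareBERT architecture. It assumes, purely for illustration, that a hidden-layer projection reuses a square slice of the embedding table as its weight matrix, so the layer adds no new parameters. All function names and the slicing scheme are hypothetical. The last helper checks the parameter ratio quoted above; because "5M" is a rounded figure, the implied baseline size is approximate.

```python
import random


def make_matrix(rows, cols, seed=0):
    """Create a rows x cols matrix of small random weights (list of lists)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def shared_projection(embedding, x, start_row=0):
    """Project x with a weight matrix taken from the embedding table.

    Hypothetical sharing scheme: rows start_row .. start_row + len(x) of the
    embedding table act as a hidden x hidden weight matrix. The rows are
    reused by reference, so no new parameters are created.
    """
    hidden = len(x)
    if start_row + hidden > len(embedding):
        raise ValueError("embedding table has too few rows for this slice")
    weight = embedding[start_row:start_row + hidden]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def count_unique_parameters(*matrices):
    """Count scalar weights, counting each shared row object only once."""
    seen = set()
    total = 0
    for matrix in matrices:
        for row in matrix:
            if id(row) not in seen:
                seen.add(id(row))
                total += len(row)
    return total


def implied_baseline_params(small_params, reduction_factor):
    """Baseline size implied by a compressed size and a reduction factor."""
    return small_params * reduction_factor


if __name__ == "__main__":
    vocab_size, hidden = 16, 4          # toy sizes, chosen for illustration
    embedding = make_matrix(vocab_size, hidden, seed=0)

    shared_weight = embedding[0:hidden]                 # reuses embedding rows
    separate_weight = make_matrix(hidden, hidden, seed=1)  # ordinary layer

    print("embedding + shared layer:  ",
          count_unique_parameters(embedding, shared_weight))
    print("embedding + separate layer:",
          count_unique_parameters(embedding, separate_weight))
    print("projection:", shared_projection(embedding, [1.0, 0.0, 0.0, 0.0]))
    print("implied baseline size:", implied_baseline_params(5_000_000, 21.9))
```

In this toy setting the shared layer leaves the unique parameter count equal to the size of the embedding table alone, while the separate layer adds hidden × hidden extra weights. The implied baseline of roughly 110M parameters is in line with the commonly cited size of BERT Base.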
