ShareBERT: Embeddings Are Capable of Learning Hidden Layers

Authors

  • Jia Cheng Hu, University of Modena and Reggio Emilia
  • Roberto Cavicchioli, University of Modena and Reggio Emilia
  • Giulia Berardinelli, University of Modena and Reggio Emilia
  • Alessandro Capotondi, University of Modena and Reggio Emilia

DOI:

https://doi.org/10.1609/aaai.v38i16.29781

Keywords:

NLP: (Large) Language Models, CSO: Constraint Optimization

Abstract

The deployment of Pre-trained Language Models on memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature show that small models often suffer a noticeable performance degradation and need to be paired with transfer learning methods such as Knowledge Distillation. In this work, we propose a parameter-sharing method that shares parameters between the embeddings and the hidden layers, enabling the design of near-zero-parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which preserves up to 95.5% of BERT Base performance using only 5M parameters (21.9× fewer) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
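To make the idea in the abstract concrete, below is a minimal PyTorch sketch of embedding-to-hidden-layer weight sharing. It is not the ShareBERT architecture from the paper: the class names (TinySharedEncoder, TinySharedEncoderLayer), the shapes, and the row-slicing scheme are illustrative assumptions. It only shows how hidden-layer projections can reuse the token-embedding matrix so that the encoder layers themselves add almost no parameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySharedEncoderLayer(nn.Module):
    """Toy encoder layer whose feed-forward weights are slices of the
    token-embedding matrix, so the layer adds almost no new parameters.
    The wiring and slicing scheme are hypothetical, for illustration only."""
    def __init__(self, embedding: nn.Embedding, hidden: int):
        super().__init__()
        assert embedding.num_embeddings >= 2 * hidden, "need enough embedding rows to slice from"
        self.embedding = embedding          # shared module, not a copy
        self.hidden = hidden
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        h = self.hidden
        w = self.embedding.weight           # (vocab_size, hidden)
        w_in = w[:h, :]                     # reuse rows 0..h-1 as a hidden x hidden projection
        w_out = w[h:2 * h, :]               # reuse the next h rows for the output projection
        y = F.linear(self.act(F.linear(x, w_in)), w_out)
        return self.norm(x + y)

class TinySharedEncoder(nn.Module):
    """Stack of layers that all borrow their weights from one embedding table."""
    def __init__(self, vocab_size=30522, hidden=256, num_layers=4):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList(
            TinySharedEncoderLayer(self.embeddings, hidden) for _ in range(num_layers)
        )

    def forward(self, token_ids):
        x = self.embeddings(token_ids)
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__":
    # Nearly all parameters live in the embedding table; shared weights are counted once.
    model = TinySharedEncoder()
    print(sum(p.numel() for p in model.parameters()))

Slicing the embedding table inside forward() keeps the hidden projections as live views of the shared parameter, so gradient updates from the encoder layers and from the embedding lookup accumulate in the same tensor.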

Published

2024-03-24

How to Cite

Hu, J. C., Cavicchioli, R., Berardinelli, G., & Capotondi, A. (2024). ShareBERT: Embeddings Are Capable of Learning Hidden Layers. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18225-18233. https://doi.org/10.1609/aaai.v38i16.29781

Issue

Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 16 (2024)

Section

AAAI Technical Track on Natural Language Processing I