ShareBERT: Embeddings Are Capable of Learning Hidden Layers
DOI:
https://doi.org/10.1609/aaai.v38i16.29781
Keywords:
NLP: (Large) Language Models, CSO: Constraint Optimization
Abstract
The deployment of Pre-trained Language Models on memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature have shown that small models often suffer a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that shares parameters between the embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which can preserve up to 95.5% of BERT Base performance using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
Published
2024-03-24
How to Cite
Hu, J. C., Cavicchioli, R., Berardinelli, G., & Capotondi, A. (2024). ShareBERT: Embeddings Are Capable of Learning Hidden Layers. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18225-18233. https://doi.org/10.1609/aaai.v38i16.29781
Section
AAAI Technical Track on Natural Language Processing I