MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

Authors

  • Andrea Gurioli, University of Bologna
  • Federico Pennino, University of Bologna
  • Joao Monteiro, Apple MLR
  • Maurizio Gabbrielli, University of Bologna

DOI:

https://doi.org/10.1609/aaai.v40i37.40348

Abstract

Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, thereby improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
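The core mechanism — designating intermediate layers as exit heads and letting the deepest exit guide the earlier ones — can be sketched in a few lines. The example below is a minimal NumPy illustration, not the paper's implementation: the toy "layers", the choice of exit layers {2, 4, 8}, and the mean-squared-error objective are all assumptions made for clarity (the paper may use a different distillation objective and head design).

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # Stand-in for one encoder layer: a linear map with a tanh nonlinearity.
    return np.tanh(x @ w)

def forward_with_exits(x, weights, exit_ids):
    # Run all layers, keeping the hidden state at each designated exit head.
    exits = {}
    h = x
    for i, w in enumerate(weights, start=1):
        h = layer(h, w)
        if i in exit_ids:
            exits[i] = h
    return exits

def self_distillation_loss(exits, final_id):
    # Each early exit is pulled toward the deepest exit, which acts as the
    # teacher (in a real training loop the teacher would be stop-gradient).
    teacher = exits[final_id]
    losses = [np.mean((exits[i] - teacher) ** 2)
              for i in exits if i != final_id]
    return float(np.mean(losses))

d = 16
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(8)]
x = rng.standard_normal((4, d))        # a batch of 4 toy embeddings
exit_ids = {2, 4, 8}                   # hypothetical exit-head layers
exits = forward_with_exits(x, weights, exit_ids)
loss = self_distillation_loss(exits, final_id=8)
```

At deployment time, one simply truncates the model at any exit head and uses that layer's embedding, trading depth (and latency) for the accuracy recovered by the distillation signal.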

Published

2026-03-14

How to Cite

Gurioli, A., Pennino, F., Monteiro, J., & Gabbrielli, M. (2026). MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 30897-30906. https://doi.org/10.1609/aaai.v40i37.40348

Section

AAAI Technical Track on Natural Language Processing II