MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

Authors

  • Andrea Gurioli, University of Bologna
  • Federico Pennino, University of Bologna
  • Joao Monteiro, Apple MLR
  • Maurizio Gabbrielli, University of Bologna

DOI:

https://doi.org/10.1609/aaai.v40i37.40348

Abstract

Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, thereby improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
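The core mechanism — designating intermediate layers as exit heads and letting the deepest exit guide the earlier ones — can be sketched in a few lines. The example below is a minimal NumPy illustration, not the paper's implementation: the toy "layers", the choice of exit layers {2, 4, 8}, and the mean-squared-error objective are all assumptions made for clarity (the paper may use a different distillation objective and head design).

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # Stand-in for one encoder layer: a linear map with a tanh nonlinearity.
    return np.tanh(x @ w)

def forward_with_exits(x, weights, exit_ids):
    # Run all layers, keeping the hidden state at each designated exit head.
    exits = {}
    h = x
    for i, w in enumerate(weights, start=1):
        h = layer(h, w)
        if i in exit_ids:
            exits[i] = h
    return exits

def self_distillation_loss(exits, final_id):
    # Each early exit is pulled toward the deepest exit, which acts as the
    # teacher (in a real training loop the teacher would be stop-gradient).
    teacher = exits[final_id]
    losses = [np.mean((exits[i] - teacher) ** 2)
              for i in exits if i != final_id]
    return float(np.mean(losses))

d = 16
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(8)]
x = rng.standard_normal((4, d))        # a batch of 4 toy embeddings
exit_ids = {2, 4, 8}                   # hypothetical exit-head layers
exits = forward_with_exits(x, weights, exit_ids)
loss = self_distillation_loss(exits, final_id=8)
```

At deployment time, one simply truncates the model at any exit head and uses that layer's embedding, trading depth (and latency) for the accuracy recovered by the distillation signal.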

Published

2026-03-14

How to Cite

Gurioli, A., Pennino, F., Monteiro, J., & Gabbrielli, M. (2026). MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 30897-30906. https://doi.org/10.1609/aaai.v40i37.40348

Section

AAAI Technical Track on Natural Language Processing II