MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings
DOI:
https://doi.org/10.1609/aaai.v40i37.40348
Abstract
Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, thereby improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
Published
2026-03-14
How to Cite
Gurioli, A., Pennino, F., Monteiro, J., & Gabbrielli, M. (2026). MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 30897-30906. https://doi.org/10.1609/aaai.v40i37.40348
Issue
Section
AAAI Technical Track on Natural Language Processing II