Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models

Authors

  • Bo Wang, The Hong Kong University of Science and Technology (Guangzhou)
  • Junzhuo Li, The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
  • Hong Chen, The Hong Kong University of Science and Technology (Guangzhou)
  • Yuanlin Chu, The Hong Kong University of Science and Technology (Guangzhou)
  • Yuxuan Fan, The Hong Kong University of Science and Technology (Guangzhou)
  • Xuming Hu, The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i39.40622

Abstract

Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from dense architectures, remains poorly understood. To address this gap, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes the log-probability increase across neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M (~5.0T tokens) and 600K (~2.5T tokens) training steps, respectively. Our experiments uncover three patterns: (1) Low-entropy backbone. The top ~1% of MoE neurons capture over 45% of positive updates, forming a high-utility core that is absent in the dense baseline. (2) Early consolidation. The MoE model locks into a stable importance profile within <100K steps, whereas the dense model remains volatile throughout training. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by <10%, compared with >50% for the dense model, showing that sparsity fosters distributed rather than brittle knowledge storage. Together, these patterns demonstrate that sparsity builds an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
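The abstract's "low-entropy backbone" finding can be illustrated with a toy computation: score each neuron by a gate-weighted first-order contribution to the log-probability increase, then measure how much of the total positive attribution the top ~1% of neurons capture. Note this is a minimal sketch under assumed definitions (the gate values, activations, and gradients below are synthetic, and the product form is an illustrative first-order approximation, not the paper's exact Gated-LPI formulation):

```python
import numpy as np

def gated_lpi(gate, activation, grad_logp):
    """Toy per-neuron attribution: router gate value times the
    first-order (activation * gradient) contribution to the
    log-probability increase. Illustrative assumption only."""
    return gate * activation * grad_logp

rng = np.random.default_rng(0)
n = 1000                        # number of neurons (synthetic)
gate = rng.random(n)            # MoE router gate values (synthetic)
act = rng.standard_normal(n)    # neuron activations (synthetic)
grad = rng.standard_normal(n)   # d log p / d activation (synthetic)

scores = gated_lpi(gate, act, grad)

# Concentration statistic analogous to the paper's: fraction of all
# positive attribution captured by the top 1% of neurons.
pos = np.clip(scores, 0.0, None)
top_k = max(1, int(0.01 * n))
share = np.sort(pos)[::-1][:top_k].sum() / pos.sum()
print(f"Share of positive attribution in top 1% of neurons: {share:.2%}")
```

On real checkpoints one would replace the synthetic arrays with recorded gate values, activations, and log-probability gradients per neuron, and track `share` across training steps to see whether a high-utility core consolidates early.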

Published

2026-03-14

How to Cite

Wang, B., Li, J., Chen, H., Chu, Y., Fan, Y., & Hu, X. (2026). Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33359–33367. https://doi.org/10.1609/aaai.v40i39.40622

Section

AAAI Technical Track on Natural Language Processing IV