Transformer as Linear Expansion of Learngene

Authors

  • Shiyu Xia, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Miaosen Zhang, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Xu Yang, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Ruiming Chen, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Haokun Chen, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
  • Xin Geng, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China

DOI:

https://doi.org/10.1609/aaai.v38i14.29533

Keywords:

ML: Deep Neural Architectures and Foundation Models, ML: Classification and Regression, ML: Deep Learning Algorithms

Abstract

We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module the learngene. To identify the expansion mechanism, we investigate the relationship between a layer's position and its weight values, and find that a linear function approximates this relationship well. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene and train it with soft distillation. Afterwards, we can produce and initialize Transformers of varying depths by linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch, while reducing training cost by around 2×. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). When producing models of varying depths to adapt to different resource constraints, TLEG achieves comparable results while reducing the parameters stored for initializing these models by around 19× and the pre-training cost by around 5×, compared to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG offers greater flexibility and competitive performance while reducing the parameters stored for initialization by around 2.9×, compared to the pre-training approach.
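The abstract describes producing a depth-L Transformer by making each layer's weights a linear function of its position in the shared learngene parameters. Below is a minimal sketch of that idea, assuming each learngene weight is stored as a pair (base, delta) and layer l of a depth-L stack is initialized as theta_l = theta_base + (l / (L-1)) · theta_delta; the names, the exact scaling coefficient, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of linear expansion of a learngene into stacks of different depths.
import copy
import torch
import torch.nn as nn


class LearngeneLayer(nn.Module):
    """Holds the shared (base, delta) parameter pair for one Transformer block."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        template = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.base = template                     # position-independent part
        self.delta = copy.deepcopy(template)     # position-dependent part


def expand_learngene(gene: LearngeneLayer, depth: int) -> nn.ModuleList:
    """Produce a stack of `depth` blocks whose weights vary linearly with layer position."""
    layers = nn.ModuleList()
    for l in range(depth):
        coeff = l / max(depth - 1, 1)            # linear coefficient for layer l (assumed form)
        layer = copy.deepcopy(gene.base)
        with torch.no_grad():
            for p, b, d in zip(layer.parameters(),
                               gene.base.parameters(),
                               gene.delta.parameters()):
                p.copy_(b + coeff * d)            # theta_l = theta_base + coeff * theta_delta
        layers.append(layer)
    return layers


if __name__ == "__main__":
    gene = LearngeneLayer(dim=384, heads=6)
    # The same learngene initializes Transformers of different depths.
    small = expand_learngene(gene, depth=6)
    large = expand_learngene(gene, depth=12)
    x = torch.randn(2, 16, 384)
    for blk in small:
        x = blk(x)
    print(x.shape)  # torch.Size([2, 16, 384])
```

In this sketch, only the single (base, delta) pair needs to be stored to initialize models of any depth, which is the storage saving the abstract refers to; the linear coefficient used here is one plausible choice and may differ from the paper's exact parameterization.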

Published

2024-03-24

How to Cite

Xia, S., Zhang, M., Yang, X., Chen, R., Chen, H., & Geng, X. (2024). Transformer as Linear Expansion of Learngene. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 16014-16022. https://doi.org/10.1609/aaai.v38i14.29533

Issue

Vol. 38 No. 14 (2024)
Section

AAAI Technical Track on Machine Learning V