Stratified Knowledge-Density Super-Network for Scalable Vision Transformers

Authors

  • Longhua Li, Southeast University
  • Lei Qi, Southeast University
  • Xin Geng, Southeast University

DOI:

https://doi.org/10.1609/aaai.v40i27.39463

Abstract

Training and deploying multiple vision transformer (ViT) models for different resource constraints is costly and inefficient. To address this, we propose transforming a pre-trained ViT into a stratified knowledge-density super-network, where knowledge is hierarchically organized across weights. This enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes. We introduce Weighted PCA for Attention Contraction (WPAC), which concentrates knowledge into a compact set of critical weights. WPAC applies token-wise weighted principal component analysis to intermediate features and injects the resulting transformation and inverse matrices into adjacent layers, preserving the original network function while enhancing knowledge compactness. To further promote stratified knowledge organization, we propose Progressive Importance-Aware Dropout (PIAD). PIAD progressively evaluates the importance of weight groups, updates an importance-aware dropout list, and trains the super-network under this dropout regime. Experiments demonstrate that WPAC outperforms existing pruning criteria in knowledge concentration, and that its combination with PIAD offers a strong alternative to state-of-the-art model compression and model expansion methods.
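
The function-preserving step described for WPAC can be illustrated with a small NumPy sketch. This is not the authors' implementation: the choice of key features as the PCA input, the random per-token weights, the centering step, and all names (weighted_pca_basis, Wq, Wk, token_weights) are assumptions made for illustration. It shows only the general idea that an orthogonal basis from a token-weighted PCA, together with its inverse (here the transpose), can be folded into adjacent projection weights without changing the attention scores, and that the leading components can then be kept as a smaller sub-network.

```python
import numpy as np

def weighted_pca_basis(features, weights):
    """Orthogonal basis whose columns are sorted by decreasing weighted variance.

    features: (n_tokens, d) intermediate features
    weights:  (n_tokens,) non-negative per-token weights (hypothetical choice)
    """
    w = weights / weights.sum()
    mean = w @ features                           # weighted mean, shape (d,)
    centered = features - mean
    cov = (centered * w[:, None]).T @ centered    # weighted covariance, (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 64, 32, 16
x = rng.normal(size=(n_tokens, d_model))
Wq = rng.normal(size=(d_model, d_head))           # toy query projection
Wk = rng.normal(size=(d_model, d_head))           # toy key projection
token_weights = rng.uniform(0.1, 1.0, size=n_tokens)

q, k = x @ Wq, x @ Wk
P, variances = weighted_pca_basis(k, token_weights)

# Fold P into both projections. Because P is orthogonal, P @ P.T = I,
# so (x Wq P)(x Wk P)^T equals the original (x Wq)(x Wk)^T.
Wq_rot, Wk_rot = Wq @ P, Wk @ P
scores_orig = q @ k.T
scores_rot = (x @ Wq_rot) @ (x @ Wk_rot).T

# Keeping only the first r columns gives a smaller head (approximate scores).
r = 8
scores_small = (x @ Wq_rot[:, :r]) @ (x @ Wk_rot[:, :r]).T
rel_err = np.linalg.norm(scores_orig - scores_small) / np.linalg.norm(scores_orig)
print(f"relative score error with {r}/{d_head} components: {rel_err:.3f}")
```

The full-width rotation leaves the scores unchanged up to floating-point error, while the truncated version trades accuracy for size. How the token weights are chosen and which features are analyzed is specified in the paper, not here.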

Published

2026-03-14

How to Cite

Li, L., Qi, L., & Geng, X. (2026). Stratified Knowledge-Density Super-Network for Scalable Vision Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22985–22993. https://doi.org/10.1609/aaai.v40i27.39463

Section

AAAI Technical Track on Machine Learning IV