Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

Authors

  • Rongyu Zhang (Nanjing University; The Hong Kong Polytechnic University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Aosong Cheng (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Yulin Luo (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Gaole Dai (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Huanrui Yang (University of Arizona)
  • Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Ran Xu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Li Du (Nanjing University)
  • Dan Wang (Hong Kong University of Science and Technology)
  • Yuan Du (Nanjing University)

DOI:

https://doi.org/10.1609/aaai.v40i42.40922

Abstract

Continual Test-Time Adaptation (CTTA), which aims to adapt a pre-trained model to ever-evolving target domains, has emerged as an important task for vision models. Because current vision models appear to be heavily biased towards texture, continuously adapting the model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the encoding characteristics of neuron activation in neural networks, we propose the Mixture-of-Activation-Sparsity-Experts (MoASE) for the CTTA task. Given the distinct reactions of neurons with low and high activation to domain-specific and domain-agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components in each expert with a Spatial Differentiable Dropout (SDD). Based on this decomposition, we devise a Domain-Aware Router (DAR) that utilizes domain information to adaptively weight the experts that process the post-SDD sparse activations, and an Activation Sparsity Gate (ASG) that adaptively assigns the feature-selection thresholds of the SDD to different experts for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to maintain update consistency between the teacher and student experts and prevent error accumulation. Extensive experiments substantiate that MoASE achieves state-of-the-art performance in both classification and segmentation tasks.
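To make the decomposition concrete, the abstract's core idea can be sketched as follows: each neuron's activation is split into a high-activation component (domain-specific) and a low-activation component (domain-agnostic) by a per-expert threshold, and a router weights each expert's response. This is a minimal illustrative sketch, not the paper's implementation; all function and parameter names (`decompose`, `mixture_output`, `router_weights`, `thresholds`) are hypothetical stand-ins for the SDD, DAR, and ASG modules.

```python
# Hypothetical sketch of the activation decomposition described in the
# abstract. Names and thresholding logic are illustrative assumptions,
# not taken from the MoASE implementation.

def decompose(activations, threshold):
    """Split activations into high- and low-activation components.

    Values at or above `threshold` form the high (domain-specific)
    component; the rest form the low (domain-agnostic) component.
    The two components sum back to the original activations.
    """
    high = [a if a >= threshold else 0.0 for a in activations]
    low = [a if a < threshold else 0.0 for a in activations]
    return high, low


def mixture_output(activations, experts, router_weights, thresholds):
    """Combine expert outputs on per-expert sparse decompositions.

    `experts` are callables taking (high, low); `router_weights`
    stand in for the Domain-Aware Router's weighting, and
    `thresholds` stand in for the per-expert thresholds that the
    Activation Sparsity Gate would assign.
    """
    out = 0.0
    for expert, weight, thr in zip(experts, router_weights, thresholds):
        high, low = decompose(activations, thr)
        out += weight * expert(high, low)
    return out
```

In the paper, the thresholds are learned (the SDD is differentiable) and the router weights are predicted from domain information; the fixed lists here only illustrate the data flow.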

Published

2026-03-14

How to Cite

Zhang, R., Cheng, A., Luo, Y., Dai, G., Yang, H., Liu, J., … Du, Y. (2026). Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 36057–36065. https://doi.org/10.1609/aaai.v40i42.40922

Section

AAAI Technical Track on Philosophy and Ethics of AI