Active Token Mixer

Guoqiang Wei; Zhizheng Zhang; Cuiling Lan; Yan Lu; Zhibo Chen

doi:10.1609/aaai.v37i3.25376

Authors

Guoqiang Wei, University of Science and Technology of China
Zhizheng Zhang, Microsoft Research Asia
Cuiling Lan, Microsoft Research Asia
Yan Lu, Microsoft Research Asia
Zhibo Chen, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v37i3.25376

Keywords:

CV: Other Foundations of Computer Vision, CV: Learning & Optimization for CV, CV: Object Detection & Categorization, CV: Segmentation

Abstract

The three dominant network families, i.e., CNNs, Transformers, and MLPs, differ from each other mainly in how they fuse spatial contextual information, which places the design of more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token mixer, dubbed Active Token Mixer (ATM), that actively incorporates contextual information from other tokens in the global scope into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at the channel level. In this way, the spatial range of token mixing can be expanded to a global scope with limited computational complexity, and the way of token mixing is reformed. We take ATMs as the primary operators and assemble them into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.
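
The two steps named in the abstract (predicting where to look, then fusing at the channel level) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation or the code in the linked repository. The function names `predict_offsets` and `active_mix`, the linear offset predictor, the integer rounding, the restriction to horizontal shifts, and the fixed fusion weights are all assumptions made for illustration; in a trained network these quantities would be learned.

```python
from typing import List

Token = List[float]        # one token: C channel values
Grid = List[List[Token]]   # H x W grid of tokens


def predict_offsets(query: Token, w_off: List[List[float]], max_shift: int) -> List[int]:
    """Hypothetical offset predictor: one integer column shift per channel,
    computed as a linear map of the query token and clamped to +/- max_shift."""
    offsets = []
    for row in w_off:  # one weight row per channel
        raw = sum(w * q for w, q in zip(row, query))
        offsets.append(max(-max_shift, min(max_shift, round(raw))))
    return offsets


def active_mix(grid: Grid, w_off: List[List[float]],
               fuse_query: List[float], fuse_ctx: List[float]) -> Grid:
    """For every query token, gather one context value per channel from the
    column selected by that channel's predicted offset, then fuse it with the
    query value using per-channel weights (assumed fixed here)."""
    height, width = len(grid), len(grid[0])
    max_shift = width - 1  # a query may reach any column of its row
    out: Grid = []
    for i in range(height):
        out_row = []
        for j in range(width):
            query = grid[i][j]
            offsets = predict_offsets(query, w_off, max_shift)
            mixed = []
            for c, shift in enumerate(offsets):
                src = max(0, min(width - 1, j + shift))  # clamp to the grid
                context = grid[i][src][c]                # channel-wise gather
                mixed.append(fuse_query[c] * query[c] + fuse_ctx[c] * context)
            out_row.append(mixed)
        out.append(out_row)
    return out


# Usage: a 1 x 3 grid with 2 channels. Channel 0 shifts by its own query
# value; channel 1 always uses offset 0 and therefore looks at itself.
grid = [[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]]
w_off = [[1.0, 0.0], [0.0, 0.0]]
result = active_mix(grid, w_off, fuse_query=[0.5, 0.5], fuse_ctx=[0.5, 0.5])
print(result)
```

In this sketch each query token reads a single context value per channel, so the cost per token grows with the number of channels rather than with the number of tokens, even though the reachable range spans the whole row. That is only meant to convey why an offset-based gather can widen the mixing range cheaply; the operator described in the paper may differ in how offsets are predicted and how contexts are fused.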
