Maximizing the Position Embedding for Vision Transformers with Global Average Pooling

Authors

  • Wonjun Lee (Yonsei University, Republic of Korea; Korea Institute of Science and Technology, Republic of Korea)
  • Bumsub Ham (Yonsei University, Republic of Korea)
  • Suhyun Kim (Korea Institute of Science and Technology, Republic of Korea)

DOI

https://doi.org/10.1609/aaai.v39i17.33997

Abstract

In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, the expressiveness of PE is limited because it is simply added to the token embedding. To overcome this limitation, a layer-wise method has been adopted that delivers PE to each layer and applies independent layer normalizations (LNs) to the token embedding and the PE. In this paper, we identify a conflicting result that occurs in the layer-wise structure when global average pooling (GAP) is used instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances the token embedding values at each layer of the layer-wise structure. We further find that this counterbalancing role is insufficient in the layer-wise structure, and MPVG addresses this by maximizing the effectiveness of PE. Through experiments, we demonstrate that PE performs a counterbalancing role and that maintaining this counterbalancing directionality significantly affects vision transformers. The experimental results show that MPVG outperforms existing methods across vision transformers on various tasks.
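
For orientation, the layer-wise structure with GAP described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' MPVG method or implementation: the names `layer_norm`, `toy_block`, and `layerwise_forward` are hypothetical, the LNs carry no learned parameters, and a simple residual tanh update stands in for the attention and MLP sublayers. It shows only the two ingredients the abstract names: PE delivered to every layer through its own LN, and token averaging in place of a class token.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_block(h):
    """Hypothetical stand-in for the attention + MLP sublayers of a transformer layer."""
    return np.tanh(h)

def layerwise_forward(tokens, pos_embed, num_layers=4):
    """Re-inject PE at every layer with separate LNs, then pool with GAP."""
    x = tokens
    for _ in range(num_layers):
        # Independent normalizations: one for the token embedding, one for the PE.
        h = layer_norm(x) + layer_norm(pos_embed)
        # Residual connection around the toy block.
        x = x + toy_block(h)
    # Global average pooling over the token axis replaces the class token.
    return layer_norm(x).mean(axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))     # 16 patch tokens, embedding dimension 8 (assumed sizes)
pos_embed = rng.normal(size=(16, 8))  # one PE vector per token position
pooled = layerwise_forward(tokens, pos_embed)
print(pooled.shape)  # (8,)
```

In a standard vision transformer, by contrast, `pos_embed` would be added to `tokens` once before the first layer; the sketch differs only in repeating that addition, with its own normalization, inside the loop.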

Published

2025-04-11

How to Cite

Lee, W., Ham, B., & Kim, S. (2025). Maximizing the Position Embedding for Vision Transformers with Global Average Pooling. Proceedings of the AAAI Conference on Artificial Intelligence, 39(17), 18154–18162. https://doi.org/10.1609/aaai.v39i17.33997

Section

AAAI Technical Track on Machine Learning III