TY  - JOUR
AU  - Guo, Qipeng
AU  - Qiu, Xipeng
AU  - Liu, Pengfei
AU  - Xue, Xiangyang
AU  - Zhang, Zheng
PY  - 2020/04/03
Y2  - 2024/03/29
TI  - Multi-Scale Self-Attention for Text Classification
JF  - Proceedings of the AAAI Conference on Artificial Intelligence
JA  - AAAI
VL  - 34
IS  - 05
SE  - AAAI Technical Track: Natural Language Processing
DO  - 10.1609/aaai.v34i05.6290
UR  - https://ojs.aaai.org/index.php/AAAI/article/view/6290
SP  - 7847-7854
AB  - In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on the linguistic perspective and the analysis of pre-trained Transformer (BERT) on a huge corpus, we further design a strategy to control the scale distribution for each layer. Results of three different kinds of tasks (21 datasets) show our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate size datasets.
ER  -