Fine-Grained Position Helps Memorizing More, a Novel Music Compound Transformer Model with Feature Interaction Fusion

Authors

  • Zuchao Li, School of Computer Science, Wuhan University
  • Ruhan Gong, School of Computer Science, Wuhan University
  • Yineng Chen, School of Computer Science, Wuhan University
  • Kehua Su, School of Computer Science, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v37i4.25650

Keywords:

APP: Art/Music/Creativity, SNLP: Applications, SNLP: Text Classification

Abstract

Because multiple events can occur simultaneously in music sequences, the compound Transformer was proposed to handle the resulting challenge of long sequences. However, the compound Transformer has two deficiencies. First, since the order of events matters more in music than in natural language, the information provided by the original absolute position embedding is not precise enough. Second, there are important correlations between the tokens within a compound word, which the current compound Transformer ignores. Therefore, in this work, we propose an improved compound Transformer model for music understanding. Specifically, we propose an attribute embedding fusion module and a novel position encoding scheme with absolute-relative consideration. In the attribute embedding fusion module, different attributes are fused through feature permutation using a multi-head self-attention mechanism in order to capture rich interactions between attributes. In the novel position encoding scheme, we propose RoAR position encoding, which realizes rotational absolute position encoding, relative position encoding, and absolute-relative position interactive encoding, providing a clear and rich ordering of musical events. An empirical study on four typical music understanding tasks shows that our attribute fusion approach and RoAR position encoding bring large performance gains. In addition, we further investigate the impact of masked language modeling and causal language modeling pre-training on music understanding.
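To make the attribute embedding fusion idea concrete, the following is a minimal sketch of fusing the attribute embeddings of one compound token with multi-head self-attention applied across the attribute axis. All names, dimensions, and the final projection are hypothetical illustrations of the general idea described in the abstract, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttributeFusion(nn.Module):
    """Sketch: fuse the attribute embeddings of a compound token by letting
    the attributes attend to each other before they are merged into a single
    token embedding (hypothetical, not the paper's exact module)."""

    def __init__(self, num_attributes: int, d_model: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.proj = nn.Linear(num_attributes * d_model, d_model)

    def forward(self, attr_emb: torch.Tensor) -> torch.Tensor:
        # attr_emb: (num_tokens, num_attributes, d_model)
        # Self-attention across the attribute axis captures interactions
        # between attributes of the same compound word.
        fused, _ = self.attn(attr_emb, attr_emb, attr_emb)
        # Flatten the attended attributes and project to one token embedding.
        return self.proj(fused.flatten(start_dim=1))


# Hypothetical usage: compound tokens with 4 attributes
# (e.g. pitch, duration, velocity, bar position), each embedded to 128 dims.
if __name__ == "__main__":
    fusion = AttributeFusion(num_attributes=4, d_model=128)
    x = torch.randn(32, 4, 128)   # 32 compound tokens
    print(fusion(x).shape)        # torch.Size([32, 128])
```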

Published

2023-06-26

How to Cite

Li, Z., Gong, R., Chen, Y., & Su, K. (2023). Fine-Grained Position Helps Memorizing More, a Novel Music Compound Transformer Model with Feature Interaction Fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4), 5203-5212. https://doi.org/10.1609/aaai.v37i4.25650

Section

AAAI Technical Track on Domain(s) of Application