Compose with Me: Collaborative Music Inpainter for Symbolic Music Infilling

Authors

  • Zhejing Hu, The Hong Kong Polytechnic University
  • Yan Liu, The Hong Kong Polytechnic University
  • Gong Chen, The Hong Kong Polytechnic University
  • Bruce X.B. Yu, Zhejiang University-University of Illinois Urbana-Champaign Institute

DOI:

https://doi.org/10.1609/aaai.v39i2.32122

Abstract

The field of music generation has seen a surge of interest from both academia and industry, with innovative platforms such as Suno, Udio, and SkyMusic earning widespread recognition. However, the challenge of music infilling—modifying specific music segments without reconstructing the entire piece—remains a significant hurdle for both audio-based and symbolic-based models, limiting their adaptability and practicality. In this paper, we address symbolic music infilling by introducing the Collaborative Music Inpainter (CMI), an advanced human-in-the-loop (HITL) model for music infilling. The CMI features the Joint Embedding Predictive Autoregressive Generative Architecture (JEP-AGA), which learns the high-level predictive representations of the masked part that needs to be infilled during the autoregressive generative process, akin to how humans perceive and interpret music. The newly developed Dynamic Interaction Learner (DIL) achieves HITL by iteratively refining the infilled output based on user interactions alone, significantly reducing the interaction cost without requiring further input. Experimental results confirm CMI’s superior performance in music infilling, demonstrating its efficiency in producing high-quality music.

Published

2025-04-11

How to Cite

Hu, Z., Liu, Y., Chen, G., & Yu, B. X. (2025). Compose with Me: Collaborative Music Inpainter for Symbolic Music Infilling. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1327–1335. https://doi.org/10.1609/aaai.v39i2.32122

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems