Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Authors

  • Shulei Ji (Zhejiang University; Innovation Center of Yangtze River Delta, Zhejiang University)
  • Zihao Wang (Zhejiang University; Carnegie Mellon University)
  • Jiaxing Yu (Zhejiang University)
  • Xiangyuan Yang (Xi'an Jiaotong University)
  • Shuyu Li (Zhejiang University)
  • Songruoyao Wu (Zhejiang University)
  • Kejun Zhang (Zhejiang University; Innovation Center of Yangtze River Delta, Zhejiang University)

DOI:

https://doi.org/10.1609/aaai.v40i26.39378

Abstract

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignment; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone in the first cross-attention layer, while semantic and rhythmic features are fused in the second. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance on objective metrics and in subjective comparisons.
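The hierarchical conditioning described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, feature dimension, sigmoid gate, and the assumption that semantic and rhythmic sequences are time-aligned are all illustrative choices; only the overall structure (emotion in a first cross-attention layer, timestep-aware FiLM and weighted fusion of semantic and rhythmic features in a second) follows the abstract.

```python
import torch
import torch.nn as nn

class HierarchicalCondBlock(nn.Module):
    """One denoiser block with two-level conditioning (illustrative sketch only)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Layer 1: cross-attention to emotional features (sets the affective tone).
        self.emo_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Layer 2: cross-attention to fused semantic + rhythmic features.
        self.sem_rhy_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Timestep-aware FiLM: predicts scale/shift applied to the rhythmic stream.
        self.film = nn.Linear(dim, 2 * dim)
        # Timestep-aware scalar gate for semantic-vs-rhythmic weighted fusion.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, emo, sem, rhy, t_emb):
        # x:           (B, T, D) noisy music latents at the current diffusion step
        # emo/sem/rhy: (B, L, D) emotional / semantic / rhythmic condition tokens
        #              (semantic and rhythmic sequences assumed time-aligned here)
        # t_emb:       (B, D) diffusion timestep embedding
        x = x + self.emo_attn(self.norm1(x), emo, emo)[0]
        scale, shift = self.film(t_emb).chunk(2, dim=-1)
        rhy = rhy * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # FiLM on rhythm cues
        w = self.gate(t_emb).unsqueeze(1)                          # weight in (0, 1)
        cond = w * sem + (1 - w) * rhy                             # weighted fusion
        x = x + self.sem_rhy_attn(self.norm2(x), cond, cond)[0]
        return x
```

A full denoiser would stack several such blocks alongside self-attention and feed-forward sublayers; the timestep-dependent gate is what lets the balance between semantic and rhythmic cues shift over the course of the diffusion process.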

Published

2026-03-14

How to Cite

Ji, S., Wang, Z., Yu, J., Yang, X., Li, S., Wu, S., & Zhang, K. (2026). Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 22219–22227. https://doi.org/10.1609/aaai.v40i26.39378

Issue

Vol. 40 No. 26

Section

AAAI Technical Track on Machine Learning III