Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Authors

  • Shulei Ji (Zhejiang University; Innovation Center of Yangtze River Delta, Zhejiang University)
  • Zihao Wang (Zhejiang University; Carnegie Mellon University)
  • Jiaxing Yu (Zhejiang University)
  • Xiangyuan Yang (Xi'an Jiaotong University)
  • Shuyu Li (Zhejiang University)
  • Songruoyao Wu (Zhejiang University)
  • Kejun Zhang (Zhejiang University; Innovation Center of Yangtze River Delta, Zhejiang University)

DOI:

https://doi.org/10.1609/aaai.v40i26.39378

Abstract

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignment; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone in the first cross-attention layer, while semantic and rhythmic features are fused in the second. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance on objective metrics and in subjective comparisons.
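The hierarchical conditioning described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, feature dimension, sigmoid gate, and the assumption that semantic and rhythmic sequences are time-aligned are all illustrative choices; only the overall structure (emotion in a first cross-attention layer, timestep-aware FiLM and weighted fusion of semantic and rhythmic features in a second) follows the abstract.

```python
import torch
import torch.nn as nn

class HierarchicalCondBlock(nn.Module):
    """One denoiser block with two-level conditioning (illustrative sketch only)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Layer 1: cross-attention to emotional features (sets the affective tone).
        self.emo_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Layer 2: cross-attention to fused semantic + rhythmic features.
        self.sem_rhy_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Timestep-aware FiLM: predicts scale/shift applied to the rhythmic stream.
        self.film = nn.Linear(dim, 2 * dim)
        # Timestep-aware scalar gate for semantic-vs-rhythmic weighted fusion.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, emo, sem, rhy, t_emb):
        # x:           (B, T, D) noisy music latents at the current diffusion step
        # emo/sem/rhy: (B, L, D) emotional / semantic / rhythmic condition tokens
        #              (semantic and rhythmic sequences assumed time-aligned here)
        # t_emb:       (B, D) diffusion timestep embedding
        x = x + self.emo_attn(self.norm1(x), emo, emo)[0]
        scale, shift = self.film(t_emb).chunk(2, dim=-1)
        rhy = rhy * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # FiLM on rhythm cues
        w = self.gate(t_emb).unsqueeze(1)                          # weight in (0, 1)
        cond = w * sem + (1 - w) * rhy                             # weighted fusion
        x = x + self.sem_rhy_attn(self.norm2(x), cond, cond)[0]
        return x
```

A full denoiser would stack several such blocks alongside self-attention and feed-forward sublayers; the timestep-dependent gate is what lets the balance between semantic and rhythmic cues shift over the course of the diffusion process.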

Published

2026-03-14

How to Cite

Ji, S., Wang, Z., Yu, J., Yang, X., Li, S., Wu, S., & Zhang, K. (2026). Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 22219–22227. https://doi.org/10.1609/aaai.v40i26.39378

Issue

Vol. 40 No. 26

Section

AAAI Technical Track on Machine Learning III