MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Authors

  • Run Ling (JD.com, Inc.; Northeastern University)
  • Ke Cao (University of Science and Technology of China)
  • Jian Lu (Chongqing University of Posts and Telecommunications)
  • Ao Ma (JD.com, Inc.)
  • Haowei Liu (Chongqing University of Posts and Telecommunications)
  • Runze He (University of Chinese Academy of Sciences)
  • Changwei Wang (University of Chinese Academy of Sciences)
  • Rongtao Xu (University of Chinese Academy of Sciences)
  • Yihua Shao (University of Chinese Academy of Sciences)
  • Zhanjie Zhang (JD.com, Inc.)
  • Peng Wu (Northwestern Polytechnical University)
  • Guibing Guo (Northeastern University)
  • Wei Feng (JD.com, Inc.)
  • Zheng Zhang (JD.com, Inc.)
  • Jingjing Lv (JD.com, Inc.)
  • Junjie Shen (JD.com, Inc.)
  • Ching Law (JD.com, Inc.)
  • Xingwei Wang (Northeastern University)

DOI:

https://doi.org/10.1609/aaai.v40i9.37638

Abstract

Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images while ensuring that each subject preserves its natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural results, and permutation sensitivity, where changing the order of the reference inputs distorts the generated subjects. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. In addition, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To evaluate both challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
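
The abstract does not spell out how Fourier Fusion combines reference features, so the following is only an illustrative sketch of the general idea of order-independent fusion in the frequency domain, not the authors' method. It assumes each reference is a 1-D feature vector of power-of-two length, and the function name `fuse_references` and the fusion rule (mean amplitude, phase of the summed spectrum) are hypothetical choices made for this example. Because both the mean and the sum are symmetric in their arguments, reordering the references leaves the fused vector unchanged up to floating-point rounding.

```python
import cmath
from typing import List, Sequence


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spec: Sequence[complex]) -> List[complex]:
    """Inverse FFT via the conjugation identity ifft(X) = conj(fft(conj(X))) / n."""
    n = len(spec)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spec])]


def fuse_references(refs: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical fusion rule (an assumption, not taken from the paper):
    per frequency bin, use the mean amplitude across references and the
    phase of the summed spectrum, then transform back."""
    spectra = [fft(r) for r in refs]
    n = len(spectra[0])
    fused = []
    for k in range(n):
        amplitude = sum(abs(s[k]) for s in spectra) / len(spectra)
        phase = cmath.phase(sum(s[k] for s in spectra))
        fused.append(cmath.rect(amplitude, phase))
    # For real-valued inputs the fused spectrum stays conjugate-symmetric,
    # so the imaginary parts after inversion are only rounding noise.
    return [v.real for v in ifft(fused)]


if __name__ == "__main__":
    refs = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0], [2.0, 0.0, 2.0, 0.0]]
    a = fuse_references(refs)
    b = fuse_references(list(reversed(refs)))
    print(max(abs(p - q) for p, q in zip(a, b)) < 1e-9)
```

A symmetric aggregation such as this one is a simple way to remove dependence on input order; the fusion described in the paper may differ substantially.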

Published

2026-03-14

How to Cite

Ling, R., Cao, K., Lu, J., Ma, A., Liu, H., He, R., … Wang, X. (2026). MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7033–7041. https://doi.org/10.1609/aaai.v40i9.37638

Issue

Vol. 40 No. 9 (2026)

Section

AAAI Technical Track on Computer Vision VI