SwiftVideo: A Unified Framework for Few-Step Video Generation Through Trajectory-Distribution Alignment

Authors

  • Yanxiao Sun Fudan University
  • Jiafu Wu Tencent Youtu Lab
  • Yun Cao Tencent Youtu Lab
  • Chengming Xu Tencent Youtu Lab
  • Yabiao Wang Tencent Youtu Lab
  • Weijian Cao Tencent Youtu Lab
  • Donghao Luo Tencent Youtu Lab
  • Chengjie Wang Tencent Youtu Lab
  • Yanwei Fu Fudan University, Shanghai Innovation Institute

DOI:

https://doi.org/10.1609/aaai.v40i11.37881

Abstract

Diffusion-based and flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incur substantial computational overhead. While many distillation methods based solely on trajectory preservation or distribution matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts in few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach first introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. We then propose a dual-perspective alignment that encompasses distribution alignment between synthetic and real data and trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.

Published

2026-03-14

How to Cite

Sun, Y., Wu, J., Cao, Y., Xu, C., Wang, Y., Cao, W., Luo, D., Wang, C., & Fu, Y. (2026). SwiftVideo: A Unified Framework for Few-Step Video Generation Through Trajectory-Distribution Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9233-9241. https://doi.org/10.1609/aaai.v40i11.37881

Section

AAAI Technical Track on Computer Vision VIII