PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Authors

  • Sijie Wang Harbin Institute of Technology, Shenzhen
  • Qiang Wang Harbin Institute of Technology, Shenzhen
  • Shaohuai Shi Harbin Institute of Technology, Shenzhen

DOI:

https://doi.org/10.1609/aaai.v40i12.37976

Abstract

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference and high memory consumption. In this paper, we propose PipeDiT, a pipelining framework that accelerates video generation through three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) that overlaps latent-generation computation with communication among multiple GPUs, thus reducing inference latency. Second, we propose DeDiVAE, which decouples the diffusion module and the VAE module into two GPU groups whose executions can also be pipelined, reducing both memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method that further reduces the overall video generation latency. We integrate PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, PipeDiT achieves 1.06× to 4.02× speedups over OpenSoraPlan and HunyuanVideo.
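
To illustrate why placing the diffusion and VAE stages on separate GPU groups can lower latency when several requests are served, the following is a minimal sketch of a generic two-stage pipeline latency model. It is not the paper's algorithm or cost model: the function names, the assumption of fixed per-request stage times, and the omission of inter-group transfer cost are all hypothetical simplifications made for illustration.

```python
def sequential_latency(num_requests, t_diffusion, t_vae):
    """Total time when every request runs denoising and then VAE decoding
    back to back, with no overlap between requests (hypothetical baseline)."""
    return num_requests * (t_diffusion + t_vae)


def pipelined_latency(num_requests, t_diffusion, t_vae):
    """Total time for an idealized two-stage pipeline in which VAE decoding of
    request i overlaps with denoising of request i + 1.

    Assumes fixed stage times and zero transfer cost between the two GPU
    groups; both are simplifying assumptions, not measurements.
    """
    if num_requests == 0:
        return 0.0
    # The slower stage sets the steady-state rate of the pipeline.
    bottleneck = max(t_diffusion, t_vae)
    # Fill the pipeline once, then one bottleneck interval per extra request.
    return t_diffusion + (num_requests - 1) * bottleneck + t_vae
```

Under this toy model the saving per additional request equals the time of the faster stage, so the benefit grows with the number of queued requests and shrinks when one stage dominates the other.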

Published

2026-03-14

How to Cite

Wang, S., Wang, Q., & Shi, S. (2026). PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10092–10100. https://doi.org/10.1609/aaai.v40i12.37976

Section

AAAI Technical Track on Computer Vision IX