Fine-flow Distilling Coarse-flow Video Generation for Long-Term Driving World Model
DOI:
https://doi.org/10.1609/aaai.v40i31.39860
Abstract
Driving world models simulate future scenes via video generation conditioned on the current state and actions. However, current models often suffer from serious error accumulation when predicting the long-term future, which limits practical applications. Recent studies adopt the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are typically trained on short video clips, and multi-step roll-out generation struggles to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world-model learning into large-motion learning and bidirectional continuous-motion learning. Then, exploiting the continuity of driving scenes, we propose a simple distillation method in which fine-grained video flows serve as self-supervised signals for coarse-grained flows; the distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term, temporally coherent videos. On NuScenes, compared with state-of-the-art front-view models, our model improves FVD by 27% and reduces inference time by 85% on the task of generating 110+ frames.
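This page carries no code; as a rough illustration of the fine-to-coarse distillation idea described in the abstract, below is a minimal PyTorch-style sketch in which fine-grained flows between adjacent frames, composed over a temporal window, act as a frozen self-supervised target for one coarse-grained flow spanning the same window. It is a sketch under assumptions, not the paper's implementation: `coarse_net`, `fine_net`, the window size `stride`, and the L1 objective are all hypothetical choices.

```python
import torch
import torch.nn.functional as F


def compose_flows(flow_ab: torch.Tensor, flow_bc: torch.Tensor) -> torch.Tensor:
    """Compose optical flows a->b and b->c into a->c.

    Both flows have shape (B, 2, H, W), channel order (dx, dy).
    flow_bc is warped into frame a's coordinates via flow_ab before adding.
    """
    _, _, H, W = flow_ab.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=flow_ab.device, dtype=flow_ab.dtype),
        torch.arange(W, device=flow_ab.device, dtype=flow_ab.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0)       # (2, H, W) pixel coordinates
    sample = base + flow_ab                   # (B, 2, H, W) positions in frame b
    grid = torch.stack(
        (2.0 * sample[:, 0] / (W - 1) - 1.0,  # x normalized to [-1, 1]
         2.0 * sample[:, 1] / (H - 1) - 1.0), # y normalized to [-1, 1]
        dim=-1,
    )                                         # (B, H, W, 2) for grid_sample
    warped_bc = F.grid_sample(flow_bc, grid, mode="bilinear", align_corners=True)
    return flow_ab + warped_bc


def fine_to_coarse_distill_loss(coarse_net, fine_net, frames, stride=4):
    """Hypothetical distillation loss (not the paper's exact formulation).

    frames: (B, T, C, H, W) video clip with T > stride.
    Fine flows between adjacent frames, composed over `stride` steps, serve
    as the self-supervised target for one coarse flow across the window.
    """
    with torch.no_grad():  # fine-flow teacher provides the target, no gradient
        target = fine_net(frames[:, 0], frames[:, 1])
        for t in range(1, stride):
            target = compose_flows(target, fine_net(frames[:, t], frames[:, t + 1]))

    pred = coarse_net(frames[:, 0], frames[:, stride])  # one large-motion step
    return F.l1_loss(pred, target)
```

Composing the teacher flows via warping (rather than naive summation) keeps the target geometrically consistent, which is the point of using continuous fine flows to supervise large coarse motions.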
Published
2026-03-14
How to Cite
Wang, X., Wu, Z., & Peng, P. (2026). Fine-flow Distilling Coarse-flow Video Generation for Long-Term Driving World Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26526–26534. https://doi.org/10.1609/aaai.v40i31.39860
Section
AAAI Technical Track on Machine Learning VIII