Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame

Authors

  • Qinglong Cao (Shanghai Jiao Tong University, Shanghai; Eastern Institute of Technology, Ningbo)
  • Xirui Li (Shanghai Jiao Tong University, Shanghai)
  • Ding Wang (Shanghai Jiao Tong University, Shanghai; Eastern Institute of Technology, Ningbo)
  • Chao Ma (Shanghai Jiao Tong University, Shanghai)
  • Yuntian Chen (Eastern Institute of Technology, Ningbo)
  • Xiaokang Yang (Shanghai Jiao Tong University, Shanghai)

DOI:

https://doi.org/10.1609/aaai.v40i4.37250

Abstract

Video diffusion models have achieved impressive results in natural scene generation, yet they struggle to generalize to scientific phenomena such as fluid simulations and meteorological processes, whose underlying dynamics are governed by scientific laws. These tasks pose unique challenges, including severe domain gaps, limited training data, and a lack of descriptive language annotations. To address these challenges, we extract latent knowledge of scientific phenomena and propose a new framework that teaches video diffusion models to generate scientific phenomena from a single initial frame. Specifically, static knowledge is extracted with pre-trained masked autoencoders, while dynamic knowledge is derived from a pre-trained optical flow prediction model. Then, building on the aligned spatial relations between the CLIP vision and language encoders, the visual embeddings of scientific phenomena, guided by this latent knowledge, are projected into pseudo-language prompt embeddings in both the spatial and frequency domains. By incorporating these prompts and fine-tuning the video diffusion model, we enable the generation of videos that better adhere to scientific laws. Extensive experiments on both computational fluid dynamics simulations and real-world typhoon observations demonstrate the effectiveness of our approach, which achieves superior fidelity and consistency across diverse scientific scenarios.
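
The abstract does not specify how the projection into pseudo-language prompt embeddings is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the latent knowledge is added to the visual embeddings, that each domain uses a single linear map, and that the frequency branch reweights an FFT taken along the embedding axis. All names (`pseudo_prompt_embeddings`, `w_spatial`, `w_freq`, `freq_gain`) and all dimensions are hypothetical, and random arrays stand in for real CLIP, masked autoencoder, and optical flow features.

```python
import numpy as np

def pseudo_prompt_embeddings(visual_emb, knowledge_emb, w_spatial, w_freq, freq_gain):
    """Map knowledge-guided visual tokens to pseudo-prompt tokens (illustrative only).

    visual_emb, knowledge_emb: (n_tokens, d_vis) arrays.
    w_spatial, w_freq:         (d_vis, d_text) projection matrices (assumed learnable).
    freq_gain:                 (d_vis,) per-frequency weights (assumed learnable).
    """
    # Assumption: the latent knowledge guides the visual tokens by simple addition.
    guided = visual_emb + knowledge_emb

    # Spatial-domain branch: a plain linear projection into the text embedding size.
    spatial_tokens = guided @ w_spatial

    # Frequency-domain branch: reweight the spectrum along the embedding axis,
    # return to the original domain (keeping the real part), then project.
    spectrum = np.fft.fft(guided, axis=-1)
    reweighted = np.fft.ifft(spectrum * freq_gain, axis=-1).real
    freq_tokens = reweighted @ w_freq

    # Assumption: both branches contribute separate prompt tokens.
    return np.concatenate([spatial_tokens, freq_tokens], axis=0)

# Random placeholders for features that real encoders would produce.
rng = np.random.default_rng(0)
n_tokens, d_vis, d_text = 4, 16, 32
visual_emb = rng.standard_normal((n_tokens, d_vis))
knowledge_emb = rng.standard_normal((n_tokens, d_vis))
w_spatial = rng.standard_normal((d_vis, d_text))
w_freq = rng.standard_normal((d_vis, d_text))
freq_gain = np.ones(d_vis)  # all-ones gain leaves the spectrum unchanged

prompt = pseudo_prompt_embeddings(visual_emb, knowledge_emb, w_spatial, w_freq, freq_gain)
print(prompt.shape)  # (8, 32)
```

In a pipeline of the kind the abstract describes, tokens like these would take the place of text-encoder outputs when conditioning the video diffusion model, and the projection weights would be trained during fine-tuning.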

Published

2026-03-14

How to Cite

Cao, Q., Li, X., Wang, D., Ma, C., Chen, Y., & Yang, X. (2026). Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2625–2633. https://doi.org/10.1609/aaai.v40i4.37250

Section

AAAI Technical Track on Computer Vision I