DFDNet: Disentangling and Filtering Dynamics for Enhanced Video Prediction

Lianqiang Gan; Junyu Lai; Jingze Ju; Lianli Gao; Yi Bin

doi:10.1609/aaai.v39i3.32314

Authors

Lianqiang Gan, School of Aeronautics and Astronautics, University of Electronic Science and Technology of China
Junyu Lai, School of Aeronautics and Astronautics, University of Electronic Science and Technology of China; Aircraft Swarm Intelligent Sensing and Cooperative Control Key Laboratory of Sichuan Province
Jingze Ju, School of Aeronautics and Astronautics, University of Electronic Science and Technology of China
Lianli Gao, School of Computer Science and Engineering, University of Electronic Science and Technology of China
Yi Bin, School of Computer Science and Technology, Tongji University

DOI:

https://doi.org/10.1609/aaai.v39i3.32314

Abstract

Videos inherently contain complex temporal dynamics across various spatial directions, often entangled in ways that obscure effective dynamic extraction. Previous studies typically process video spatiotemporal features without disentangling them, which hampers their ability to extract dynamic information. Additionally, the extraction of dynamics is disrupted by transient high-dynamic information in video sequences, e.g., noise or flicker, which has received limited attention in the literature. To tackle these problems, this paper proposes the Disentangling and Filtering Dynamics Network (DFDNet). First, to disentangle the interwoven dynamics, DFDNet decomposes the spatially encoded video sequences into lower-dimensional sequences. Second, a learnable threshold filter is proposed to eliminate the transient high-dynamic information. Third, the model incorporates an MLP to extract the temporal dependencies from the disentangled and filtered sequences. DFDNet demonstrates competitive performance across four datasets, including both low- and high-resolution videos. Specifically, on the low-resolution Moving MNIST dataset, DFDNet achieves a 19% improvement in MSE over the previous state-of-the-art model. On the high-resolution SJTU4K dataset, it outperforms the previous state-of-the-art model by 10% on the LPIPS metric at a similar inference time.
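
The abstract names three stages (disentangle, filter, temporal MLP) but does not give their details. The sketch below only illustrates the data flow under loudly stated assumptions and is not the authors' implementation: `disentangle` assumes one 1-D temporal sequence per spatial location, `threshold_filter` is a fixed-threshold rate limiter standing in for the learnable threshold filter, and `temporal_mlp` is a tiny two-layer network with random, untrained weights. All function names, shapes, and the value of `tau` are hypothetical.

```python
import random

def disentangle(frames):
    """Split a T x H x W sequence into H*W one-dimensional temporal sequences.
    Assumption: one sequence per spatial location (the paper may decompose differently)."""
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    return [[frames[t][i][j] for t in range(T)] for i in range(H) for j in range(W)]

def threshold_filter(seq, tau):
    """Limit each step-to-step change to at most tau in magnitude.
    Assumption: tau is fixed here; in a learnable filter it would be a trained parameter."""
    out = [seq[0]]
    for x in seq[1:]:
        delta = max(-tau, min(tau, x - out[-1]))  # clamp the change, not the value
        out.append(out[-1] + delta)
    return out

def temporal_mlp(seq, w1, b1, w2, b2):
    """Two-layer MLP along the time axis: T_in inputs -> hidden (ReLU) -> T_out outputs."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, seq)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

# Toy input: T=4 frames of size 1x2, with a flicker-like spike at t=2 in the first pixel.
frames = [[[0.0, 0.0]], [[0.1, 0.0]], [[5.0, 0.0]], [[0.3, 0.0]]]

seqs = disentangle(frames)
filtered = [threshold_filter(s, tau=0.5) for s in seqs]

# Random, untrained weights purely to show shapes: T_in=4, hidden=8, T_out=2.
random.seed(0)
T_IN, HIDDEN, T_OUT = 4, 8, 2
w1 = [[random.uniform(-0.5, 0.5) for _ in range(T_IN)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(T_OUT)]
b2 = [0.0] * T_OUT

predictions = [temporal_mlp(s, w1, b1, w2, b2) for s in filtered]
print(filtered[0])
print(len(predictions), len(predictions[0]))
```

In this toy run the spike of 5.0 is limited to a change of 0.5 from the previous filtered value, while the slower changes pass through unchanged, which is the kind of behavior the abstract attributes to filtering transient high-dynamic information.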
