Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

Authors

  • Xuesong Nie, Zhejiang University
  • Yunfeng Yan, Zhejiang University
  • Siyuan Li, Zhejiang University; Westlake University
  • Cheng Tan, Zhejiang University; Westlake University
  • Xi Chen, The University of Hong Kong
  • Haoyuan Jin, Zhejiang University
  • Zhihang Zhu, Zhejiang University
  • Stan Z. Li, Westlake University
  • Donglian Qi, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v38i5.28230

Keywords:

CV: Video Understanding & Activity Analysis, ML: Unsupervised & Self-Supervised Learning

Abstract

Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods.
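
To make the frequency decomposition mentioned in the abstract concrete, here is a minimal sketch of a single-level 3D discrete wavelet transform applied to a short clip. It is not the authors' implementation: the Haar wavelet, the (T, H, W) array layout, and the helper names `haar_split` and `haar_dwt3d` are assumptions chosen for illustration only. It shows how one low-frequency sub-band and seven high-frequency sub-bands arise from filtering along time, height, and width.

```python
import numpy as np


def haar_split(x, axis):
    """One orthonormal Haar analysis step along `axis`; returns (low, high)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)   # scaled pairwise average: smooth content
    high = (even - odd) / np.sqrt(2.0)  # scaled pairwise difference: rapid changes
    return low, high


def haar_dwt3d(clip):
    """Single-level 3D Haar DWT of a (T, H, W) array (hypothetical helper).

    Returns a dict of eight sub-bands keyed "LLL" ... "HHH", where the
    letters give the filter used along (time, height, width) in that order.
    """
    clip = np.asarray(clip, dtype=float)
    if clip.ndim != 3 or any(n % 2 for n in clip.shape):
        raise ValueError("expected a 3D array with even-sized dimensions")
    bands = {"": clip}
    for axis in range(3):
        split = {}
        for name, arr in bands.items():
            low, high = haar_split(arr, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((4, 8, 8))  # 4 frames of 8x8 pixels
    for name, band in haar_dwt3d(clip).items():
        print(name, band.shape, float((band ** 2).sum()))
```

Each sub-band has half the resolution of the input along every axis, which is one plausible reading of why operating in the wavelet domain can reduce computation. "LLL" holds the coarse content, while the other seven bands carry the high-frequency detail that a loss such as the abstract's High-Frequency Focal Loss would presumably emphasize.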

Published

2024-03-24

How to Cite

Nie, X., Yan, Y., Li, S., Tan, C., Chen, X., Jin, H., Zhu, Z., Li, S. Z., & Qi, D. (2024). Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4334-4342. https://doi.org/10.1609/aaai.v38i5.28230

Issue

Vol. 38 No. 5 (2024)

Section

AAAI Technical Track on Computer Vision IV