WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation

Authors

  • Zishan Shu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Juntong Wu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China
  • Wei Yan, School of Physics, Peking University, Beijing, China
  • Xudong Liu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Hongyu Zhang, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Chang Liu, Department of Automation and BNRist, Tsinghua University, Beijing, China
  • Youdong Mao, School of Physics, Peking University, Beijing, China; Center for Quantitative Biology, Peking University, Beijing, China; National Biomedical Imaging Center, Peking University, Beijing, China; Peking-Tsinghua Joint Center for Life Sciences, Peking University, Beijing, China
  • Jie Chen, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China

DOI:

https://doi.org/10.1609/aaai.v40i30.39737

Abstract

Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency–time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far below the quadratic cost of attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6× higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a modeling bias complementary to heat-based methods, effectively capturing both the global coherence and the high-frequency detail essential for rich visual semantics.
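To make the mechanism sketched in the abstract concrete, the snippet below illustrates the general idea of a frequency–time decoupled wave filter, not the authors' actual WPO implementation. It is a minimal NumPy sketch under stated assumptions: the feature map is taken as the initial condition u(·, 0) of the damped wave equation u_tt + 2γ u_t = c² ∇²u with zero initial velocity, each Fourier mode then evolves independently in closed form, and the whole operator costs O(N log N) via FFTs. The function name `wave_propagation_operator` and the parameters `t`, `c`, `gamma` are hypothetical, chosen here for illustration.

```python
import numpy as np

def wave_propagation_operator(x, t=1.0, c=1.0, gamma=0.1):
    """Illustrative sketch (not the paper's implementation) of a
    frequency-time decoupled wave filter on a 2-D feature map.

    Assumes x is the initial condition u(., 0) of the damped wave equation
        u_tt + 2*gamma*u_t = c^2 * Laplacian(u)
    with zero initial velocity. In the Fourier domain each mode obeys an
    independent damped oscillator ODE with closed-form solution
        u_hat(k, t) = e^{-gamma t} (cos(w_d t) + (gamma / w_d) sin(w_d t)) * u_hat(k, 0),
    where w_d = sqrt(c^2 |k|^2 - gamma^2) for underdamped modes.
    """
    H, W = x.shape
    # Spatial angular frequencies for each FFT bin.
    ky = 2.0 * np.pi * np.fft.fftfreq(H)
    kx = 2.0 * np.pi * np.fft.fftfreq(W)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2  # |k|^2 on the grid

    # Per-mode underdamped frequency; near-zero modes (c^2|k|^2 <= gamma^2)
    # are clamped here for simplicity -- a real implementation would handle
    # the overdamped branch separately.
    wd = np.sqrt(np.maximum(c**2 * k2 - gamma**2, 1e-12))
    gain = np.exp(-gamma * t) * (np.cos(wd * t) + (gamma / wd) * np.sin(wd * t))

    # O(N log N) global interaction: FFT -> per-frequency multiply -> inverse FFT.
    x_hat = np.fft.fft2(x)
    return np.real(np.fft.ifft2(gain * x_hat))
```

Because the gain is a closed-form function of |k| and t, frequency and propagation time are decoupled exactly as the abstract describes: low |k| (global layout) and high |k| (edges, textures) receive separately controlled oscillatory responses, rather than the uniform low-pass decay of a heat-equation filter.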

Published

2026-03-14

How to Cite

Shu, Z., Wu, J., Yan, W., Liu, X., Zhang, H., Liu, C., … Chen, J. (2026). WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25428–25436. https://doi.org/10.1609/aaai.v40i30.39737

Section

AAAI Technical Track on Machine Learning VII