WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation

Authors

  • Zishan Shu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Juntong Wu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China
  • Wei Yan, School of Physics, Peking University, Beijing, China
  • Xudong Liu, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Hongyu Zhang, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Chang Liu, Department of Automation and BNRist, Tsinghua University, Beijing, China
  • Youdong Mao, School of Physics, Peking University, Beijing, China; Center for Quantitative Biology, Peking University, Beijing, China; National Biomedical Imaging Center, Peking University, Beijing, China; Peking-Tsinghua Joint Center for Life Sciences, Peking University, Beijing, China
  • Jie Chen, School of Electronic and Computer Engineering, Peking University, Shenzhen, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China

DOI:

https://doi.org/10.1609/aaai.v40i30.39737

Abstract

Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency–time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far below the quadratic cost of attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6× higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a modeling bias complementary to heat-based methods, effectively capturing both the global coherence and the high-frequency detail essential for rich visual semantics.
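To make the mechanism sketched in the abstract concrete, the snippet below illustrates the general idea of a frequency–time decoupled wave filter, not the authors' actual WPO implementation. It is a minimal NumPy sketch under stated assumptions: the feature map is taken as the initial condition u(·, 0) of the damped wave equation u_tt + 2γ u_t = c² ∇²u with zero initial velocity, each Fourier mode then evolves independently in closed form, and the whole operator costs O(N log N) via FFTs. The function name `wave_propagation_operator` and the parameters `t`, `c`, `gamma` are hypothetical, chosen here for illustration.

```python
import numpy as np

def wave_propagation_operator(x, t=1.0, c=1.0, gamma=0.1):
    """Illustrative sketch (not the paper's implementation) of a
    frequency-time decoupled wave filter on a 2-D feature map.

    Assumes x is the initial condition u(., 0) of the damped wave equation
        u_tt + 2*gamma*u_t = c^2 * Laplacian(u)
    with zero initial velocity. In the Fourier domain each mode obeys an
    independent damped oscillator ODE with closed-form solution
        u_hat(k, t) = e^{-gamma t} (cos(w_d t) + (gamma / w_d) sin(w_d t)) * u_hat(k, 0),
    where w_d = sqrt(c^2 |k|^2 - gamma^2) for underdamped modes.
    """
    H, W = x.shape
    # Spatial angular frequencies for each FFT bin.
    ky = 2.0 * np.pi * np.fft.fftfreq(H)
    kx = 2.0 * np.pi * np.fft.fftfreq(W)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2  # |k|^2 on the grid

    # Per-mode underdamped frequency; near-zero modes (c^2|k|^2 <= gamma^2)
    # are clamped here for simplicity -- a real implementation would handle
    # the overdamped branch separately.
    wd = np.sqrt(np.maximum(c**2 * k2 - gamma**2, 1e-12))
    gain = np.exp(-gamma * t) * (np.cos(wd * t) + (gamma / wd) * np.sin(wd * t))

    # O(N log N) global interaction: FFT -> per-frequency multiply -> inverse FFT.
    x_hat = np.fft.fft2(x)
    return np.real(np.fft.ifft2(gain * x_hat))
```

Because the gain is a closed-form function of |k| and t, frequency and propagation time are decoupled exactly as the abstract describes: low |k| (global layout) and high |k| (edges, textures) receive separately controlled oscillatory responses, rather than the uniform low-pass decay of a heat-equation filter.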

Published

2026-03-14

How to Cite

Shu, Z., Wu, J., Yan, W., Liu, X., Zhang, H., Liu, C., … Chen, J. (2026). WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25428–25436. https://doi.org/10.1609/aaai.v40i30.39737

Section

AAAI Technical Track on Machine Learning VII