Oscillation Inversion: Training-Free Image and Video Enhancement Through Oscillated Latents in Large Flow Models

Authors

  • Yan Zheng University of Texas at Austin
  • Zhenxiao Liang University of Texas at Austin
  • Xiaoyan Cong Brown University
  • Yi Yang The University of Edinburgh
  • Lanqing Guo University of Texas at Austin
  • Yuehao Wang University of Texas at Austin
  • Peihao Wang University of Texas at Austin
  • Zhangyang Wang University of Texas at Austin

DOI:

https://doi.org/10.1609/aaai.v40i16.38352

Abstract

We explore the oscillatory behavior observed in inversion methods applied to large-scale flow models, including text-to-image and text-to-video models. By employing an augmented fixed-point-inspired iterative approach to invert real-world images, we observe that the solution does not converge; instead, it oscillates between distinct clusters. Through experiments on synthetic data as well as on text-to-image and text-to-video models, we demonstrate that these oscillating clusters exhibit notable semantic coherence. We offer theoretical insights showing that this behavior arises from oscillatory dynamics in flow models. Building on this understanding, we introduce a simple and fast distribution-transfer technique that enables training-free image and video editing and enhancement. Furthermore, we provide quantitative results demonstrating the effectiveness of our method on tasks such as image enhancement, editing, and reconstruction. Notably, our approach turns image-only enhancers and editors into lightweight, video-capable tools without any additional training, highlighting its practical versatility and impact.
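The oscillation phenomenon described in the abstract can be illustrated with a toy one-dimensional sketch. The `velocity` field below is a contrived stand-in for a learned flow-model velocity field (it is not the paper's model): inverting a single Euler step by naive fixed-point iteration fails to converge when the inversion map is not a contraction, and the iterates instead alternate between two clusters.

```python
import numpy as np

def velocity(x, t):
    # Hypothetical nonlinear velocity field chosen so that the
    # inversion map has |derivative| > 1 near its fixed point,
    # which induces a period-2 oscillation.
    return 3.0 * np.tanh(2.0 * x)

def fixed_point_invert(x_next, t, dt=1.0, iters=20):
    """Attempt to solve x_next = x + dt * velocity(x, t) for x via
    the naive fixed-point update x <- x_next - dt * velocity(x, t),
    recording every iterate."""
    x = x_next
    trajectory = []
    for _ in range(iters):
        x = x_next - dt * velocity(x, t)
        trajectory.append(x)
    return np.array(trajectory)

traj = fixed_point_invert(x_next=0.5, t=0.0)
# The tail of the trajectory alternates between two well-separated
# values rather than settling on a single fixed point.
print(traj[-4:])
```

In this sketch the oscillation is a simple period-2 cycle; the paper observes the analogous high-dimensional behavior, where the oscillating clusters turn out to be semantically coherent.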

Published

2026-03-14

How to Cite

Zheng, Y., Liang, Z., Cong, X., Yang, Y., Guo, L., Wang, Y., Wang, P., & Wang, Z. (2026). Oscillation Inversion: Training-Free Image and Video Enhancement Through Oscillated Latents in Large Flow Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13476-13484. https://doi.org/10.1609/aaai.v40i16.38352

Section

AAAI Technical Track on Computer Vision XIII