Vision-MoR: Scaling Vision Transformer via Patch-Level Mixture-of-Recursions

Authors

  • Yunhong He Independent Researcher
  • Zhengqing Yuan University of Notre Dame
  • Weixiang Sun University of Notre Dame
  • Yiyang Li University of Notre Dame
  • Yixin Liu Lehigh University
  • Yanfang Ye University of Notre Dame
  • Lichao Sun Lehigh University

DOI:

https://doi.org/10.1609/aaai.v40i6.42471

Abstract

Scaling Vision Transformers (ViTs) has yielded remarkable advancements in diverse vision tasks, albeit at the cost of escalating computational, memory, and parameter demands. Existing efficiency techniques typically address only one dimension (computation, memory, or parameters), lacking a cohesive approach. In this paper, we introduce Vision-MoR, a novel ViT architecture that unifies parameter sharing, spatially adaptive computation, and memory-efficient design into a single framework. Vision-MoR employs a spatial-aware router with shifted-window attention to dynamically assign per-patch recursion depths, coupled with a recursive Transformer loop enabling token-wise early exiting. This facilitates content-adaptive processing and recursive parameter reuse while preserving spatial locality. On ImageNet-1K, Vision-MoR Small attains 74.6% Top-1 accuracy with 140M FLOPs and 5.7M parameters, outperforming EfficientViT-M2 (70.8%) and SHViT-S1 (72.8%) at superior throughput. The Vision-MoR X-Large variant achieves 80.4% Top-1 and 95.2% Top-5 accuracy using 14.3M parameters and 2044M FLOPs, surpassing ResNet-50 and EfficientNet-B1. On COCO object detection, Vision-MoR X-Large yields 39.1 AP with the lowest latency among comparable models. These results underscore Vision-MoR's state-of-the-art accuracy-efficiency trade-offs, positioning it as a scalable, deployment-friendly backbone for real-time vision applications.
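The core mechanism described in the abstract (a router assigning a per-patch recursion depth, a single shared Transformer block applied recursively, and token-wise early exiting) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the plain linear router (the paper uses a spatial-aware router with shifted-window attention), and the argmax depth assignment are all simplifications introduced here for clarity.

```python
import torch
import torch.nn as nn

class PatchMixtureOfRecursions(nn.Module):
    """Hypothetical sketch of patch-level mixture-of-recursions:
    a lightweight router predicts a recursion depth for each patch,
    and one shared Transformer block is reused at every recursion
    step (parameter sharing); patches whose assigned depth has been
    reached exit early and their tokens are frozen."""

    def __init__(self, dim: int, max_depth: int = 4, num_heads: int = 4):
        super().__init__()
        self.max_depth = max_depth
        # Single block whose weights are reused at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
            batch_first=True)
        # Simplified router: one score per candidate depth for each patch.
        # (The paper's router is spatial-aware with shifted-window attention.)
        self.router = nn.Linear(dim, max_depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        # Assign each patch a depth in {1, ..., max_depth}. argmax is
        # non-differentiable; training would need a differentiable
        # routing scheme (e.g., straight-through estimation).
        depth = self.router(x).argmax(dim=-1) + 1  # (batch, num_patches)
        out = x
        for step in range(1, self.max_depth + 1):
            updated = self.shared_block(out)
            # Patches with assigned depth >= current step keep updating;
            # the rest have exited early and keep their frozen tokens.
            active = (depth >= step).unsqueeze(-1)
            out = torch.where(active, updated, out)
        return out

if __name__ == "__main__":
    torch.manual_seed(0)
    model = PatchMixtureOfRecursions(dim=32, max_depth=3, num_heads=4).eval()
    tokens = torch.randn(2, 16, 32)  # 2 images, 16 patches, dim 32
    with torch.no_grad():
        out = model(tokens)
    print(out.shape)  # torch.Size([2, 16, 32])
```

Note that in this dense sketch every step still computes the shared block over all patches and masks the result; a real deployment would gather only the active tokens per step to realize the FLOP savings the paper reports.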

Published

2026-03-14

How to Cite

He, Y., Yuan, Z., Sun, W., Li, Y., Liu, Y., Ye, Y., & Sun, L. (2026). Vision-MoR: Scaling Vision Transformer via Patch-Level Mixture-of-Recursions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4699-4707. https://doi.org/10.1609/aaai.v40i6.42471

Section

AAAI Technical Track on Computer Vision III