Vision-MoR: Scaling Vision Transformer via Patch-Level Mixture-of-Recursions
DOI:
https://doi.org/10.1609/aaai.v40i6.42471
Abstract
Scaling Vision Transformers (ViTs) has yielded remarkable advancements in diverse vision tasks, albeit at the cost of escalating computational, memory, and parameter demands. Existing efficiency techniques typically address only one of these dimensions (computation, memory, or parameters), lacking a cohesive approach. In this paper, we introduce Vision-MoR, a novel ViT architecture that unifies parameter sharing, spatially adaptive computation, and memory-efficient design into a single framework. Vision-MoR employs a spatial-aware router with shifted-window attention to dynamically assign per-patch recursion depths, coupled with a recursive Transformer loop that enables token-wise early exiting. This facilitates content-adaptive processing and recursive parameter reuse while preserving spatial locality. On ImageNet-1K, Vision-MoR Small attains 74.6% Top-1 accuracy with 140M FLOPs and 5.7M parameters, outperforming EfficientViT-M2 (70.8%) and SHViT-S1 (72.8%) at superior throughput. The Vision-MoR X-Large variant achieves 80.4% Top-1 and 95.2% Top-5 accuracy using 14.3M parameters and 2044M FLOPs, surpassing ResNet-50 and EfficientNet-B1. On COCO object detection, Vision-MoR X-Large yields 39.1 AP with the lowest latency among comparable models. These results underscore Vision-MoR's state-of-the-art accuracy-efficiency trade-offs, positioning it as a scalable, deployment-friendly backbone for real-time vision applications.
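The abstract's core mechanism (a router assigning each patch a recursion depth, then a weight-tied Transformer block applied repeatedly with token-wise early exit) can be illustrated with a minimal toy sketch. This is not the paper's implementation: the router here is a hypothetical random-logit stand-in, the shared block is reduced to a single tanh projection, and the shifted-window attention of the actual spatial-aware router is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 6 patch tokens of dim 4, maximum recursion depth 3 (all values illustrative).
num_tokens, dim, max_depth = 6, 4, 3
W = rng.standard_normal((dim, dim)) * 0.1
tokens = rng.standard_normal((num_tokens, dim))

def shared_block(x):
    # Stand-in for one Transformer block whose weights are reused (tied) across recursions.
    return np.tanh(x @ W)

# Hypothetical router: in the paper this is a learned spatial-aware module;
# here random logits simply assign each token a depth in [1, max_depth].
router_logits = rng.standard_normal((num_tokens, max_depth))
depths = router_logits.argmax(axis=1) + 1

# Recursive loop with token-wise early exit: a token stops being updated
# once its assigned recursion depth is reached, so shallow tokens cost fewer FLOPs.
out = tokens.copy()
for step in range(1, max_depth + 1):
    active = depths >= step          # boolean mask of tokens still recursing
    out[active] = shared_block(out[active])

# Fraction of block applications skipped relative to running every token to max_depth.
compute_saved = 1 - depths.sum() / (num_tokens * max_depth)
print(depths, round(float(compute_saved), 2))
```

Parameter reuse comes from calling the same `shared_block` at every step, while adaptive computation comes from the shrinking `active` mask; the saving reported at the end is exactly the fraction of per-token block calls avoided.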
Published
2026-03-14
How to Cite
He, Y., Yuan, Z., Sun, W., Li, Y., Liu, Y., Ye, Y., & Sun, L. (2026). Vision-MoR: Scaling Vision Transformer via Patch-Level Mixture-of-Recursions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4699-4707. https://doi.org/10.1609/aaai.v40i6.42471
Issue
Section
AAAI Technical Track on Computer Vision III