MSPCaps: A Multi-Scale Patchify Capsule Network with Cross-Agreement Routing for Visual Recognition
DOI:
https://doi.org/10.1609/aaai.v40i6.42501Abstract
Capsule Network (CapsNet) has demonstrated significant potential in visual recognition by capturing spatial relationships and part-whole hierarchies for learning equivariant feature representations. However, existing CapsNet and variants often rely on a single high-level feature map, overlooking the rich complementary information provided by multi-scale features. Furthermore, conventional feature fusion strategies, such as addition and concatenation, struggle to reconcile multi-scale feature discrepancies, leading to suboptimal classification performance. To address these limitations, we propose the Multi-Scale Patchify Capsule Network (MSPCaps), a novel architecture that integrates multi-scale feature learning and efficient capsule routing. Specifically, MSPCaps consists of three key components: a Multi-Scale ResNet Backbone (MSRB), a Patchify Capsule Layer (PatchifyCaps), and a Cross-Agreement Routing (CAR) block. First, the MSRB extracts diverse multi-scale feature representations from input images, preserving both fine-grained details and global contextual information. Second, the PatchifyCaps partitions these multi-scale features into primary capsules using a uniform patch size, equipping the model with the ability to learn from diverse receptive fields. Finally, the CAR block adaptively routes the multi-scale capsules by identifying cross-scale prediction pairs with maximum agreement. Unlike the simple concatenation of multiple self-routing blocks, CAR ensures that only the most coherent capsules (best part-to-whole pairs) contribute to the final voting. Our proposed MSPCaps achieves remarkable scalability and superior robustness, consistently surpassing multiple baseline methods in terms of classification accuracy, with configurations ranging from a highly efficient Tiny model (344.3K parameters) to a powerful Large model (10.9M parameters), highlighting its potential in advancing feature representation learning.Published
2026-03-14
How to Cite
Hu, Y., Han, Y., Sun, R., & Ren, J. (2026). MSPCaps: A Multi-Scale Patchify Capsule Network with Cross-Agreement Routing for Visual Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4968–4975. https://doi.org/10.1609/aaai.v40i6.42501
Issue
Section
AAAI Technical Track on Computer Vision III