MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation

Authors

  • Zhiwei Yang: Academy for Engineering and Technology, Fudan University, Shanghai 200433, China; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
  • Yucong Meng: Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
  • Kexue Fu: Shandong Computer Science Center (National Supercomputer Center in Jinan)
  • Shuo Wang: Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
  • Zhijian Song: Academy for Engineering and Technology, Fudan University, Shanghai 200433, China; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China

DOI:

https://doi.org/10.1609/aaai.v39i9.33018

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments on PASCAL VOC and MS COCO validate that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods.
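
To make the notion of a localization map read from class-patch attention concrete, the following is a minimal illustrative sketch and not the MoRe implementation. It assumes a single attention matrix whose first rows and columns belong to class tokens (one per category) and whose remaining ones belong to patch tokens laid out on an h × w grid. The function name `class_patch_attention_maps`, the token layout, and the min-max normalization are all assumptions made for illustration. The regularization modules described in the abstract are not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def class_patch_attention_maps(attn, num_classes, grid_hw):
    """Read per-class localization maps from a token attention matrix.

    attn: (T, T) attention matrix with T = num_classes + h * w.
          Assumed layout (hypothetical): class tokens first, then patches.
    Returns an array of shape (num_classes, h, w) with values in [0, 1].
    """
    h, w = grid_hw
    attn = np.asarray(attn, dtype=np.float64)
    expected = num_classes + h * w
    if attn.shape != (expected, expected):
        raise ValueError(f"expected shape {(expected, expected)}, got {attn.shape}")

    # Rows are queries and columns are keys, so this block holds how strongly
    # each class token attends to each patch token.
    class_to_patch = attn[:num_classes, num_classes:]
    maps = class_to_patch.reshape(num_classes, h, w)

    # Min-max normalize each class map independently to [0, 1].
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / np.maximum(hi - lo, 1e-12)


# Toy usage with random attention: 3 class tokens and a 4 x 4 patch grid.
rng = np.random.default_rng(0)
num_classes, grid_hw = 3, (4, 4)
num_tokens = num_classes + grid_hw[0] * grid_hw[1]
attn = softmax(rng.normal(size=(num_tokens, num_tokens)), axis=-1)
lam = class_patch_attention_maps(attn, num_classes, grid_hw)
```

In this toy setting nothing constrains which patches a class token attends to, so high values can land on semantically unrelated patches. That is the kind of artifact the abstract argues additional regularization should suppress.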

Published

2025-04-11

How to Cite

Yang, Z., Meng, Y., Fu, K., Wang, S., & Song, Z. (2025). MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9400-9408. https://doi.org/10.1609/aaai.v39i9.33018

Section

AAAI Technical Track on Computer Vision VIII