Decoupling What to Count and Where to See for Referring Expression Counting

Yuda Zou; Zijian Zhang; Yongchao Xu

doi:10.1609/aaai.v40i16.38423

Authors

Yuda Zou, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan 430072, China
Zijian Zhang, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan 430072, China
Yongchao Xu, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan 430072, China

DOI: https://doi.org/10.1609/aaai.v40i16.38423

Abstract

Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state of the art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively.
