Decoupling What to Count and Where to See for Referring Expression Counting
DOI:
https://doi.org/10.1609/aaai.v40i16.38423
Abstract
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state of the art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively.
Published
2026-03-14
How to Cite
Zou, Y., Zhang, Z., & Xu, Y. (2026). Decoupling What to Count and Where to See for Referring Expression Counting. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 14113–14121. https://doi.org/10.1609/aaai.v40i16.38423
Issue
Vol. 40 No. 16 (2026)
Section
AAAI Technical Track on Computer Vision XIII