DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction
DOI: https://doi.org/10.1609/aaai.v40i6.42480

Abstract
Dense visual prediction tasks, including object detection and segmentation, inherently require precise and discriminative positional information to delineate object boundaries and pixel regions. Recent DETR-based frameworks advance dense prediction through iterative attention applied to content queries, with sampled proposals as position references. However, this paradigm suffers from misaligned sampling distributions and insufficient interaction between content and position features, limiting encoding effectiveness. To overcome these limitations, we investigate the encoding paradigm for content-position harmonization and propose an effective predictor for dense visual tasks, termed DAPE (DETR with hArmonized content-Position Encoding). DAPE introduces explicit position encoding to facilitate content enhancement while maintaining low memory overhead. To achieve this, DAPE comprises a Shifted Query Sampler (SQS) that enforces strict alignment between the distributions of content and position queries, and a 2D Low-Rank Position Encoder (LRPE) that progressively modulates attention maps based on the aligned representations. DAPE provides a unified solution for various dense prediction tasks. Extensive experiments on object detection, instance segmentation, and few-shot detection benchmarks demonstrate that DAPE achieves state-of-the-art performance while reducing memory consumption.
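As a rough illustration of the general idea behind a low-rank position encoder modulating attention, the sketch below adds a rank-r positional bias to standard scaled dot-product attention logits. All names, shapes, and the factorization B = U·Vt are hypothetical simplifications for exposition; the paper's actual SQS and LRPE designs are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_low_rank_bias(Q, K, V, U, Vt):
    """Scaled dot-product attention whose logits are modulated by a
    low-rank positional bias B = U @ Vt, with rank r << N.

    Q, K, V: (N, d) content features.
    U: (N, r), Vt: (r, N) position factors (hypothetical shapes).
    Storing the factors costs O(N*r) memory instead of O(N^2) for a
    full positional bias matrix.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + U @ Vt  # content score + position bias
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
N, d, r = 6, 8, 2
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
U, Vt = rng.standard_normal((N, r)), rng.standard_normal((r, N))
out = attention_with_low_rank_bias(Q, K, V, U, Vt)
print(out.shape)  # (6, 8)
```

The low-rank factorization is what keeps the memory overhead of an explicit positional term small relative to a dense N-by-N bias, which is the property the abstract emphasizes.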
Published
2026-03-14
How to Cite
Hou, X., Liu, M., Zhang, S., & Du, S. (2026). DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4780–4788. https://doi.org/10.1609/aaai.v40i6.42480
Section
AAAI Technical Track on Computer Vision III