DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction
DOI: https://doi.org/10.1609/aaai.v40i6.42480

Abstract
Dense visual prediction tasks, including object detection and segmentation, inherently require precise and discriminative positional information to delineate object boundaries and pixel regions. Recent DETR-based frameworks advance dense prediction through iterative attention applied to content queries, with sampled proposals as position references. However, this paradigm suffers from misaligned sampling distributions and insufficient interaction between content and position features, limiting encoding effectiveness. To overcome these limitations, we investigate the encoding paradigm for content-position harmonization and propose an effective predictor for dense visual tasks, termed DAPE (DETR with hArmonized content-Position Encoding). DAPE introduces explicit position encoding to facilitate content enhancement while maintaining low memory overhead. To achieve this, DAPE comprises a Shifted Query Sampler (SQS) that enforces strict alignment between the distributions of content and position queries, and a 2D Low-Rank Position Encoder (LRPE) that progressively modulates attention maps based on the aligned representations. DAPE provides a unified solution for various dense prediction tasks. Extensive experiments on object detection, instance segmentation, and few-shot detection benchmarks demonstrate that DAPE achieves state-of-the-art performance while reducing memory consumption.
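As a rough illustration of the general idea behind a low-rank position encoder modulating attention, the sketch below adds a rank-r positional bias to standard scaled dot-product attention logits. All names, shapes, and the factorization B = U·Vt are hypothetical simplifications for exposition; the paper's actual SQS and LRPE designs are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_low_rank_bias(Q, K, V, U, Vt):
    """Scaled dot-product attention whose logits are modulated by a
    low-rank positional bias B = U @ Vt, with rank r << N.

    Q, K, V: (N, d) content features.
    U: (N, r), Vt: (r, N) position factors (hypothetical shapes).
    Storing the factors costs O(N*r) memory instead of O(N^2) for a
    full positional bias matrix.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + U @ Vt  # content score + position bias
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
N, d, r = 6, 8, 2
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
U, Vt = rng.standard_normal((N, r)), rng.standard_normal((r, N))
out = attention_with_low_rank_bias(Q, K, V, U, Vt)
print(out.shape)  # (6, 8)
```

The low-rank factorization is what keeps the memory overhead of an explicit positional term small relative to a dense N-by-N bias, which is the property the abstract emphasizes.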
Published
2026-03-14
How to Cite
Hou, X., Liu, M., Zhang, S., & Du, S. (2026). DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4780–4788. https://doi.org/10.1609/aaai.v40i6.42480
Section
AAAI Technical Track on Computer Vision III