3D-DRES: Detailed 3D Referring Expression Segmentation

Authors

  • Qi Chen, Xiamen University
  • Changli Wu, Xiamen University; Shanghai Innovation Institute
  • Jiayi Ji, Xiamen University; National University of Singapore
  • Yiwei Ma, Xiamen University
  • Liujuan Cao, Xiamen University

DOI:

https://doi.org/10.1609/aaai.v40i4.37288

Abstract

Current 3D visual grounding tasks handle only sentence-level detection or segmentation, critically failing to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides phrase-to-3D-instance mappings, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 55,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm in which each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both the sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
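To make the phrase-instance annotation paradigm concrete, the Python sketch below shows one plausible shape such an annotation record could take: a full description paired with its sentence-level target, plus a list of noun phrases, each carrying its character span and the IDs of the 3D instances it refers to. All field names, the example scene ID, and the example sentence are hypothetical illustrations, not the actual DetailRefer schema.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhraseAnnotation:
    # One noun phrase and the 3D instance(s) it is grounded to.
    phrase: str                      # surface text of the noun phrase
    char_span: Tuple[int, int]       # (start, end) offsets within the description
    instance_ids: List[int]          # IDs of the referenced 3D instances in the scene


@dataclass
class DetailReferSample:
    # One scene description, annotated at both the sentence and phrase level.
    scene_id: str                    # scene identifier (hypothetical value below)
    description: str                 # full referring expression
    target_instance_id: int          # sentence-level target, as in standard 3D-RES
    phrases: List[PhraseAnnotation] = field(default_factory=list)


# Hypothetical example record; the sentence, scene ID, and instance IDs are made up.
sample = DetailReferSample(
    scene_id="scene0000_00",
    description="the chair next to the wooden table under the window",
    target_instance_id=12,
    phrases=[
        PhraseAnnotation("the chair", (0, 9), [12]),
        PhraseAnnotation("the wooden table", (18, 34), [7]),
        PhraseAnnotation("the window", (41, 51), [3]),
    ],
)

for p in sample.phrases:
    print(f"{p.phrase!r} -> instances {p.instance_ids}")

Under this assumed layout, sentence-level 3D-RES is recovered by using only target_instance_id, while phrase-level segmentation supervises each entry in phrases separately.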

Published

2026-03-14

How to Cite

Chen, Q., Wu, C., Ji, J., Ma, Y., & Cao, L. (2026). 3D-DRES: Detailed 3D Referring Expression Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2966-2974. https://doi.org/10.1609/aaai.v40i4.37288

Section

AAAI Technical Track on Computer Vision I