DiffusionREC: Diffusion Model with Adaptive Condition for Referring Expression Comprehension

Authors

  • Jingcheng Ke, Guangdong University of Technology, Guangzhou, China
  • Waikeung Wong, School of Fashion and Textiles, The Hong Kong Polytechnic University, Hong Kong
  • Jia Wang, Guangdong Pharmaceutical University, Guangzhou, China
  • Mu Li, Harbin Institute of Technology, Shenzhen, China
  • Lunke Fei, Guangdong University of Technology, Guangzhou, China
  • Jie Wen, Harbin Institute of Technology, Shenzhen, China

DOI

https://doi.org/10.1609/aaai.v39i4.32443

Abstract

The objective of referring expression comprehension (REC) is to identify the object in an image that a given expression describes. Existing REC methods, including transformer-based and graph-based approaches, have shown robust performance on REC tasks. In this study, we present DiffusionREC, a new framework for the REC task. The framework recasts REC as a text-guided denoising diffusion process over bounding boxes, in which noisy boxes are progressively refined and filtered until the target box is pinpointed. During training, the bounding box of the target object is diffused from its ground-truth position towards a random distribution. A filtering-based object decoder is trained to reverse this diffusion, conditioned on the given expression, the result of the previous denoising step, and the interaction between the expression and the image. At inference, we begin by randomly generating a set of boxes. The filtering-based object decoder is then applied iteratively to refine and prune these boxes under the same three conditions: the given expression, the result of the previous denoising step, and the interaction between the expression and the image. Extensive experiments on six datasets demonstrate that DiffusionREC outperforms previous REC methods.
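
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine noise schedule, the signal scale of 2.0, the (cx, cy, w, h) box format, the halving prune ratio, and every function name below are assumptions made for illustration. In particular, toy_decoder is a hypothetical stand-in that is handed the target box directly, whereas the filtering-based object decoder in the paper is conditioned on the expression, the previous denoising result, and the expression-image interaction.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar[t], t = 0..num_steps (cosine schedule, assumed)."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def diffuse_box(gt_box, t, alpha_bar, rng, scale=2.0):
    """Forward process (training): noise a ground-truth box to step t.

    gt_box is (cx, cy, w, h) normalised to [0, 1]; scale is an assumed signal scale.
    """
    x0 = (np.asarray(gt_box, dtype=float) * 2.0 - 1.0) * scale  # map to [-scale, scale]
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    xt = np.clip(xt, -scale, scale)
    return (xt / scale + 1.0) / 2.0  # map back to [0, 1]


def toy_decoder(boxes, target):
    """Hypothetical oracle decoder: moves each box halfway to the target and scores it.

    A real decoder would predict boxes and scores from text and image features.
    """
    refined = boxes + 0.5 * (target - boxes)
    scores = -np.abs(refined - target).sum(axis=1)  # higher score = closer to target
    return refined, scores


def infer(target, num_boxes=16, num_steps=4, keep_ratio=0.5, seed=0):
    """Reverse process (inference): start from random boxes, refine and prune iteratively."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    boxes = rng.uniform(0.0, 1.0, size=(num_boxes, 4))  # random initial boxes
    for _ in range(num_steps):
        boxes, scores = toy_decoder(boxes, target)       # refine
        keep = max(1, int(len(boxes) * keep_ratio))      # prune low-scoring boxes
        boxes = boxes[np.argsort(scores)[::-1][:keep]]
    return boxes[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = cosine_alpha_bar(1000)
    gt = [0.5, 0.4, 0.2, 0.3]
    print("noised box at t=500:", diffuse_box(gt, 500, alpha_bar, rng))
    print("box recovered by the toy loop:", infer(gt))
```

The pruning step keeps a fixed fraction of the highest-scoring boxes per iteration, which is only one possible reading of "refine and prune"; the paper may use a different filtering rule.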

Published

2025-04-11

How to Cite

Ke, J., Wong, W., Wang, J., Li, M., Fei, L., & Wen, J. (2025). DiffusionREC: Diffusion Model with Adaptive Condition for Referring Expression Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 39(4), 4221-4229. https://doi.org/10.1609/aaai.v39i4.32443

Section

AAAI Technical Track on Computer Vision III