From Pixels to Logic: A Perception-Reasoning Decomposition Framework for Open-World Referring Expression Comprehension

Authors

  • Lihong Huang (Shenzhen University, College of Computer Science and Software Engineering, Shenzhen, 518060, Guangdong, China)
  • Sheng-hua Zhong (Shenzhen University, College of Computer Science and Software Engineering, Shenzhen, 518060, Guangdong, China)
  • Zhi Zhang (The Hong Kong Polytechnic University, Department of Computing, Hong Kong, 999077, China)
  • Yan Liu (The Hong Kong Polytechnic University, Department of Computing, Hong Kong, 999077, China)

DOI:

https://doi.org/10.1609/aaai.v40i7.37419

Abstract

Recent advances in Referring Expression Comprehension (REC) have been largely driven by supervised learning on curated datasets, where each expression is assumed to refer to exactly one known object. However, such assumptions rarely hold in real-world scenarios, where expressions can refer to multiple objects, fail to refer to any, or involve novel categories and complex semantics. These challenges define the task of open-world REC, which demands robust generalization and structured reasoning beyond the scope of traditional REC methods. In this work, we introduce a novel, training-free framework that decouples visual perception from linguistic reasoning to address open-world REC. Our method first transforms the visual scene into a rich textual representation using an open-vocabulary multimodal perception module. It then employs a reasoning language model to interpret the referring expression and perform explicit logical inference over the perceived scene, enabling transparent decision-making and strong generalization in open-world scenarios. Experiments on three standard REC benchmarks as well as two more challenging ones, gRefCOCO and D³, demonstrate that our framework achieves highly competitive zero-shot performance, often surpassing supervised baselines.
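
The two-stage idea in the abstract (perceive the scene as text, then reason over that text) can be illustrated with a minimal sketch. This is not the authors' implementation: the `PerceivedObject` record, the prompt wording, the `ANSWER: [...]` reply convention, and the `reasoner` callable are all hypothetical stand-ins chosen for illustration. The perception module and the reasoning language model are assumed to exist elsewhere and are represented here only by their inputs and outputs.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PerceivedObject:
    """One object reported by a (hypothetical) open-vocabulary perception module."""
    obj_id: int
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    description: str


def scene_to_text(objects: List[PerceivedObject]) -> str:
    """Serialize the perceived scene as one text line per object."""
    return "\n".join(
        f"[{o.obj_id}] {o.label} at {o.box}: {o.description}" for o in objects
    )


def build_prompt(scene_text: str, expression: str) -> str:
    """Assemble a prompt for the reasoning model (wording is an assumption)."""
    return (
        "Objects in the scene:\n"
        f"{scene_text}\n\n"
        f"Referring expression: {expression}\n"
        "Reason step by step, then finish with a line of the form\n"
        "ANSWER: [ids]\n"
        "listing every matching object id. Use ANSWER: [] if nothing matches."
    )


def parse_selected_ids(reply: str, valid_ids: List[int]) -> List[int]:
    """Extract the id list from the reply; unknown ids are dropped."""
    match = re.search(r"ANSWER:\s*(\[[\d,\s]*\])", reply)
    if match is None:
        return []
    try:
        ids = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    return [i for i in ids if i in valid_ids]


def comprehend(
    objects: List[PerceivedObject],
    expression: str,
    reasoner: Callable[[str], str],
) -> List[PerceivedObject]:
    """Return zero, one, or many objects that the expression refers to."""
    prompt = build_prompt(scene_to_text(objects), expression)
    reply = reasoner(prompt)
    selected = parse_selected_ids(reply, [o.obj_id for o in objects])
    return [o for o in objects if o.obj_id in selected]


if __name__ == "__main__":
    scene = [
        PerceivedObject(1, "dog", (10, 40, 120, 200), "brown dog lying on grass"),
        PerceivedObject(2, "person", (150, 20, 260, 300), "woman in a red coat"),
        PerceivedObject(3, "dog", (300, 60, 400, 210), "white dog standing"),
    ]

    # Stand-in for a real reasoning language model call.
    def fake_reasoner(prompt: str) -> str:
        return "Objects 1 and 3 are dogs.\nANSWER: [1, 3]"

    for obj in comprehend(scene, "all the dogs", fake_reasoner):
        print(obj.obj_id, obj.label, obj.box)
```

Because the result is a list, the same interface covers the open-world cases named in the abstract: an expression that matches several objects yields several entries, and an expression that matches nothing yields an empty list. Keeping the scene as plain text also leaves the reasoning model's reply available for inspection.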

Published

2026-03-14

How to Cite

Huang, L., Zhong, S.-hua, Zhang, Z., & Liu, Y. (2026). From Pixels to Logic: A Perception-Reasoning Decomposition Framework for Open-World Referring Expression Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5058–5066. https://doi.org/10.1609/aaai.v40i7.37419

Issue

Vol. 40 No. 7 (2026)

Section

AAAI Technical Track on Computer Vision IV