From Scene to Object: Enhancing Open-Vocabulary Object Detection via Foreground-Background Context Reasoning

Authors

  • Yanqi Li (Beihang University; Zhongguancun Laboratory)
  • Jianwei Niu (Beihang University; Zhongguancun Laboratory; Hangzhou Innovation Institute of Beihang University)
  • Ningbo Gu (Beihang University; Hangzhou Innovation Institute of Beihang University)
  • Tao Ren (University of the Chinese Academy of Sciences)

DOI:

https://doi.org/10.1609/aaai.v40i8.37590

Abstract

Open-Vocabulary Object Detection (OVOD) aims to detect both known and novel categories in complex visual scenes, surpassing the limitations of conventional closed-set detectors. Recent advances in vision-language models (VLMs) like CLIP have enabled zero-shot recognition by aligning visual features with large-scale textual embeddings. However, current OVOD approaches often fall short by overlooking critical contextual and semantic cues necessary for discovering a broader range of novel objects. To address this, we propose BFDet, a scene-to-object reasoning framework that leverages the complementary strengths of Large Language Models (LLMs) and VLMs. BFDet introduces a novel scene-to-object reasoning mechanism grounded in foreground-background context interaction. It first uses high-confidence objects to infer the scene-level background. This scene background then guides the discovery of foreground objects by prompting an LLM to generate scene-sensitive novel object candidates. These candidates are subsequently verified through cross-modal alignment and used as high-quality pseudo-labels to enrich detector training. Designed as a plug-and-play module, BFDet integrates seamlessly into existing detection pipelines and consistently improves performance on novel categories across COCO and LVIS benchmarks.
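
The four steps named in the abstract (infer the scene from high-confidence objects, ask for scene-sensitive candidates, verify them by cross-modal alignment, keep the survivors as pseudo-labels) can be illustrated with a toy sketch. This is not the authors' implementation: the lookup tables stand in for the LLM, the hand-written vectors stand in for VLM embeddings, and every function name and threshold (`infer_scene`, `propose_candidates`, `verify`, `min_score`, `min_sim`) is a hypothetical choice made only for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-in for scene inference: object label -> plausible scene.
SCENE_HINTS = {
    "oven": "kitchen", "sink": "kitchen", "refrigerator": "kitchen",
    "bed": "bedroom", "car": "street", "traffic light": "street",
}

# Hypothetical stand-in for an LLM prompted with the scene description.
SCENE_CANDIDATES = {
    "kitchen": ["kettle", "cutting board", "toaster", "oven"],
    "street": ["scooter", "mailbox"],
    "bedroom": ["pillow", "lamp"],
}


def infer_scene(detections, min_score=0.8):
    """Vote for a scene using only high-confidence (label, score) detections."""
    votes = Counter(
        SCENE_HINTS[label]
        for label, score in detections
        if score >= min_score and label in SCENE_HINTS
    )
    return votes.most_common(1)[0][0] if votes else None


def propose_candidates(scene, known_categories):
    """Return scene-specific candidate labels that are not already known."""
    return [c for c in SCENE_CANDIDATES.get(scene, []) if c not in known_categories]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(region_embeddings, text_embeddings, min_sim=0.9):
    """Keep (region_id, label) pairs whose best text match clears min_sim."""
    pseudo_labels = []
    for region_id, region_vec in region_embeddings.items():
        best_label, best_sim = None, -1.0
        for label, text_vec in text_embeddings.items():
            sim = cosine(region_vec, text_vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None and best_sim >= min_sim:
            pseudo_labels.append((region_id, best_label))
    return pseudo_labels


# Toy run: two confident kitchen objects, one low-confidence distractor.
detections = [("oven", 0.95), ("sink", 0.90), ("car", 0.30)]
known = {"oven", "sink", "car"}
scene = infer_scene(detections)
candidates = propose_candidates(scene, known)

# Made-up 3-d embeddings; a real system would take these from a VLM.
text_embeddings = dict(zip(candidates, ([1, 0, 0], [0, 1, 0], [0, 0, 1])))
region_embeddings = {"r1": [0.9, 0.1, 0.0], "r2": [0.5, 0.5, 0.5]}
pseudo_labels = verify(region_embeddings, text_embeddings)
```

In this toy run, region `r1` aligns closely with one candidate and is kept, while the ambiguous region `r2` falls below the similarity threshold and is discarded. That is the role the abstract assigns to cross-modal verification: filtering candidates before they are used as pseudo-labels.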

Published

2026-03-14

How to Cite

Li, Y., Niu, J., Gu, N., & Ren, T. (2026). From Scene to Object: Enhancing Open-Vocabulary Object Detection via Foreground-Background Context Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6601–6609. https://doi.org/10.1609/aaai.v40i8.37590

Section

AAAI Technical Track on Computer Vision V