From Scene to Object: Enhancing Open-Vocabulary Object Detection via Foreground-Background Context Reasoning
DOI:
https://doi.org/10.1609/aaai.v40i8.37590
Abstract
Open-Vocabulary Object Detection (OVOD) aims to detect both known and novel categories in complex visual scenes, surpassing the limitations of conventional closed-set detectors. Recent advances in vision-language models (VLMs) like CLIP have enabled zero-shot recognition by aligning visual features with large-scale textual embeddings. However, current OVOD approaches often fall short by overlooking critical contextual and semantic cues necessary for discovering a broader range of novel objects. To address this, we propose BFDet, a scene-to-object reasoning framework that leverages the complementary strengths of Large Language Models (LLMs) and VLMs. BFDet introduces a novel scene-to-object reasoning mechanism grounded in foreground-background context interaction. It first uses high-confidence objects to infer the scene-level background. This scene background then guides the discovery of foreground objects by prompting an LLM to generate scene-sensitive novel object candidates. These candidates are subsequently verified through cross-modal alignment and used as high-quality pseudo-labels to enrich detector training. Designed as a plug-and-play module, BFDet integrates seamlessly into existing detection pipelines and consistently improves performance on novel categories across COCO and LVIS benchmarks.
Published
2026-03-14
How to Cite
Li, Y., Niu, J., Gu, N., & Ren, T. (2026). From Scene to Object: Enhancing Open-Vocabulary Object Detection via Foreground-Background Context Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6601–6609. https://doi.org/10.1609/aaai.v40i8.37590
Issue
Section
AAAI Technical Track on Computer Vision V
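The scene-to-object pipeline described in the abstract can be sketched at a high level. This is a minimal toy illustration, not the authors' implementation: the scene lookup table, the LLM stub, the embedding vectors, and all function names (`infer_scene`, `propose_candidates`, `verify_candidates`) are hypothetical stand-ins for the paper's scene inference, LLM prompting, and VLM-based cross-modal verification steps.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (stand-in for CLIP alignment)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def infer_scene(detections, conf_thresh=0.7):
    """Keep high-confidence detections and map them to a coarse scene label.

    Toy lookup table; the paper infers the scene-level background from the
    detector's confident foreground objects.
    """
    anchors = [d["label"] for d in detections if d["score"] >= conf_thresh]
    scene_table = {frozenset({"surfboard", "umbrella"}): "beach"}  # hypothetical
    return scene_table.get(frozenset(anchors), "unknown"), anchors


def propose_candidates(scene):
    """Stand-in for prompting an LLM with the inferred scene background."""
    llm_stub = {"beach": ["sandcastle", "seagull", "towel"]}  # hypothetical
    return llm_stub.get(scene, [])


def verify_candidates(candidates, region_emb, text_embs, sim_thresh=0.5):
    """Keep candidates whose text embedding aligns with a region feature.

    Verified candidates would then serve as pseudo-labels for training.
    """
    return [c for c in candidates
            if c in text_embs and cosine(region_emb, text_embs[c]) >= sim_thresh]
```

For example, detections of a surfboard and an umbrella would map to a "beach" scene, the LLM stub would propose beach-specific candidates, and only candidates whose (toy) text embedding aligns with the region feature would survive verification as pseudo-labels.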