Tensor Decomposition and Language Description for Open-Vocabulary Object Detection

Qiuyu Liang; Yongqiang Zhang

doi:10.1609/aaai.v40i9.37618

Authors

Qiuyu Liang, College of Computer Science, Inner Mongolia University, Hohhot, China
Yongqiang Zhang, College of Computer Science, Inner Mongolia University, Hohhot, China

Abstract

Open-vocabulary object detection (OVOD) aims to detect and recognize objects beyond a fixed set of classes. Although region-word alignment and knowledge distillation have been explored for training strong open-vocabulary detectors, our analysis reveals three main issues that limit OVOD performance: inaccurate alignment, redundant distillation, and low-quality class embeddings. In this paper, we propose TLDet, which applies Tensor decomposition and Language descriptions to open-vocabulary object Detection. Proposals with the highest similarity score often correspond to discriminative but incomplete regions (e.g., object heads), resulting in inaccurate region-word alignment. To mitigate this issue, we propose a low-rank proposal filtering module that quantitatively assesses the completeness of each proposal by performing singular value decomposition and computing the sum of its singular values. This allows the model to suppress discriminative but incomplete proposals and improves the precision of alignment between visual regions and textual concepts. Furthermore, to mitigate redundant knowledge transfer, we introduce a core tensor distillation approach that decomposes teacher and student features into core tensors via Tucker decomposition and performs distillation through optimized tensor alignment. This ensures that the student acquires the most essential knowledge from the teacher. Finally, to improve the quality of class embeddings, we propose a language description enhancement method that uses the knowledge of a large language model (LLM) to enrich category representations during inference. Extensive experiments on popular datasets demonstrate the superior performance of TLDet, which achieves 36.1% mAP on COCO and 30.1% mask mAP on LVIS and outperforms existing methods on novel categories.
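
The two tensor operations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the (C, H, W) feature layout, the use of truncated higher-order SVD as the Tucker decomposition, and the plain mean-squared error between core tensors are all assumptions made for illustration. The abstract does not specify how the singular-value sum is thresholded or how the "optimized tensor alignment" is carried out, so neither is attempted here.

```python
import numpy as np


def nuclear_norm_score(region_feat):
    """Sum of singular values of a region feature.

    Assumption: the feature has shape (C, H, W) (or is already 2-D) and is
    flattened to a (C, H*W) matrix before the SVD.
    """
    mat = region_feat.reshape(region_feat.shape[0], -1)
    singular_values = np.linalg.svd(mat, compute_uv=False)
    return float(singular_values.sum())


def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def tucker_core(tensor, ranks):
    """Core tensor from a truncated higher-order SVD (one way to compute a
    Tucker decomposition; the paper's exact procedure is not given here)."""
    core = tensor
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the mode-n unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        u = u[:, :rank]
        # Project the current core onto those vectors along this mode.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core


def core_distill_loss(teacher_feat, student_feat, ranks):
    """Hypothetical distillation loss: MSE between the two core tensors.

    Caveat: singular vectors are only defined up to sign, so a practical
    loss would need an alignment step that this sketch omits.
    """
    diff = tucker_core(teacher_feat, ranks) - tucker_core(student_feat, ranks)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    proposal = rng.standard_normal((8, 7, 7))   # hypothetical RoI feature
    teacher = rng.standard_normal((8, 7, 7))
    student = rng.standard_normal((8, 7, 7))
    print("nuclear norm score:", nuclear_norm_score(proposal))
    print("core shape:", tucker_core(teacher, (4, 3, 3)).shape)
    print("core MSE:", core_distill_loss(teacher, student, (4, 3, 3)))
```

Under these assumptions, a feature matrix whose energy is spread over many directions receives a larger singular-value sum than a rank-one matrix of the same Frobenius norm, which is the sense in which the score could serve as a completeness proxy. Comparing only the small core tensors, rather than full feature maps, reflects the stated goal of transferring the most essential knowledge.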
