ObjecTok: Learning Holistic and Robust Object Tokens for MLLMs

Authors

  • Sihan Wang State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences
  • Xiyao Liu State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
  • Lianqing Liu State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
  • Zhi Han State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i12.37975

Abstract

Mainstream multimodal large language models (MLLMs) rely on patch-based tokenization, which fragments objects across tokens; this compromises object integrity, limits the model's perception capabilities, and triggers object-related hallucinations. To address this issue, we propose ObjecTok, an object tokenization framework that generates a single, holistic token for each object in an image. This token is produced by a specially trained object encoder that embeds the object's semantic, positional, and shape information into one compact representation, thereby preserving the object's integrity. To mitigate the imperfections of upstream object proposer models, we introduce learnable confidence embeddings, which let the MLLM learn how reliable each object's information is and significantly enhance the model's robustness. Additionally, ObjecTok employs a hybrid input strategy that combines object tokens with traditional image patch tokens, allowing the model to leverage both object-level information and global scene context. Integrating ObjecTok into the LLaVA architecture yields notable performance improvements on multiple object-centric benchmarks, effectively reducing object hallucinations and enhancing perception capabilities. Experimental results demonstrate that the object tokens generated by our ObjecTok framework hold great potential for building more powerful and reliable MLLMs.
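The three ideas in the abstract (holistic object tokens, learnable confidence embeddings keyed to proposer reliability, and a hybrid patch-plus-object input sequence) can be illustrated with a toy numpy sketch. Everything here is a hypothetical stand-in: the projection matrices, the confidence-bucket scheme, and all function names are illustrative, not the paper's actual implementation (the shape-feature term is also omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8             # toy token dimension (real MLLMs use thousands)
SEM, BOX = 16, 4  # semantic-feature and bounding-box sizes
N_BINS = 4        # number of confidence buckets (illustrative choice)

# Random stand-ins for learned projections and the confidence table.
W_sem = rng.standard_normal((D, SEM))
W_box = rng.standard_normal((D, BOX))
conf_table = rng.standard_normal((N_BINS, D))  # "learnable" confidence embeddings

def object_token(sem_feat, box, conf):
    """Fuse one object's semantics, position, and proposer confidence
    into a single holistic token."""
    bucket = min(int(conf * N_BINS), N_BINS - 1)  # bucketize proposer score
    return W_sem @ sem_feat + W_box @ box + conf_table[bucket]

def hybrid_input(patch_tokens, objects):
    """Concatenate global patch tokens with one token per detected object."""
    obj_tokens = np.stack([object_token(s, b, c) for s, b, c in objects])
    return np.concatenate([patch_tokens, obj_tokens], axis=0)

patches = rng.standard_normal((9, D))  # e.g. a 3x3 grid of patch tokens
objs = [(rng.standard_normal(SEM), np.array([0.1, 0.2, 0.5, 0.6]), 0.9),
        (rng.standard_normal(SEM), np.array([0.4, 0.4, 0.8, 0.9]), 0.3)]
seq = hybrid_input(patches, objs)
print(seq.shape)  # (11, 8): 9 patch tokens followed by 2 object tokens
```

The key point the sketch captures is that each object contributes exactly one token regardless of its pixel extent, and that a low-confidence proposal carries a distinct embedding the downstream LLM can learn to discount.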

Published

2026-03-14

How to Cite

Wang, S., Liu, X., Liu, L., & Han, Z. (2026). ObjecTok: Learning Holistic and Robust Object Tokens for MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10083–10091. https://doi.org/10.1609/aaai.v40i12.37975

Section

AAAI Technical Track on Computer Vision IX