ObjecTok: Learning Holistic and Robust Object Tokens for MLLMs

Authors

  • Sihan Wang State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences
  • Xiyao Liu State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
  • Lianqing Liu State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences
  • Zhi Han State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i12.37975

Abstract

Mainstream multimodal large language models (MLLMs) rely on patch-based tokenization, which fragments objects across tokens; this compromises object integrity, limits the model's perception capabilities, and triggers object-related hallucinations. To address this issue, we propose ObjecTok, an object tokenization framework that generates a single, holistic token for each object in an image. This token is produced by a specially trained object encoder that embeds the object's semantic, positional, and shape information into one compact representation, thereby preserving the object's integrity. To mitigate the imperfections of upstream object proposer models, we introduce learnable confidence embeddings, which let the MLLM learn how reliable each object's information is and significantly enhance the model's robustness. Additionally, ObjecTok employs a hybrid input strategy that combines object tokens with traditional image patch tokens, allowing the model to leverage both object-level information and global scene context. Integrating ObjecTok into the LLaVA architecture yields notable performance improvements on multiple object-centric benchmarks, effectively reducing object hallucinations and enhancing perception capabilities. Experimental results demonstrate that the object tokens generated by our ObjecTok framework hold great potential for building more powerful and reliable MLLMs.
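The three ideas in the abstract (holistic object tokens, learnable confidence embeddings keyed to proposer reliability, and a hybrid patch-plus-object input sequence) can be illustrated with a toy numpy sketch. Everything here is a hypothetical stand-in: the projection matrices, the confidence-bucket scheme, and all function names are illustrative, not the paper's actual implementation (the shape-feature term is also omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8             # toy token dimension (real MLLMs use thousands)
SEM, BOX = 16, 4  # semantic-feature and bounding-box sizes
N_BINS = 4        # number of confidence buckets (illustrative choice)

# Random stand-ins for learned projections and the confidence table.
W_sem = rng.standard_normal((D, SEM))
W_box = rng.standard_normal((D, BOX))
conf_table = rng.standard_normal((N_BINS, D))  # "learnable" confidence embeddings

def object_token(sem_feat, box, conf):
    """Fuse one object's semantics, position, and proposer confidence
    into a single holistic token."""
    bucket = min(int(conf * N_BINS), N_BINS - 1)  # bucketize proposer score
    return W_sem @ sem_feat + W_box @ box + conf_table[bucket]

def hybrid_input(patch_tokens, objects):
    """Concatenate global patch tokens with one token per detected object."""
    obj_tokens = np.stack([object_token(s, b, c) for s, b, c in objects])
    return np.concatenate([patch_tokens, obj_tokens], axis=0)

patches = rng.standard_normal((9, D))  # e.g. a 3x3 grid of patch tokens
objs = [(rng.standard_normal(SEM), np.array([0.1, 0.2, 0.5, 0.6]), 0.9),
        (rng.standard_normal(SEM), np.array([0.4, 0.4, 0.8, 0.9]), 0.3)]
seq = hybrid_input(patches, objs)
print(seq.shape)  # (11, 8): 9 patch tokens followed by 2 object tokens
```

The key point the sketch captures is that each object contributes exactly one token regardless of its pixel extent, and that a low-confidence proposal carries a distinct embedding the downstream LLM can learn to discount.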

Published

2026-03-14

How to Cite

Wang, S., Liu, X., Liu, L., & Han, Z. (2026). ObjecTok: Learning Holistic and Robust Object Tokens for MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10083–10091. https://doi.org/10.1609/aaai.v40i12.37975

Section

AAAI Technical Track on Computer Vision IX