ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

Yin Xie; Kaicheng Yang; Peirou Liang; Xiang An; Yongle Zhao; Yumeng Wang; Ziyong Feng; Roy Miles; Ismail Elezi; Jiankang Deng

doi:10.1609/aaai.v40i32.39924

Authors

Yin Xie, DeepGlint
Kaicheng Yang, DeepGlint
Peirou Liang, University of Science and Technology of China
Xiang An, DeepGlint
Yongle Zhao, DeepGlint
Yumeng Wang, DeepGlint
Ziyong Feng, DeepGlint
Roy Miles, Huawei Technologies Ltd.
Ismail Elezi, Huawei Technologies Ltd.
Jiankang Deng, Imperial College London

DOI: https://doi.org/10.1609/aaai.v40i32.39924

Abstract

Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework that adds a visual comprehension stage for LMMs. ViCToR employs a learnable visual token pool and uses the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR learns tokens that retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED^I, and RealWorldQA benchmarks, respectively.
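
The token selection step described in the abstract can be illustrated with a small sketch. The abstract states only that Hungarian matching selects tokens from a learnable pool for visual token replacement; everything else here is an assumption made for illustration: the cosine-distance cost, the choice of which positions are replaced, and the names cosine_cost, hungarian, and replace_tokens. The matching routine is a generic textbook implementation written in plain Python so the example is self-contained, and it is not the authors' code.

```python
import math

def cosine_cost(a, b):
    # Assumed cost: 1 - cosine similarity (the abstract does not specify the cost).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def hungarian(cost):
    # Minimum-cost one-to-one assignment of rows to columns (rows <= columns).
    # Returns a list where entry i is the column assigned to row i.
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row (1-based) currently matched to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

def replace_tokens(tokens, pool, positions):
    # Hypothetical helper: swap the tokens at `positions` for their matched pool tokens.
    cost = [[cosine_cost(tokens[i], q) for q in pool] for i in positions]
    match = hungarian(cost)
    out = [list(t) for t in tokens]
    for row, i in enumerate(positions):
        out[i] = list(pool[match[row]])
    return out, match

tokens = [[1.0, 0.0], [0.0, 1.0]]            # toy visual tokens
pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # toy stand-in for the learnable pool
replaced, match = replace_tokens(tokens, pool, positions=[0, 1])
print(match)
```

Because the assignment is one-to-one, each pool token is used at most once, so the pool must hold at least as many entries as there are positions to replace. For real workloads, scipy.optimize.linear_sum_assignment solves the same assignment problem.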
