LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid

Yipeng Zhang; Yifan Liu; Zonghao Guo; Yidan Zhang; Xuesong Yang; Xiaoying Zhang; Chi Chen; Jun Song; Yuan Yao; Tat-Seng Chua; Maosong Sun

doi:10.1609/aaai.v40i15.38292

Authors

Yipeng Zhang Tsinghua University
Yifan Liu Tsinghua University
Zonghao Guo Tsinghua University
Yidan Zhang University of the Chinese Academy of Sciences
Xuesong Yang University of the Chinese Academy of Sciences
Xiaoying Zhang The Chinese University of Hong Kong
Chi Chen Tsinghua University
Jun Song Alibaba Group
Yuan Yao Shanghai Qi Zhi Institute; National University of Singapore
Tat-Seng Chua National University of Singapore
Maosong Sun Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v40i15.38292

Abstract

Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks that require fine-grained visual perception. We attribute this to the inherent limitations of ViTs in capturing diverse levels of visual semantics. To address this, we present the Hierarchical window (Hiwin) transformer, a plug-and-play solution for MLLMs centered on our inverse semantic pyramid (ISP). The Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level, language-aligned semantic features, thereby constructing an ISP, and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Notably, our design achieves an average boost of 3.7% across 14 benchmarks compared with the baseline method, including 9.3% on DocVQA.
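The two-stage idea in the abstract (build a pyramid from coarse semantics to fine detail, then condense it with windows that span all scales) can be illustrated with a toy sketch. This is not the authors' implementation: nearest-neighbour upsampling with an additive detail map stands in for the learned detail injection module, plain average pooling stands in for window attention, and every function name, the `weight` parameter, and the scalar-valued square grids are assumptions made purely for illustration.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D list of floats."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def inject_details(semantic, detail, weight=0.5):
    """Upsample a semantic map and add a weighted detail map of the same size.

    Assumption: simple addition replaces the learned injection module.
    """
    up = upsample2x(semantic)
    if len(up) != len(detail) or len(up[0]) != len(detail[0]):
        raise ValueError("detail map must be twice the size of the semantic map")
    return [[s + weight * d for s, d in zip(srow, drow)]
            for srow, drow in zip(up, detail)]


def build_pyramid(semantic, details):
    """Return levels ordered coarse to fine; each finer level extends the previous one."""
    levels = [semantic]
    for detail in details:
        levels.append(inject_details(levels[-1], detail))
    return levels


def window_mean(grid, r0, r1, c0, c1):
    """Average the cells of grid inside rows [r0, r1) and columns [c0, c1)."""
    cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(cells) / len(cells)


def condense(levels, out_size):
    """Pool the spatially matching window on every level into one token per output cell.

    Assumption: square grids whose side is a multiple of out_size, and
    average pooling in place of attention.
    """
    tokens = []
    for i in range(out_size):
        for j in range(out_size):
            token = []
            for grid in levels:
                step = len(grid) // out_size
                token.append(window_mean(grid, i * step, (i + 1) * step,
                                         j * step, (j + 1) * step))
            tokens.append(token)
    return tokens


# Hypothetical inputs: a 2x2 semantic map and one 4x4 detail map.
semantic = [[1.0, 2.0], [3.0, 4.0]]
detail = [[2.0] * 4 for _ in range(4)]
levels = build_pyramid(semantic, [detail])
tokens = condense(levels, out_size=2)
```

In this sketch the number of output tokens depends only on `out_size`, not on how many pyramid levels are added, which is the sense in which cross-scale windows condense the pyramid; each token simply carries one pooled value per level.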
