LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid

Authors

  • Yipeng Zhang Tsinghua University
  • Yifan Liu Tsinghua University
  • Zonghao Guo Tsinghua University
  • Yidan Zhang University of the Chinese Academy of Sciences
  • Xuesong Yang University of the Chinese Academy of Sciences
  • Xiaoying Zhang The Chinese University of Hong Kong
  • Chi Chen Tsinghua University
  • Jun Song Alibaba Group
  • Yuan Yao Shanghai Qi Zhi Institute; National University of Singapore
  • Tat-Seng Chua National University of Singapore
  • Maosong Sun Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v40i15.38292

Abstract

Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they perform poorly on tasks that require fine-grained visual perception. We attribute this to the inherent limitations of ViTs in capturing diverse levels of visual semantics. To address this, we present the Hierarchical window (Hiwin) transformer, a plug-and-play solution for MLLMs centered around our inverse semantic pyramid (ISP). The Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level, language-aligned semantic features, thereby constructing an ISP, and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Notably, our design achieves an average boost of 3.7% across 14 benchmarks over the baseline method, including a 9.3% gain on DocVQA.
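
The two modules named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the operators, so every function name below is hypothetical, nearest-neighbour upsampling plus a weighted sum stands in for detail injection, and plain mean pooling stands in for the window attention. It only shows the data flow: a coarse semantic map is repeatedly upsampled and enriched with finer detail to form a pyramid, and each cell of a fixed grid then gathers the matching window from every level.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of floats."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def inject_detail(semantic, detail, alpha=0.5):
    """Upsample the semantic map, then add a weighted detail map of the new size.

    Assumption: a weighted sum stands in for the paper's injection module.
    """
    up = upsample2x(semantic)
    return [[s + alpha * d for s, d in zip(srow, drow)]
            for srow, drow in zip(up, detail)]


def build_pyramid(semantic, details, alpha=0.5):
    """Return [coarsest, ..., finest]; each level doubles the resolution."""
    levels = [semantic]
    for detail in details:
        levels.append(inject_detail(levels[-1], detail, alpha))
    return levels


def window_mean(fmap, r0, c0, size):
    """Mean of the size x size window whose top-left corner is (r0, c0)."""
    vals = [fmap[r][c]
            for r in range(r0, r0 + size)
            for c in range(c0, c0 + size)]
    return sum(vals) / len(vals)


def condense(levels, grid):
    """Condense all levels into a grid x grid map.

    Each grid cell averages the spatially matching window on every level.
    Assumption: mean pooling replaces the attention used in the paper.
    """
    out = []
    for gr in range(grid):
        row = []
        for gc in range(grid):
            feats = []
            for fmap in levels:
                size = len(fmap) // grid  # window grows with resolution
                feats.append(window_mean(fmap, gr * size, gc * size, size))
            row.append(sum(feats) / len(feats))
        out.append(row)
    return out


# Usage: a 2x2 semantic map and one 4x4 detail map (values are made up).
semantic = [[1.0, 2.0], [3.0, 4.0]]
detail = [[2.0] * 4 for _ in range(4)]
pyramid = build_pyramid(semantic, [detail])
tokens = condense(pyramid, grid=2)
```

In this sketch the number of output tokens is fixed by `grid` regardless of how many pyramid levels are added, which mirrors the idea of condensing multi-level features into a compact set of visual tokens.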

Published

2026-03-14

How to Cite

Zhang, Y., Liu, Y., Guo, Z., Zhang, Y., Yang, X., Zhang, X., … Sun, M. (2026). LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12934–12942. https://doi.org/10.1609/aaai.v40i15.38292

Section

AAAI Technical Track on Computer Vision XII