LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid
DOI:
https://doi.org/10.1609/aaai.v40i15.38292
Abstract
Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks that require fine-grained visual perception. We attribute this to the inherent limitations of ViTs in capturing diverse levels of visual semantics. To address this, we present the Hierarchical window (Hiwin) transformer, a plug-and-play solution for MLLMs centered around our inverse semantic pyramid (ISP). The Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level, language-aligned semantic features, thereby constructing an ISP, and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Notably, our design achieves an average boost of 3.7% across 14 benchmarks compared with the baseline method, including 9.3% on DocVQA.
Published
2026-03-14
How to Cite
Zhang, Y., Liu, Y., Guo, Z., Zhang, Y., Yang, X., Zhang, X., … Sun, M. (2026). LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12934–12942. https://doi.org/10.1609/aaai.v40i15.38292
Issue
Section
AAAI Technical Track on Computer Vision XII