Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

Authors

  • Miao Rang, Huawei Noah's Ark Lab
  • Zhenni Bi, Huawei Noah's Ark Lab
  • Chuanjian Liu, Huawei Technologies Ltd.
  • Yehui Tang, Huawei Technologies Ltd.
  • Kai Han, Huawei Noah's Ark Lab
  • Yunhe Wang, Huawei Noah's Ark Lab

DOI:

https://doi.org/10.1609/aaai.v39i7.32718

Abstract

Multimodal vision language models (VLMs) have made significant progress, supported by continuously increasing model sizes and data volumes. However, running VLMs on edge devices remains a challenge, limiting their widespread application. Existing efforts toward efficient VLMs often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this dilemma, we introduce Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach yields a versatile model with only 1.8B parameters that delivers significant improvements on both multimodal and language tasks. Notably, among models below 3B parameters, Eve clearly leads on language benchmarks and achieves state-of-the-art results on VLM benchmarks; its multimodal accuracy also surpasses that of the larger 7B LLaVA-1.5 model.

Published

2025-04-11

How to Cite

Rang, M., Bi, Z., Liu, C., Tang, Y., Han, K., & Wang, Y. (2025). Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 6694–6702. https://doi.org/10.1609/aaai.v39i7.32718

Section

AAAI Technical Track on Computer Vision VI