LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages

Authors

  • Junchen Li Du Xiaoman Finance, Beijing, China
  • Qing Yang Du Xiaoman Finance, Beijing, China
  • Bojian Jiang Du Xiaoman Finance, Beijing, China
  • Shaolin Zhu Tianjin University, Tianjin, China
  • Qingxuan Sun Du Xiaoman Finance, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v39i23.34623

Abstract

Multilingual large language-vision models (LVLMs), which understand and generate both text and images across multiple languages, have achieved remarkable performance on English-centric multimodal generation tasks. However, their performance on non-English tasks has been underwhelming. One major challenge with multilingual LVLMs is the modality gap between visual inputs and multilingual textual inputs/outputs, caused by the lack of high-quality multilingual training data. In this paper, we propose LRM-LLaVA, a multilingual large language-vision model designed for low-resource languages to overcome the modality gap. It is composed of four components: a visual encoder, a multilingual large language model, a vision-text representation projector, and a cross-modal regularizer. Both the projector and the regularizer aim to reduce the modality gap and improve multilingual performance. To train LRM-LLaVA, we employ a two-stage training strategy consisting of pre-training and instruction fine-tuning. Meanwhile, we construct a multilingual visual question answering dataset based on English open-source datasets and adopt multiple task instructions. To evaluate the performance of LVLMs across various languages, we construct four multilingual benchmarks covering 10 languages, based on English open-source benchmarks. Experimental results show that LRM-LLaVA achieves competitive performance compared to other multilingual LVLMs with a similar number of parameters.
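The four-component design described above (visual encoder, projector into the text space, cross-modal regularizer, multilingual LLM) can be sketched at a toy scale. This is a minimal illustration only: all dimensions, weights, and the particular squared-distance form of the regularizer are assumptions for demonstration, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
PIX, VIS, TXT = 48, 64, 128  # toy dimensions, chosen arbitrarily

W_enc = 0.01 * rng.standard_normal((VIS, PIX))   # stand-in for the visual encoder
W_proj = 0.01 * rng.standard_normal((TXT, VIS))  # vision-text representation projector

def forward(pixels, text_emb):
    """Project a toy 'image' into the text embedding space and compute a
    cross-modal regularization term (here: mean squared distance between
    the projected visual features and the text embedding -- one plausible
    instantiation, not the paper's exact loss)."""
    v = W_proj @ (W_enc @ pixels)            # visual path: encode, then project
    reg_loss = np.mean((v - text_emb) ** 2)  # cross-modal regularizer
    fused = v + text_emb                     # fused representation a multilingual LLM would consume
    return fused, reg_loss

image = rng.standard_normal(PIX)
text = rng.standard_normal(TXT)
fused, reg = forward(image, text)
```

The key point the sketch captures is that both the projector and the regularizer operate on the same projected visual features, so training signals from the regularizer directly pull the two modalities' representations together.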

Published

2025-04-11

How to Cite

Li, J., Yang, Q., Jiang, B., Zhu, S., & Sun, Q. (2025). LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24449–24457. https://doi.org/10.1609/aaai.v39i23.34623

Section

AAAI Technical Track on Natural Language Processing II