AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning

Authors

  • Jun Gao School of Computer Science and Technology, Soochow University, China
  • Qian Qiao School of Computer Science and Technology, Soochow University, China
  • Tianxiang Wu School of Computer Science and Technology, Soochow University, China
  • Zili Wang Independent Researcher
  • Ziqiang Cao School of Computer Science and Technology, Soochow University, China
  • Wenjie Li Department of Computing, The Hong Kong Polytechnic University, Hong Kong

DOI:

https://doi.org/10.1609/aaai.v39i3.32316

Abstract

In-context learning (ICL) enables Large Language Models (LLMs) to exhibit emergent abilities on downstream tasks without updating billions of parameters. However, in the area of multimodal Large Language Models (MLLMs), two problems hinder the application of multimodal ICL: (1) Most mainstream MLLMs are trained only on single-image datasets and therefore cannot read the extra multimodal demonstrations. (2) As the number of demonstrations grows, the thousands of visual tokens they introduce strain hardware and degrade ICL performance. In preliminary explorations, we discovered that the inner LLM attends more to the linguistic modality within multimodal demonstrations during generation. We therefore propose AIM, a general and lightweight framework that tackles both problems by Aggregating the Image information of each Multimodal demonstration into the latent space of its corresponding textual label. After aggregation, AIM substitutes each demonstration with fused virtual tokens whose length is reduced to that of its text alone. Beyond shortening the input, AIM upgrades MLLMs pre-trained on image-text pairs to support multimodal ICL, since the images in demonstrations are no longer fed to the model directly. Furthermore, because demonstrations are aggregated independently, AIM maintains a Demonstration Bank (DB) that avoids repeated aggregation, significantly boosting efficiency. We build AIM upon Qwen-VL and LLaVA-Next and evaluate it comprehensively on image captioning, VQA, and hateful speech detection. Strong results show that AIM provides an efficient and effective solution for upgrading MLLMs to multimodal ICL.
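The two mechanisms the abstract describes, fusing each demonstration's visual tokens into its (much shorter) text tokens, and caching the result in a Demonstration Bank so each demonstration is aggregated only once, can be illustrated with a minimal, hypothetical sketch. The uniform-mean fusion rule and all names below (`aggregate_demo`, `DemonstrationBank`) are illustrative stand-ins, not AIM's actual method, which operates inside the MLLM's own layers:

```python
import numpy as np

def aggregate_demo(image_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in for AIM-style aggregation: fold image-token information
    into the text-token representations, so the fused demonstration is only
    as long as its text. (Hypothetical fusion rule: add the mean image token
    to every text token; the real method is learned inside the MLLM.)"""
    image_summary = image_tokens.mean(axis=0, keepdims=True)  # (1, d)
    return text_tokens + image_summary                        # (T_text, d)

class DemonstrationBank:
    """Cache of aggregated demonstrations, so fusion work is never repeated."""
    def __init__(self):
        self._bank = {}
        self.aggregations = 0  # counts actual fusion passes performed

    def get(self, demo_id, image_tokens, text_tokens):
        if demo_id not in self._bank:
            self._bank[demo_id] = aggregate_demo(image_tokens, text_tokens)
            self.aggregations += 1
        return self._bank[demo_id]

rng = np.random.default_rng(0)
img = rng.normal(size=(256, 16))   # 256 visual tokens, hidden dim 16
txt = rng.normal(size=(8, 16))     # 8 text-label tokens

db = DemonstrationBank()
fused_first = db.get("demo-0", img, txt)
fused_again = db.get("demo-0", img, txt)   # served from the bank, no recompute

print(fused_first.shape)   # (8, 16): fused length equals the text length
print(db.aggregations)     # 1: the demonstration was aggregated only once
```

The key efficiency claims show up directly: the 256 visual tokens never enter the prompt (the fused sequence has the text's length of 8), and repeated requests for the same demonstration hit the bank instead of re-running aggregation.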

Published

2025-04-11

How to Cite

Gao, J., Qiao, Q., Wu, T., Wang, Z., Cao, Z., & Li, W. (2025). AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 3077-3085. https://doi.org/10.1609/aaai.v39i3.32316

Section

AAAI Technical Track on Computer Vision II