AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning

Authors

  • Jun Gao School of Computer Science and Technology, Soochow University, China
  • Qian Qiao School of Computer Science and Technology, Soochow University, China
  • Tianxiang Wu School of Computer Science and Technology, Soochow University, China
  • Zili Wang Independent Researcher
  • Ziqiang Cao School of Computer Science and Technology, Soochow University, China
  • Wenjie Li Department of Computing, The Hong Kong Polytechnic University, Hong Kong

DOI:

https://doi.org/10.1609/aaai.v39i3.32316

Abstract

In-context learning (ICL) enables Large Language Models (LLMs) to exhibit emergent abilities on downstream tasks without updating billions of parameters. However, in the area of multimodal Large Language Models (MLLMs), two problems hinder the application of multimodal ICL: (1) Most mainstream MLLMs are trained only on single-image datasets and therefore cannot read the extra multimodal demonstrations. (2) As the number of demonstrations grows, the thousands of visual tokens they introduce strain hardware and degrade ICL performance. In preliminary explorations, we discovered that the inner LLM attends more to the linguistic modality within multimodal demonstrations during generation. We therefore propose AIM, a general and lightweight framework that tackles both problems by Aggregating the Image information of each Multimodal demonstration into the latent space of its corresponding textual label. After aggregation, AIM substitutes each demonstration with fused virtual tokens whose length is reduced to that of its text alone. Beyond shortening the input, AIM upgrades MLLMs pre-trained on image-text pairs to support multimodal ICL, since the images in demonstrations are no longer fed to the model directly. Furthermore, because demonstrations are aggregated independently, AIM maintains a Demonstration Bank (DB) that avoids repeated aggregation, significantly boosting efficiency. We build AIM upon Qwen-VL and LLaVA-Next and evaluate it comprehensively on image captioning, VQA, and hateful speech detection. Strong results show that AIM provides an efficient and effective solution for upgrading MLLMs to multimodal ICL.
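The two mechanisms the abstract describes, fusing each demonstration's visual tokens into its (much shorter) text tokens, and caching the result in a Demonstration Bank so each demonstration is aggregated only once, can be illustrated with a minimal, hypothetical sketch. The uniform-mean fusion rule and all names below (`aggregate_demo`, `DemonstrationBank`) are illustrative stand-ins, not AIM's actual method, which operates inside the MLLM's own layers:

```python
import numpy as np

def aggregate_demo(image_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in for AIM-style aggregation: fold image-token information
    into the text-token representations, so the fused demonstration is only
    as long as its text. (Hypothetical fusion rule: add the mean image token
    to every text token; the real method is learned inside the MLLM.)"""
    image_summary = image_tokens.mean(axis=0, keepdims=True)  # (1, d)
    return text_tokens + image_summary                        # (T_text, d)

class DemonstrationBank:
    """Cache of aggregated demonstrations, so fusion work is never repeated."""
    def __init__(self):
        self._bank = {}
        self.aggregations = 0  # counts actual fusion passes performed

    def get(self, demo_id, image_tokens, text_tokens):
        if demo_id not in self._bank:
            self._bank[demo_id] = aggregate_demo(image_tokens, text_tokens)
            self.aggregations += 1
        return self._bank[demo_id]

rng = np.random.default_rng(0)
img = rng.normal(size=(256, 16))   # 256 visual tokens, hidden dim 16
txt = rng.normal(size=(8, 16))     # 8 text-label tokens

db = DemonstrationBank()
fused_first = db.get("demo-0", img, txt)
fused_again = db.get("demo-0", img, txt)   # served from the bank, no recompute

print(fused_first.shape)   # (8, 16): fused length equals the text length
print(db.aggregations)     # 1: the demonstration was aggregated only once
```

The key efficiency claims show up directly: the 256 visual tokens never enter the prompt (the fused sequence has the text's length of 8), and repeated requests for the same demonstration hit the bank instead of re-running aggregation.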

Published

2025-04-11

How to Cite

Gao, J., Qiao, Q., Wu, T., Wang, Z., Cao, Z., & Li, W. (2025). AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 3077-3085. https://doi.org/10.1609/aaai.v39i3.32316

Section

AAAI Technical Track on Computer Vision II