MMIFEvol: Towards Evolutionary Multimodal Instruction Following

Haoyu Wang; Sihang Jiang; Xiangru Zhu; Yuyan Chen; Xiaojun Meng; Jiansheng Wei; Yitong Wang; Yanghua Xiao

doi:10.1609/aaai.v40i31.39824

Authors

Haoyu Wang Fudan University
Sihang Jiang Fudan University
Xiangru Zhu Fudan University
Yuyan Chen Fudan University; Cornell University
Xiaojun Meng Huawei Large Model Data Technology Lab
Jiansheng Wei Huawei Large Model Data Technology Lab
Yitong Wang Fudan University
Yanghua Xiao Fudan University

DOI:

https://doi.org/10.1609/aaai.v40i31.39824

Abstract

Multimodal instruction following is a fundamental capability of multimodal language models: it requires accurately comprehending and executing user-provided instructions. However, existing multimodal instruction-following datasets and benchmarks have three shortcomings. (a) Lack of difficulty stratification: they collect diverse instruction categories but neglect to stratify difficulty levels across these categories, which leads to overlap, bias, and low interpretability. (b) Lack of fine-grained metrics: they conflate the model's ability to "solve tasks" and to "follow constraints" into a single metric, which fails to accurately reflect its instruction-following capability. (c) Lack of multi-task instructions: they overlook the fact that real-world user instructions often combine multiple tasks. This paper proposes MMIFEvol, a framework for multimodal instruction evolving and benchmarking. First, we define the essential components of a carefully curated multimodal instruction set and establish corresponding difficulty levels, based on which we synthesize diverse instruction data. Next, we decouple the evaluation criteria for instruction following into three distinct metrics to construct a high-quality benchmark and assess existing models. Experimental results demonstrate that current models still struggle to follow complex instructions, while fine-tuning on MMIFEvol data effectively improves models' responsiveness to multimodal instructions.
