MMIFEvol: Towards Evolutionary Multimodal Instruction Following

Authors

  • Haoyu Wang Fudan University
  • Sihang Jiang Fudan University
  • Xiangru Zhu Fudan University
  • Yuyan Chen Fudan University, Cornell University
  • Xiaojun Meng Huawei Large Model Data Technology Lab
  • Jiansheng Wei Huawei Large Model Data Technology Lab
  • Yitong Wang Fudan University
  • Yanghua Xiao Fudan University

DOI:

https://doi.org/10.1609/aaai.v40i31.39824

Abstract

Multimodal instruction following is a fundamental capability of multimodal language models, involving accurate comprehension and execution of user-provided instructions. However, existing multimodal instruction-following datasets and benchmarks have three shortcomings. (a) Lack of Difficulty Stratification: they collect diverse instruction categories but neglect to stratify difficulty levels across these categories, which leads to overlap, bias, and low interpretability. (b) Lack of Fine-Grained Metrics: they conflate the model's ability to "solve tasks" and its ability to "follow constraints" into a single metric, which fails to accurately reflect its instruction-following capability. (c) Lack of Multi-Task Instructions: they overlook the fact that real-world user instructions often combine multiple tasks. This paper proposes MMIFEvol, a framework for multimodal instruction evolving and benchmarking. First, we define the essential components of a carefully curated multimodal instruction set and establish corresponding difficulty levels, based on which we synthesize diverse instruction data. Next, we decouple the evaluation criteria for instruction following into three distinct metrics to construct a high-quality benchmark and assess existing models. Experimental results demonstrate that current models still struggle to follow complex instructions, while fine-tuning on MMIFEvol data effectively improves models' responsiveness to multimodal instructions.

Published

2026-03-14

How to Cite

Wang, H., Jiang, S., Zhu, X., Chen, Y., Meng, X., Wei, J., … Xiao, Y. (2026). MMIFEvol: Towards Evolutionary Multimodal Instruction Following. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26206–26214. https://doi.org/10.1609/aaai.v40i31.39824

Section

AAAI Technical Track on Machine Learning VIII