LLaVA-MS-PIT: Multi-Modal Schema-Guided Progressive Instruction Tuning for Multi-Modal Event Extraction
DOI:
https://doi.org/10.1609/aaai.v40i41.40770
Abstract
The proliferation of multi-modal data on the internet has intensified the need for structured event understanding across textual and visual modalities. However, existing multi-modal event extraction models suffer from three major limitations: the absence of explicit event schema guidance, coarse-grained multi-modal alignment strategies, and reliance on heterogeneous, misaligned multi-modal training datasets. To address these issues, we propose LLaVA-MS-PIT, a Multi-modal Schema-Guided Progressive Instruction Tuning Framework that explicitly injects structured multi-modal event schema knowledge into the model before event extraction. Specifically, we introduce the textual event schema to establish the model’s prior knowledge of event concepts and enhance its ability to reason about event structures, while the visual event schema is employed to bridge the representation gap between textual and visual modalities at the event level, enabling unified and semantically aligned event representations across modalities. Moreover, to alleviate data scarcity and modality misalignment inherent in current benchmarks, we construct imSitu-MEE, a high-quality multi-modal parallel dataset generated and annotated through schema-guided procedures. Extensive experiments demonstrate that LLaVA-MS-PIT achieves competitive performance on multi-modal event extraction benchmarks, underscoring the effectiveness and necessity of schema-guided progressive instruction tuning.
Published
2026-03-14
How to Cite
Zhang, H., Hu, P., & Zhang, W. E. (2026). LLaVA-MS-PIT: Multi-Modal Schema-Guided Progressive Instruction Tuning for Multi-Modal Event Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34692–34700. https://doi.org/10.1609/aaai.v40i41.40770
Issue
Section
AAAI Technical Track on Natural Language Processing VI