MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

Authors

  • Liuyue Xie, Carnegie Mellon University
  • Avik Kuthiala, Carnegie Mellon University
  • George Z Wei, Carnegie Mellon University
  • Ce Zheng, Carnegie Mellon University
  • Ananya Bal, Carnegie Mellon University
  • Mosam Dabhi, Carnegie Mellon University
  • Liting Wen, Carnegie Mellon University
  • Taru Rustagi, Carnegie Mellon University
  • Ethan Lai, Carnegie Mellon University
  • Sushil Khyalia, Carnegie Mellon University
  • Rohan Choudhury, Carnegie Mellon University
  • Morteza Ziyadi, Amazon
  • Xu Zhang, Amazon
  • Hao Yang, Amazon
  • Laszlo A. Jeni, Carnegie Mellon University

DOI:

https://doi.org/10.1609/aaai.v40i32.39923

Abstract

We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark for probing video understanding in multimodal LLMs that spans video, audio, and text inputs and includes human performance baselines. Although audiovisual models have recently made substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modal comprehension. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to require tight integration of video and audio information across a broad spectrum of agentic scenarios. Its questions closely mimic the multimodal understanding experience available to humans during decision-making. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration at this granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show accuracy of around 64%, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorous annotation pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence, and its website is publicly available.

Published

2026-03-14

How to Cite

Xie, L., Kuthiala, A., Wei, G. Z., Zheng, C., Bal, A., Dabhi, M., … Jeni, L. A. (2026). MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27090–27098. https://doi.org/10.1609/aaai.v40i32.39923

Section

AAAI Technical Track on Machine Learning IX