OmniBench: A Comprehensive Benchmark Integrating Real-World, Time-sensitive, and Multi-Hop Questions with a Multi-Dimensional Hybrid Evaluation Framework

Authors

  • Wenjie Wang Ant Group
  • Yufeng Jiang Ant Group
  • Ge Sun Ant Group
  • Chenghang Dong Ant Group
  • Zheng Jun Ant Group
  • Li Mengjie Ant Group
  • Lixin Chen Ant Group
  • Huan Wang Ant Group
  • Haoyu Wang Ant Group
  • Bin Chen Ant Group

DOI:

https://doi.org/10.1609/aaai.v40i40.40655

Abstract

With the rapidly increasing capabilities of Large Language Models (LLMs), AI applications have emerged to address a wide range of problems in people's daily lives, making accurate measurement of their performance and reliability paramount. However, existing benchmarks rely predominantly on closed-ended multiple-choice or short-answer question formats. While useful for assessment, these formats differ substantially from the diverse, open-ended questions posed by real-world users. To bridge this gap, we present OmniBench, a comprehensive open-domain benchmark. OmniBench is uniquely composed of authentic, user-generated questions harvested from real-world interactions on various websites and applications, covering 16 rigorously defined knowledge domains and 5 crucial user intents derived from a large-scale analysis of a massive corpus. Crucially, we propose three automated data construction pipelines that enable continuous, periodic updating of the benchmark dataset. This approach not only keeps the questions current with real-world events but also effectively mitigates the critical issue of data contamination prevalent in static benchmarks. Moreover, we propose a multi-dimensional hybrid evaluation framework, named OmniEval, for evaluating responses. This framework combines diverse metrics and evaluation methods to capture nuanced aspects of answer quality. Extensive validation demonstrates that the evaluation framework aligns strongly with human judgments, ensuring the reliability of the benchmark results.

Published

2026-03-14

How to Cite

Wang, W., Jiang, Y., Sun, G., Dong, C., Jun, Z., Mengjie, L., … Chen, B. (2026). OmniBench: A Comprehensive Benchmark Integrating Real-World, Time-sensitive, and Multi-Hop Questions with a Multi-Dimensional Hybrid Evaluation Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33657–33665. https://doi.org/10.1609/aaai.v40i40.40655

Section

AAAI Technical Track on Natural Language Processing V