OmniBench: A Comprehensive Benchmark Integrating Real-World, Time-sensitive, and Multi-Hop Questions with a Multi-Dimensional Hybrid Evaluation Framework

Authors

  • Wenjie Wang Ant Group
  • Yufeng Jiang Ant Group
  • Ge Sun Ant Group
  • Chenghang Dong Ant Group
  • Zheng Jun Ant Group
  • Li Mengjie Ant Group
  • Lixin Chen Ant Group
  • Huan Wang Ant Group
  • Haoyu Wang Ant Group
  • Bin Chen Ant Group

DOI:

https://doi.org/10.1609/aaai.v40i40.40655

Abstract

With the rapidly increasing capabilities of Large Language Models (LLMs), AI applications have emerged to address a wide range of problems in people's daily lives, making accurate measurement of their performance and reliability paramount. However, existing benchmarks rely predominantly on closed-ended multiple-choice or short-answer question formats. While useful for assessment, these formats differ substantially from the diverse, open-ended questions posed by real-world users. To bridge this gap, we present OmniBench, a comprehensive open-domain benchmark. OmniBench is uniquely composed of authentic, user-generated questions harvested from real-world interactions on various websites and applications, covering 16 rigorously defined knowledge domains and 5 crucial user intents derived from a large-scale analysis of a massive corpus. Crucially, we propose three automated data construction pipelines that enable continuous, periodic updating of the benchmark dataset. This approach not only keeps the questions current with real-world events but also effectively mitigates the critical issue of data contamination prevalent in static benchmarks. Moreover, we propose a multi-dimensional hybrid evaluation framework, named OmniEval, for evaluating responses. This framework combines diverse metrics and evaluation methods to capture nuanced aspects of answer quality. Extensive validation demonstrates that the evaluation framework aligns strongly with human judgments, ensuring the reliability of the benchmark results.

Published

2026-03-14

How to Cite

Wang, W., Jiang, Y., Sun, G., Dong, C., Jun, Z., Mengjie, L., … Chen, B. (2026). OmniBench: A Comprehensive Benchmark Integrating Real-World, Time-sensitive, and Multi-Hop Questions with a Multi-Dimensional Hybrid Evaluation Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33657–33665. https://doi.org/10.1609/aaai.v40i40.40655

Section

AAAI Technical Track on Natural Language Processing V