Deep Research Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks

Haiyuan Wan; Chen Yang; Junchi Yu; Meiqi Tu; Jiaxuan Lu; Di Yu; Jianbao Cao; Ben Gao; Jiaqing Xie; Aoran Wang; Wenlong Zhang; Philip Torr; Dongzhan Zhou

doi:10.1609/aaai.v40i39.40620

Authors

Haiyuan Wan (Shanghai Artificial Intelligence Laboratory; Tsinghua University)
Chen Yang (The Hong Kong University of Science and Technology, Guangzhou)
Junchi Yu (University of Oxford)
Meiqi Tu (University of Hong Kong)
Jiaxuan Lu (Shanghai Artificial Intelligence Laboratory)
Di Yu (Shanghai Artificial Intelligence Laboratory; Tsinghua University)
Jianbao Cao (Shanghai Artificial Intelligence Laboratory; Wuhan University)
Ben Gao (Shanghai Artificial Intelligence Laboratory; Wuhan University)
Jiaqing Xie (Shanghai Artificial Intelligence Laboratory)
Aoran Wang (Shanghai Artificial Intelligence Laboratory)
Wenlong Zhang (Shanghai Artificial Intelligence Laboratory)
Philip Torr (University of Oxford)
Dongzhan Zhou (Shanghai Artificial Intelligence Laboratory)

Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.
