Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

Authors

  • Zhouhong Gu Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Xiaoxuan Zhu Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Haoning Ye Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Lin Zhang Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Jianchen Wang Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Yixin Zhu Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Sihang Jiang Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Zhuozhi Xiong Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Zihan Li Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Weijie Wu Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Qianyu He Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Rui Xu Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Wenhao Huang Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Jingping Liu School of Information Science and Engineering, East China University of Science and Technology
  • Zili Wang Xiaohongshu Inc
  • Shusen Wang Xiaohongshu Inc
  • Weiguo Zheng School of Data Science, Fudan University
  • Hongwei Feng Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China
  • Yanghua Xiao Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China; Fudan-Aishu Cognitive Intelligence Joint Research Center

DOI:

https://doi.org/10.1609/aaai.v38i16.29767

Keywords:

NLP: (Large) Language Models, NLP: Applications

Abstract

New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines spanning 13 subjects, and is accompanied by Xiezhi-Specialty with 14,041 questions and Xiezhi-Interdiscipline with 10,746 questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed average human performance in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. All evaluation code and data are open-sourced at https://github.com/MikeGu721/XiezhiBenchmark

Published

2024-03-24

How to Cite

Gu, Z., Zhu, X., Ye, H., Zhang, L., Wang, J., Zhu, Y., Jiang, S., Xiong, Z., Li, Z., Wu, W., He, Q., Xu, R., Huang, W., Liu, J., Wang, Z., Wang, S., Zheng, W., Feng, H., & Xiao, Y. (2024). Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18099-18107. https://doi.org/10.1609/aaai.v38i16.29767

Section

AAAI Technical Track on Natural Language Processing I