DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation

Authors

  • Zihui Gu (Renmin University of China; Tencent Inc.)
  • Xingwu Sun (Tencent Inc.; University of Macau)
  • Fengzong Lian (Tencent Inc.)
  • Zhanhui Kang (Tencent Inc.)
  • Chengzhong Xu (University of Macau)
  • Ju Fan (Renmin University of China)

DOI:

https://doi.org/10.1609/aaai.v38i16.29768

Keywords:

NLP: Interpretability, Analysis, and Evaluation of NLP Models, ML: Evaluation and Analysis, NLP: (Large) Language Models, NLP: Applications

Abstract

Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their instruction-following capabilities remains a challenge due to the complexity and diversity of real-world user instructions. Although existing evaluation methods focus on general skills, they suffer from two main shortcomings: a lack of fine-grained task-level evaluation and a reliance on a single expression of each instruction. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset with two main advantages: (1) DINGO is based on a manually annotated, fine-grained, multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO not only provides a more challenging and comprehensive evaluation for LLMs, but also offers fine-grained, task-level directions for further improving them.
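
To make the abstract's organizing idea concrete, below is a minimal, hypothetical Python sketch (not the authors' released code) of how a multi-level category tree of evaluation tasks, with multiple phrasings per leaf task, might be represented; the names CategoryNode, add_child, and leaf_tasks are illustrative assumptions, not part of DINGO.

    # Hypothetical sketch of a multi-level task-category tree for
    # instruction-following evaluation. Not the authors' implementation.
    from dataclasses import dataclass, field


    @dataclass
    class CategoryNode:
        """One node in a multi-level task-category tree."""
        name: str
        children: list["CategoryNode"] = field(default_factory=list)
        # Diverse expressions of the same underlying task (leaf nodes only).
        instructions: list[str] = field(default_factory=list)

        def add_child(self, child: "CategoryNode") -> "CategoryNode":
            self.children.append(child)
            return child

        def leaf_tasks(self):
            """Yield leaf categories, i.e., the fine-grained tasks to evaluate."""
            if not self.children:
                yield self
            for child in self.children:
                yield from child.leaf_tasks()


    # Example: a tiny two-level slice of such a tree.
    root = CategoryNode("instruction-following")
    writing = root.add_child(CategoryNode("writing"))
    writing.add_child(CategoryNode("email drafting", instructions=[
        "Write a polite follow-up email to a recruiter.",
        # Same task, different instruction expression:
        "Draft a short email following up on last week's interview.",
    ]))

    for task in root.leaf_tasks():
        print(task.name, len(task.instructions))

In DINGO itself, per the abstract, the tree has 130 manually annotated nodes derived from real-world user requests, and the instructions at each task are written by both GPT-4 and human experts.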

Published

2024-03-24

How to Cite

Gu, Z., Sun, X., Lian, F., Kang, Z., Xu, C., & Fan, J. (2024). DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18108-18116. https://doi.org/10.1609/aaai.v38i16.29768

Section

AAAI Technical Track on Natural Language Processing I