Can Large Language Models Understand Real-World Complex Instructions?

Authors

  • Qianyu He Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
  • Jie Zeng Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
  • Wenhao Huang Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
  • Lina Chen School of Data Science, Fudan University
  • Jin Xiao School of Data Science, Fudan University
  • Qianxi He Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
  • Xunzhe Zhou Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
  • Jiaqing Liang School of Data Science, Fudan University
  • Yanghua Xiao Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29777

Keywords:

NLP: (Large) Language Models, NLP: Interpretability, Analysis, and Evaluation of NLP Models

Abstract

Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that involve multiple tasks and constraints, or complex inputs that contain long context, noise, heterogeneous information, and multi-turn formats. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample-count constraints, and remain unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are closed-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features of complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four evaluation criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. Through extensive experiments, we compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.

Published

2024-03-24

How to Cite

He, Q., Zeng, J., Huang, W., Chen, L., Xiao, J., He, Q., Zhou, X., Liang, J., & Xiao, Y. (2024). Can Large Language Models Understand Real-World Complex Instructions?. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18188-18196. https://doi.org/10.1609/aaai.v38i16.29777

Section

AAAI Technical Track on Natural Language Processing I