Explore What LLM Does Not Know in Complex Question Answering

Authors

  • Xin Lin School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China
  • Zhenya Huang School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
  • Zhiqiang Zhang Independent Researcher
  • Jun Zhou Zhejiang University, Hangzhou, China
  • Enhong Chen School of Computer Science and Technology, University of Science and Technology of China, Hefei, China State Key Laboratory of Cognitive Intelligence, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v39i23.34638

Abstract

Complex question answering (QA) is a challenging task in artificial intelligence research that requires reasoning over related knowledge. Retrieval-augmented generation (RAG) based on large language models (LLMs) has become a promising solution for QA. To facilitate RAG more effectively, the LLM needs to precisely evaluate the knowledge required for QA. That is, first, the LLM needs to examine its knowledge boundary (what the LLM does not know) so that external knowledge can be retrieved as a supplement. Second, the LLM needs to evaluate the utility of the retrieved knowledge (whether it helps in reasoning) for robust RAG. To this end, in this paper, we propose a novel Question Answering with Knowledge Evaluation (KEQA) framework to promote the effectiveness and efficiency of RAG in QA. First, inspired by classroom quizzes, we propose a quiz-based method to precisely examine the knowledge state of the uninterpretable LLM for QA. We pose an indicative quiz for each piece of required knowledge and inspect whether the LLM can answer it consistently, thereby probing its knowledge boundary. Second, we retrieve the unknown knowledge from external sources and evaluate its utility to pick the helpful pieces for reasoning. We design a reasoning-based metric to evaluate utility, and we construct a demonstration set from the training data as a reference to guide knowledge picking at inference time. We conduct extensive experiments on four widely used QA datasets, and the results demonstrate the effectiveness of the proposed method.
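The quiz-based knowledge-boundary check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `toy_llm` stand-in, the quiz strings, and the consistency threshold are all hypothetical; the paper's actual quiz construction and consistency criterion may differ.

```python
import itertools
from collections import Counter
from typing import Callable, List

def examine_knowledge_boundary(
    llm: Callable[[str], str],
    quizzes: List[str],
    n_samples: int = 5,
    consistency_threshold: float = 0.8,
) -> List[str]:
    """Return the quizzes whose knowledge the model appears NOT to hold.

    A piece of knowledge is treated as 'known' when repeated sampling of
    the quiz yields the same answer at least `consistency_threshold` of
    the time; inconsistent answers mark it for external retrieval.
    """
    unknown = []
    for quiz in quizzes:
        answers = [llm(quiz) for _ in range(n_samples)]
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / n_samples < consistency_threshold:
            unknown.append(quiz)
    return unknown

# Toy stand-in for an LLM: consistent on one quiz, noisy on the other.
_noise = itertools.cycle(["Paris", "Lyon", "Paris", "Nice", "Lyon"])

def toy_llm(quiz: str) -> str:
    if "capital of France" in quiz:
        return "Paris"        # always the same answer -> known
    return next(_noise)       # answers disagree -> unknown

quizzes = ["What is the capital of France?", "Who founded ExampleCorp?"]
print(examine_knowledge_boundary(toy_llm, quizzes, n_samples=5))
# -> ['Who founded ExampleCorp?']
```

Only the quizzes flagged as unknown would then be sent to the retriever, after which a utility metric filters the retrieved passages before reasoning.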

Published

2025-04-11

How to Cite

Lin, X., Huang, Z., Zhang, Z., Zhou, J., & Chen, E. (2025). Explore What LLM Does Not Know in Complex Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24585-24594. https://doi.org/10.1609/aaai.v39i23.34638

Section

AAAI Technical Track on Natural Language Processing II