Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources

Authors

  • Zhaoheng Huang, Renmin University of China
  • Yutao Zhu, Renmin University of China
  • Jirong Wen, Renmin University of China
  • Zhicheng Dou, Renmin University of China

DOI:

https://doi.org/10.1609/aaai.v40i48.42355

Abstract

Large language models (LLMs) often produce factually inaccurate content, known as hallucinations, which undermines their reliability. Existing factuality evaluation systems usually rely on a single predefined fact source, making them task-specific and hard to extend. We present UFO, a unified framework for factuality evaluation that supports multiple plug-and-play fact sources. UFO integrates human-written evidence, web search results, and LLM knowledge within a single evaluation pipeline, and allows users to flexibly select, reorder, and even define customized sources. The system is accessible through both a Python interface and a web-based demo, offering interactive claim-level verification and visualization. Experiments show that the UFO system achieves moderate consistency with human annotations. Overall, UFO serves as a transparent and extensible platform for benchmarking fact sources, comparing LLMs, and enabling real-world fact-checking applications across diverse domains.
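
The plug-and-play idea in the abstract can be illustrated with a small Python sketch. This is not UFO's actual interface: every name here (`FactSource`, `Verdict`, `dict_source`, `verify`, `factuality_score`) is hypothetical, the sources are toy in-memory lookups standing in for human-written evidence and web search, and the rule that the first source returning evidence decides the verdict is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical: a fact source answers True (supported), False (contradicted),
# or None (no evidence found) for a single claim.
FactSource = Callable[[str], Optional[bool]]


@dataclass
class Verdict:
    claim: str
    supported: bool
    source: Optional[str]  # name of the source that decided, or None


def dict_source(facts: dict) -> FactSource:
    """Toy source backed by an in-memory dict, standing in for a real backend."""
    return lambda claim: facts.get(claim)


def verify(claims: Sequence[str],
           sources: Sequence[Tuple[str, FactSource]]) -> List[Verdict]:
    """Query sources in the given order; the first one with evidence decides."""
    verdicts = []
    for claim in claims:
        # Assumption: a claim with no evidence from any source counts as unsupported.
        verdict = Verdict(claim, supported=False, source=None)
        for name, source in sources:
            answer = source(claim)
            if answer is not None:
                verdict = Verdict(claim, supported=answer, source=name)
                break
        verdicts.append(verdict)
    return verdicts


def factuality_score(verdicts: Sequence[Verdict]) -> float:
    """Fraction of claims judged supported (0.0 for an empty list)."""
    return sum(v.supported for v in verdicts) / len(verdicts) if verdicts else 0.0


# Two toy sources that disagree on the first claim.
human = dict_source({"Paris is the capital of France": True})
web = dict_source({"Paris is the capital of France": False,
                   "The Moon is made of cheese": False})

claims = ["Paris is the capital of France",
          "The Moon is made of cheese",
          "Unknown claim"]
verdicts = verify(claims, [("human", human), ("web", web)])
```

Because sources are plain callables in an ordered list, selecting, reordering, or adding a custom source amounts to editing that list; swapping the order to `[("web", web), ("human", human)]` changes which source decides the first claim.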

Published

2026-03-14

How to Cite

Huang, Z., Zhu, Y., Wen, J., & Dou, Z. (2026). Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41607–41609. https://doi.org/10.1609/aaai.v40i48.42355