Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources
DOI:
https://doi.org/10.1609/aaai.v40i48.42355
Abstract
Large language models (LLMs) often produce factually inaccurate content, or hallucinations, which undermines their reliability. Existing factuality evaluation systems usually rely on a single predefined fact source, making them task-specific and hard to extend. We present UFO, a unified framework for factuality evaluation that supports multiple plug-and-play fact sources. UFO integrates human-written evidence, web search results, and LLM knowledge within a single evaluation pipeline, and allows users to flexibly select, reorder, and even define customized sources. The system is accessible through both a Python interface and a web-based demo, offering interactive claim-level verification and visualization. Experiments show that the UFO system achieves moderate consistency with human annotations. Overall, UFO serves as a transparent and extensible platform for benchmarking fact sources, comparing LLMs, and enabling real-world fact-checking applications across diverse domains.
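To make the idea of selectable, reorderable, and user-defined fact sources concrete, the following is a minimal Python sketch. It is not UFO's actual interface: every name in it (FactSource, Verdict, verify_claims, factuality_score, and the two toy sources) is hypothetical. It also assumes that sources are consulted in order until one returns a verdict, and that the score is the fraction of supported claims; the abstract does not specify either detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type, for illustration only (not UFO's API).
# A fact source maps a claim to True (supported), False (refuted),
# or None when it has no evidence about the claim.
FactSource = Callable[[str], Optional[bool]]


@dataclass
class Verdict:
    claim: str
    supported: Optional[bool]  # None means no source could decide
    source: Optional[str]      # name of the source that decided, if any


def verify_claims(
    claims: Sequence[str],
    sources: Sequence[Tuple[str, FactSource]],
) -> List[Verdict]:
    """Check each claim against the sources in the given order.

    Assumption: the first source that returns a verdict wins, so
    reordering the list changes which source is trusted first.
    """
    verdicts = []
    for claim in claims:
        verdict = Verdict(claim, None, None)
        for name, source in sources:
            result = source(claim)
            if result is not None:
                verdict = Verdict(claim, result, name)
                break
        verdicts.append(verdict)
    return verdicts


def factuality_score(verdicts: Sequence[Verdict]) -> float:
    """Fraction of claims judged supported; undecided claims count as unsupported."""
    if not verdicts:
        return 0.0
    return sum(v.supported is True for v in verdicts) / len(verdicts)


# Two toy sources standing in for human-written evidence and web search.
def human_evidence(claim: str) -> Optional[bool]:
    known = {"Paris is the capital of France.": True}
    return known.get(claim)


def toy_search(claim: str) -> Optional[bool]:
    return False if "moon is made of cheese" in claim.lower() else None


claims = [
    "Paris is the capital of France.",
    "The Moon is made of cheese.",
    "UFO was released in 2026.",
]
sources = [("human_evidence", human_evidence), ("toy_search", toy_search)]
verdicts = verify_claims(claims, sources)
score = factuality_score(verdicts)
```

In this sketch, a customized source is just another function with the same signature, appended to or reordered within the sources list, which is one simple way to get plug-and-play behavior.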
Published
2026-03-14
How to Cite
Huang, Z., Zhu, Y., Wen, J., & Dou, Z. (2026). Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41607–41609. https://doi.org/10.1609/aaai.v40i48.42355