Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources

Zhaoheng Huang; Yutao Zhu; Jirong Wen; Zhicheng Dou

doi:10.1609/aaai.v40i48.42355

Authors

Zhaoheng Huang Renmin University of China
Yutao Zhu Renmin University of China
Jirong Wen Renmin University of China
Zhicheng Dou Renmin University of China

DOI:

https://doi.org/10.1609/aaai.v40i48.42355

Abstract

Large language models (LLMs) often produce factually inaccurate content, or hallucinations, which undermines their reliability. Existing factuality evaluation systems usually rely on a single predefined fact source, making them task-specific and hard to extend. We present UFO, a unified framework for factuality evaluation that supports multiple plug-and-play fact sources. UFO integrates human-written evidence, web search results, and LLM knowledge within a single evaluation pipeline, and allows users to flexibly select, reorder, and even define customized sources. The system is accessible through both a Python interface and a web-based demo, offering interactive claim-level verification and visualization. Experiments show that the UFO system achieves moderate consistency with human annotations. Overall, UFO serves as a transparent and extensible platform for benchmarking fact sources, comparing LLMs, and enabling real-world fact-checking applications across diverse domains.
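
To make the idea of selectable, reorderable, user-defined fact sources concrete, the following is a minimal Python sketch. It is not the UFO interface: every name in it (FactSource, Verdict, human_evidence, llm_knowledge, verify) is hypothetical, the two sources are toy stand-ins, and the first-match fallback order is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a fact source maps a claim to evidence text,
# or returns None when it has nothing relevant to offer.
FactSource = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    claim: str
    source: Optional[str]    # name of the source that supplied evidence
    evidence: Optional[str]  # the evidence text, if any was found


def human_evidence(claim: str) -> Optional[str]:
    """Toy stand-in for human-written evidence: a tiny lookup table."""
    reference = {
        "Paris is the capital of France.": "Paris is France's capital city.",
    }
    return reference.get(claim)


def llm_knowledge(claim: str) -> Optional[str]:
    """Toy stand-in for LLM knowledge: always returns a placeholder note."""
    return f"[model-generated note about: {claim}]"


def verify(
    claims: Sequence[str],
    sources: Sequence[Tuple[str, FactSource]],
) -> List[Verdict]:
    """Check each claim against the sources in the given order.

    Assumption: the first source that returns evidence is used and the
    remaining sources are skipped for that claim.
    """
    verdicts = []
    for claim in claims:
        verdict = Verdict(claim, None, None)
        for name, source in sources:
            evidence = source(claim)
            if evidence is not None:
                verdict = Verdict(claim, name, evidence)
                break
        verdicts.append(verdict)
    return verdicts


# Selecting and ordering sources is just building a list of (name, callable).
pipeline = [("human", human_evidence), ("llm", llm_knowledge)]
claims = ["Paris is the capital of France.", "The moon is made of basalt."]
results = verify(claims, pipeline)
```

In this sketch, a customized source is any function with the same signature, and reordering the list changes which source is consulted first for each claim.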

Published

2026-03-14

How to Cite

Huang, Z., Zhu, Y., Wen, J., & Dou, Z. (2026). Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41607–41609. https://doi.org/10.1609/aaai.v40i48.42355

Issue

Vol. 40 No. 48: EAAI-26 AI for Education, Model AI Assignments, AAAI-26 Emerging Trends, Doctoral Consortium, Student Abstracts, Undergraduate Consortium and Demonstrations

Section

AAAI Demonstration Track
