Evaluation Dimensions for Assessing Question Answer Systems for Lay Users: The Case of DiseaseGuru
Abstract
Question answering (QA) systems can serve as vital tools for addressing lay users' information needs in healthcare. While QA systems have the potential to lessen information overload and provide quality answers, their performance must be evaluated holistically. Here we propose multiple dimensions for this purpose: lexical similarity, semantic similarity, absence of contradictions, and readability of responses. We then use these dimensions to evaluate DiseaseGuru, a chronic-disease QA system we developed that is based on a generative large language model and integrates knowledge graph technology to provide quality responses to lay users. We present results comparing DiseaseGuru with three benchmark algorithms across the dimensions. We also propose metrics for lay users and medical professionals for a future field study of the system.
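For illustration, two of the proposed dimensions could be approximated as follows. This is a minimal sketch, not the paper's actual metrics: `lexical_f1` is a simple token-overlap F1 (a stand-in for lexical-similarity measures such as ROUGE), and `flesch_reading_ease` uses a crude vowel-group syllable heuristic for the readability dimension.

```python
import re

def lexical_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a system answer and a reference answer.
    A simplified stand-in for lexical-similarity metrics like ROUGE."""
    cand = re.findall(r"[a-z']+", candidate.lower())
    ref = re.findall(r"[a-z']+", reference.lower())
    if not cand or not ref:
        return 0.0
    # Count reference tokens, then consume them as candidate tokens match.
    ref_counts: dict[str, int] = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    overlap = 0
    for tok in cand:
        if ref_counts.get(tok, 0) > 0:
            overlap += 1
            ref_counts[tok] -= 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def flesch_reading_ease(text: str) -> float:
    """Approximate Flesch Reading Ease; higher scores mean easier text.
    Syllables are estimated by counting vowel groups (a rough heuristic)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    def syllables(word: str) -> int:
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))
    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (total_syllables / len(words)))
```

In practice, each system answer would be scored against a gold reference for lexical (and, separately, embedding-based semantic) similarity, while readability would be scored on the answer text alone.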