A Dataset to Assess Microsoft Copilot Answers in the Context of Swiss, Bavarian and Hessian Elections

Salvatore Romano; Riccardo Angius; Natalie Kerby; Paul Bouchaud; Jacopo Amidei; Andreas Kaltenbrunner

doi:10.1609/icwsm.v18i1.31446

Authors

Salvatore Romano: AI and Data for Society, IN3, Universitat Oberta de Catalunya, Barcelona, Spain; AI Forensics, Paris, France
Riccardo Angius: AI Forensics, Paris, France
Natalie Kerby: AI Forensics, Paris, France
Paul Bouchaud: AI Forensics, Paris, France; Center for Social Analysis and Mathematics, Paris, France; Complex Systems Institute of Paris, France
Jacopo Amidei: AI and Data for Society, IN3, Universitat Oberta de Catalunya, Barcelona, Spain
Andreas Kaltenbrunner: AI and Data for Society, IN3, Universitat Oberta de Catalunya, Barcelona, Spain; ISI Foundation, Turin, Italy

DOI: https://doi.org/10.1609/icwsm.v18i1.31446

Abstract

This study describes a dataset for assessing the emerging challenges posed by Generative Artificial Intelligence when performing Active Retrieval Augmented Generation (RAG), especially when summarizing trustworthy sources on the Internet. As a case study, we focus on Microsoft Copilot, innovative software that integrates Large Language Models (LLMs) and Search Engines (SE), making advanced AI accessible to the general public. The core contribution of this paper is the presentation of the largest public database to date of RAG responses to user prompts, collected during the 2023 electoral campaigns in Switzerland, Bavaria and Hesse. The dataset was compiled with the assistance of a group of experts who posed realistic voter questions and fact-checked Microsoft Copilot's responses. It contains prompts and answers in English, German, French and Italian. All data were collected during the electoral campaigns, between 21 August 2023 and 2 October 2023. The paper makes available the full set of 5,561 pairs of prompts and answers, including the URLs referenced in the answers. In addition to the dataset itself, we provide 1,374 answers labelled by a group of experts who rated their accuracy in providing factual information, showing that in almost one out of three cases the chatbot responded with either factually incorrect information or completely nonsensical answers. This resource is intended to facilitate further research into the performance of LLMs in the context of elections, defined as a "high-risk scenario" by the Digital Services Act (DSA) Article 34(1)(c).
