FA-KES: A Fake News Dataset around the Syrian War
DOI: https://doi.org/10.1609/icwsm.v13i01.3254
Abstract
Most currently available fake news datasets revolve around US politics, entertainment news, or satire. They are typically scraped from fact-checking websites, where the articles are labeled by human experts. In this paper, we present FA-KES, a fake news dataset around the Syrian war. Given the specific nature of news reporting on war incidents and the lack of available sources from which manually labeled news articles can be scraped, we believe a fake news dataset specifically constructed for this domain is crucial. To ensure a balanced dataset that covers the many facets of the Syrian war, our dataset consists of news articles from several media outlets representing mobilisation press, loyalist press, and diverse print media. To avoid the difficult and often subjective task of manually labeling news articles as true or fake, we employ a semi-supervised fact-checking approach to label the news articles in our dataset. With the help of crowdsourcing, human contributors are prompted to extract specific, easy-to-extract information that helps match a given article to information representing “ground truth” obtained from the Syrian Violations Documentation Center. The extracted information is then used to cluster the articles into two separate sets using unsupervised machine learning. The result is a carefully annotated dataset of 804 articles labeled as true or fake, which is ideal for training machine learning models to predict the credibility of news articles. Our dataset is publicly available at https://doi.org/10.5281/zenodo.2607278. Although our dataset is focused on the Syrian crisis, it can be used to train machine learning models to detect fake news in other related domains. Moreover, the framework we used to obtain the dataset is general enough to be used to build other fake news datasets around military conflicts, provided that corresponding ground truth is available.
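The labeling idea described in the abstract (compare crowd-extracted facts with a ground-truth record, then split the articles into two groups without supervision) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the extracted fields, the distance measure, or the clustering algorithm, so the `casualties` and `day` fields, the `mismatch_features` helper, and the use of a plain two-cluster k-means are hypothetical stand-ins rather than the authors' method.

```python
from math import dist


def mismatch_features(article, event):
    """Hypothetical features: how far an article's extracted facts are
    from the matched ground-truth record (0 means perfect agreement)."""
    casualty_gap = abs(article["casualties"] - event["casualties"]) / max(event["casualties"], 1)
    day_gap = float(abs(article["day"] - event["day"]))
    return (casualty_gap, day_gap)


def two_means(points, iterations=20):
    """Plain k-means with k=2 (an assumed stand-in for the unsupervised step)."""
    origin = (0.0,) * len(points[0])
    # Deterministic initialisation: the points nearest to and farthest from
    # perfect agreement with the ground truth.
    centroids = [
        min(points, key=lambda p: dist(p, origin)),
        max(points, key=lambda p: dist(p, origin)),
    ]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Made-up ground-truth record and crowd-extracted article facts.
event = {"casualties": 20, "day": 10}
articles = [
    {"casualties": 20, "day": 10},
    {"casualties": 22, "day": 10},
    {"casualties": 19, "day": 11},
    {"casualties": 80, "day": 14},
    {"casualties": 5, "day": 17},
]

points = [mismatch_features(a, event) for a in articles]
assignment, centroids = two_means(points)

# Assumption: the cluster whose centroid lies closer to zero mismatch is
# treated as consistent with the ground truth.
origin = (0.0, 0.0)
low = min((0, 1), key=lambda k: dist(centroids[k], origin))
labels = ["consistent" if a == low else "inconsistent" for a in assignment]
print(labels)
```

In this toy setup the day gap dominates the distance because the two features are not rescaled; a real pipeline would likely normalise the features and use whatever fields the crowdsourcing task actually collects.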