DivinWD: Exploring the Diversity of Scientific Publications in Wikidata

Authors

  • Zeno Saletti, University of Trento
  • Cristian Consonni, European Commission, Joint Research Centre
  • Pedro Frau-Amar, European Commission, Joint Research Centre
  • Emilia Gómez, European Commission, Joint Research Centre

DOI

https://doi.org/10.1609/icwsm.v20i1.42790

Abstract

Wikidata has emerged as a major open repository of scholarly metadata, yet its characteristics and limitations as a source for diversity analyses are not well documented. We present DivinWD (DIVersity IN WikiData), a curated dataset of more than 23 million triples, produced via a fully open-source processing pipeline to enable the study of diversity in scientific publications represented in Wikidata. The dataset comprises over 1.2 million scholarly articles published between 2010 and 2024. It integrates Wikidata with five external bibliographic sources (Crossref, Dimensions, OpenAlex, Scopus, and Semantic Scholar) and augments the records using the Genderize API, enriching metadata on language, field of study, authorship, gender, geographic origin, and institutional affiliation. Our analysis documents systematic coverage biases and infrastructural artifacts affecting Wikidata's scholarly content, highlighting important considerations for reuse. By releasing the dataset and pipeline, this work provides a transparent foundation for future research on diversity in science and for the development and evaluation of open, reproducible bibliometric indicators.

Published

2026-05-25

How to Cite

Saletti, Z., Consonni, C., Frau-Amar, P., & Gómez, E. (2026). DivinWD: Exploring the Diversity of Scientific Publications in Wikidata. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 2895–2909. https://doi.org/10.1609/icwsm.v20i1.42790