Web2Wiki: Characterizing Wikipedia Linking Across the Web

Authors

  • Veniamin Veselovsky, Princeton University
  • Tiziano Piccardi, Johns Hopkins University
  • Ashton Anderson, University of Toronto
  • Robert West, EPFL
  • Akhil Arora, Aarhus University

DOI:

https://doi.org/10.1609/icwsm.v20i1.42754

Abstract

Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals four key findings: (1) The topics of Wikipedia articles referenced on the Web differ from those cited on Reddit and those prominent within Wikipedia’s own link structure. (2) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (3) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (4) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia’s function as a background knowledge provider. While this study focuses on English Wikipedia, our publicly released WEB2WIKI dataset includes links from multiple language editions, supporting future research on Wikipedia’s global influence on the Web.
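
As a rough illustration of the link-identification step mentioned in the abstract, the sketch below pulls Wikipedia article links out of a single HTML page using only the Python standard library. It is not the authors' pipeline: the class name `WikipediaLinkParser`, the helper `extract_wikipedia_links`, and the matching rule (a host ending in `.wikipedia.org` with a path under `/wiki/`) are assumptions made for this example, and reading pages out of Common Crawl archives is left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class WikipediaLinkParser(HTMLParser):
    """Hypothetical helper: collects anchors that point at Wikipedia articles."""

    def __init__(self):
        super().__init__()
        self.links = []  # (language_edition, article_title) tuples, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        try:
            parsed = urlparse(href)
            host = (parsed.hostname or "").lower()
        except ValueError:
            return  # malformed URL, skip it
        # Assumed matching rule: any "<lang>.wikipedia.org" or
        # "<lang>.m.wikipedia.org" host with an article path under /wiki/.
        if not host.endswith(".wikipedia.org"):
            return
        if not parsed.path.startswith("/wiki/"):
            return
        lang = host.split(".")[0]
        title = parsed.path[len("/wiki/"):]
        if lang != "www" and title:
            self.links.append((lang, title))


def extract_wikipedia_links(html):
    """Return the Wikipedia links found in one HTML document."""
    parser = WikipediaLinkParser()
    parser.feed(html)
    return parser.links


page = (
    '<p>See <a href="https://en.wikipedia.org/wiki/Alan_Turing">Turing</a>, '
    '<a href="https://example.com/wiki/Other">another wiki</a> and '
    '<a href="https://de.m.wikipedia.org/wiki/Berlin">Berlin</a>.</p>'
)
print(extract_wikipedia_links(page))
```

Keeping the language edition next to the article title mirrors the abstract's note that the released dataset covers multiple language editions. Deciding whether a link sits in main content or boilerplate, or whether it is explanatory or evidential, would need further steps that this sketch does not attempt.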

Published

2026-05-25

How to Cite

Veselovsky, V., Piccardi, T., Anderson, A., West, R., & Arora, A. (2026). Web2Wiki: Characterizing Wikipedia Linking Across the Web. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 2345–2357. https://doi.org/10.1609/icwsm.v20i1.42754