Media Cloud 2.0: An Updated Open Web News Archive

Authors

  • Fernando Bermejo, Media Ecosystems Analysis Group
  • Rahul Bhargava, Northeastern University
  • Phil Budne, Media Ecosystems Analysis Group
  • Paige Gulley, Media Ecosystems Analysis Group
  • Evan Leon, Northeastern University
  • Ryan McGrady, University of Massachusetts Amherst
  • Emily Boardman Ndulue, Media Ecosystems Analysis Group
  • Ethan Zuckerman, University of Massachusetts Amherst

DOI:

https://doi.org/10.1609/icwsm.v20i1.42778

Abstract

We present a completely re-engineered Media Cloud, a massive, searchable, open-source archive of digital news sources and content from around the globe. Since its previous presentation at ICWSM in 2021, the Media Cloud team has re-engineered the tool's data collection, storage, and retrieval systems, built a new front-end research interface, grown the archive past 1.8 billion stories, and reprocessed all the content to update the extracted metadata with consistent, modern techniques. In this paper, we document the new system's engineering, characterize the datasets to date, and describe the user-facing tools, which include a Directory of online news sources and a searchable Story Index of global news stories. We discuss the utility of the datasets, how they compare to related work, the challenges of maintaining open research infrastructure, and the research made possible by the datasets and tooling.

Published

2026-05-25

How to Cite

Bermejo, F., Bhargava, R., Budne, P., Gulley, P., Leon, E., McGrady, R., … Zuckerman, E. (2026). Media Cloud 2.0: An Updated Open Web News Archive. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 2735–2746. https://doi.org/10.1609/icwsm.v20i1.42778