Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl

Authors

  • Nick Hagar, Northwestern University
  • Jack Bandy, Transylvania University

DOI

https://doi.org/10.1609/icwsm.v19i1.35948

Abstract

Large language models (LLMs) rely heavily on web-derived training datasets, yet understanding how filtering and curation decisions affect these datasets remains challenging. This paper presents two complementary datasets designed to enable systematic analysis of LLM training data composition. The first dataset captures domain-level statistics across 96 Common Crawl snapshots, providing baseline data about web content distribution before filtering. The second dataset contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), allowing researchers to analyze how different filtering approaches affect content inclusion. By making these datasets publicly available in a consistent format, we aim to (1) facilitate research into training data composition, (2) enable systematic auditing of filtering effects, and (3) support more transparent approaches to dataset development. Our datasets can help researchers investigate questions related to content diversity, source representation, and the impact of different filtering decisions on training data composition. Overall, this work provides a foundation for understanding how curation choices shape the content that ultimately trains widely deployed language models.
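To make the intended use concrete, here is a minimal sketch of the kind of audit the URL dataset enables: counting hostnames per corpus and comparing which domains survive each filtering pipeline. The file names and the `url` column are illustrative assumptions, not the datasets' documented schema or distribution format.

```python
# Hypothetical sketch: compare domain distributions across two corpus URL lists.
# Assumes each corpus is available locally as a CSV with a "url" column; the
# paths below are placeholders.
import csv
from collections import Counter
from urllib.parse import urlparse

def domain_counts(path: str) -> Counter:
    """Count hostnames across all URLs in a CSV file with a 'url' column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            host = urlparse(row["url"]).netloc.lower()
            if host:
                counts[host] += 1
    return counts

c4 = domain_counts("c4_urls.csv")                   # placeholder path
refinedweb = domain_counts("refinedweb_urls.csv")   # placeholder path

# Domains retained by one filtering pipeline but absent from the other.
only_c4 = set(c4) - set(refinedweb)
print(f"{len(only_c4)} domains appear in C4 but not in RefinedWeb")

# Counter intersection keeps the minimum count per shared domain.
print("Top shared domains:", (c4 & refinedweb).most_common(10))
```

The same pattern extends to the domain-level Common Crawl statistics: comparing pre-filtering domain frequencies against a filtered corpus's counts surfaces which sources a given pipeline systematically removes or amplifies.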

Published

2025-06-07

How to Cite

Hagar, N., & Bandy, J. (2025). Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2454–2464. https://doi.org/10.1609/icwsm.v19i1.35948