Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl

Authors

  • Nick Hagar, Northwestern University
  • Jack Bandy, Transylvania University

DOI

https://doi.org/10.1609/icwsm.v19i1.35948

Abstract

Large language models (LLMs) rely heavily on web-derived training datasets, yet understanding how filtering and curation decisions affect these datasets remains challenging. This paper presents two complementary datasets designed to enable systematic analysis of LLM training data composition. The first dataset captures domain-level statistics across 96 Common Crawl snapshots, providing baseline data about web content distribution before filtering. The second dataset contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), allowing researchers to analyze how different filtering approaches affect content inclusion. By making these datasets publicly available in a consistent format, we aim to (1) facilitate research into training data composition, (2) enable systematic auditing of filtering effects, and (3) support more transparent approaches to dataset development. Our datasets can help researchers investigate questions related to content diversity, source representation, and the impact of different filtering decisions on training data composition. Overall, this work provides a foundation for understanding how curation choices shape the content that ultimately trains widely deployed language models.
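To make the intended use concrete, here is a minimal sketch of the kind of audit the URL dataset enables: counting hostnames per corpus and comparing which domains survive each filtering pipeline. The file names and the `url` column are illustrative assumptions, not the datasets' documented schema or distribution format.

```python
# Hypothetical sketch: compare domain distributions across two corpus URL lists.
# Assumes each corpus is available locally as a CSV with a "url" column; the
# paths below are placeholders.
import csv
from collections import Counter
from urllib.parse import urlparse

def domain_counts(path: str) -> Counter:
    """Count hostnames across all URLs in a CSV file with a 'url' column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            host = urlparse(row["url"]).netloc.lower()
            if host:
                counts[host] += 1
    return counts

c4 = domain_counts("c4_urls.csv")                   # placeholder path
refinedweb = domain_counts("refinedweb_urls.csv")   # placeholder path

# Domains retained by one filtering pipeline but absent from the other.
only_c4 = set(c4) - set(refinedweb)
print(f"{len(only_c4)} domains appear in C4 but not in RefinedWeb")

# Counter intersection keeps the minimum count per shared domain.
print("Top shared domains:", (c4 & refinedweb).most_common(10))
```

The same pattern extends to the domain-level Common Crawl statistics: comparing pre-filtering domain frequencies against a filtered corpus's counts surfaces which sources a given pipeline systematically removes or amplifies.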

Published

2025-06-07

How to Cite

Hagar, N., & Bandy, J. (2025). Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2454–2464. https://doi.org/10.1609/icwsm.v19i1.35948