Hagar, Nick, and Jack Bandy. “Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl”. Proceedings of the International AAAI Conference on Web and Social Media 19, no. 1 (June 7, 2025): 2454–2464. Accessed May 9, 2026. https://ojs.aaai.org/index.php/ICWSM/article/view/35948.