Hagar, Nick, and Jack Bandy. “Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl.” Proceedings of the International AAAI Conference on Web and Social Media, vol. 19, no. 1, June 2025, pp. 2454-56, doi:10.1609/icwsm.v19i1.35948.