SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata
DOI:
https://doi.org/10.1609/aies.v7i1.31643
Abstract
Decisions about how to responsibly collect, use, and document data often rely upon understanding how people are represented in data. Yet the unlabeled nature and scale of data used in foundation model development pose a direct challenge to systematic analyses of downstream risks, such as representational harms. We provide a framework designed to help responsible AI (RAI) practitioners more easily plan and structure analyses of how people are represented in unstructured data and identify downstream risks. The framework is organized into groups of analyses that map to three basic questions: (1) Who is represented in the data? (2) What content is in the data? (3) How are the two associated? We use the framework to analyze human representation in two commonly used datasets: the Common Crawl web corpus (C4) of 356 billion tokens, and the LAION-400M dataset of 400 million text-image pairs, both developed in the English language. We illustrate how the framework informs action steps for hypothetical teams faced with data use, development, and documentation decisions. Ultimately, the framework structures human representation analyses and maps out analysis planning considerations, goals, and risk mitigation actions at different stages of dataset and model development.
Published
2024-10-16
How to Cite
Diaz, M., Dev, S., Reif, E., Denton, E., & Prabhakaran, V. (2024). SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7(1), 371-383. https://doi.org/10.1609/aies.v7i1.31643
Section
Full Archival Papers