SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata

Mark Diaz; Sunipa Dev; Emily Reif; Emily Denton; Vinodkumar Prabhakaran

doi:10.1609/aies.v7i1.31643

Authors

Mark Diaz Google Research
Sunipa Dev Google Research
Emily Reif Google Research
Emily Denton Google Research
Vinodkumar Prabhakaran Google Research

DOI:

https://doi.org/10.1609/aies.v7i1.31643

Abstract

Decisions about how to responsibly collect, use, and document data often rely upon understanding how people are represented in data. Yet the unlabeled nature and scale of data used in foundation model development pose a direct challenge to systematic analyses of downstream risks, such as representational harms. We provide a framework designed to help responsible AI (RAI) practitioners more easily plan and structure analyses of how people are represented in unstructured data and identify downstream risks. The framework is organized into groups of analyses that map to three basic questions: 1) Who is represented in the data? 2) What content is in the data? 3) How are the two associated? We use the framework to analyze human representation in two commonly used datasets: the Common Crawl web corpus (C4) of 356 billion tokens, and the LAION-400M dataset of 400 million text-image pairs, both developed in the English language. We illustrate how the framework informs action steps for hypothetical teams faced with data use, development, and documentation decisions. Ultimately, the framework structures human representation analyses and maps out analysis planning considerations, goals, and risk mitigation actions at different stages of dataset and model development.
