SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata

Authors

  • Mark Diaz, Google Research
  • Sunipa Dev, Google Research
  • Emily Reif, Google Research
  • Emily Denton, Google Research
  • Vinodkumar Prabhakaran, Google Research

Abstract

Decisions about how to responsibly collect, use, and document data often rely on understanding how people are represented in that data. Yet the unlabeled nature and scale of data used in foundation model development pose a direct challenge to systematic analyses of downstream risks, such as representational harms. We provide a framework designed to help responsible AI (RAI) practitioners more easily plan and structure analyses of how people are represented in unstructured data and identify downstream risks. The framework is organized into groups of analyses that map to three basic questions: 1) Who is represented in the data? 2) What content is in the data? 3) How are the two associated? We use the framework to analyze human representation in two commonly used datasets: the Common Crawl web corpus (C4) of 356 billion tokens, and the LAION-400M dataset of 400 million text-image pairs, both developed in the English language. We illustrate how the framework informs action steps for hypothetical teams faced with data use, development, and documentation decisions. Ultimately, the framework structures human representation analyses and maps out analysis planning considerations, goals, and risk mitigation actions at different stages of dataset and model development.

Published

2024-10-16