Statewise: Human Identity Investigator for the United States

Authors

  • Dakota Handzlik, Stony Brook University
  • Jason Jeffrey Jones, Stony Brook University
  • Steven S. Skiena, Stony Brook University

DOI:

https://doi.org/10.1609/icwsm.v19i1.35949

Abstract

Self-reported biographical strings on social media profiles provide a powerful tool for studying personal identity. We present Statewise, a dataset based on 50 million unique Twitter user profiles, observed over a 12-year period and identified as being in the United States. Users within this dataset can be accurately partitioned into 52 states/territories at each observation, allowing queries into state-specific language choices over time. We report on the major design decisions underlying Statewise, including the methodology behind the location detection system and measurements of user/state transitions across time. We demonstrate the power of Statewise to study the relative prevalences of different token groups, showing clear and consistent regional differences in language usage. We analyze emoji usage by comparing inclusion rates against external state-level statistics, finding that emoji inclusion correlates significantly with state unemployment and poverty rates. Finally, we use Gini coefficients as a measure of token usage inequality across all observed territories and demonstrate a clear stratification based on token content.
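
As an illustration of the inequality measure mentioned in the abstract, the following is a minimal sketch of the standard sample Gini coefficient applied to per-state usage rates for one token. The abstract does not give the paper's exact formulation, so the function name `gini`, the rank-based formula chosen here, and the example rates are all assumptions for illustration, not the authors' implementation or data.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers.

    Uses the rank-based formula on the ascending-sorted values x_1..x_n:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    0 means perfectly equal usage; (n - 1) / n is the maximum for n values.
    """
    xs = sorted(float(v) for v in values)
    if any(x < 0 for x in xs):
        raise ValueError("Gini coefficient requires non-negative values")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # No usage anywhere: treat as perfectly equal.
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical usage rates of a single token in four regions (made-up numbers).
evenly_used = [0.02, 0.02, 0.02, 0.02]
concentrated = [0.0, 0.0, 0.0, 0.08]

print(gini(evenly_used))
print(gini(concentrated))
```

Under this reading, a token used at similar rates everywhere scores near 0, while a token concentrated in a few states/territories scores closer to 1.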

Published

2025-06-07

How to Cite

Handzlik, D., Jones, J. J., & Skiena, S. S. (2025). Statewise: Human Identity Investigator for the United States. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2465–2476. https://doi.org/10.1609/icwsm.v19i1.35949