Statewise: Human Identity Investigator for the United States
DOI:
https://doi.org/10.1609/icwsm.v19i1.35949
Abstract
Self-reported biographical strings on social media profiles provide a powerful tool for studying personal identity. We present Statewise, a dataset based on 50 million unique Twitter user profiles, observed over a 12-year period and identified as being in the United States. Users within this dataset can be accurately partitioned into 52 states/territories at each observation, allowing queries into state-specific language choices over time. We report on the major design decisions underlying Statewise, including the methodology behind the location detection system and measurements of user/state transitions across time. We demonstrate the power of Statewise to study the relative prevalence of different token groups, showing clear and consistent regional differences in language usage. We analyze emoji usage by comparing inclusion rates against external state-level statistics, finding that emoji inclusion is significantly correlated with state unemployment and poverty rates. Finally, we use Gini coefficients as a measure of token usage inequality across all observed territories and demonstrate a clear stratification based on token content.
Published
2025-06-07
How to Cite
Handzlik, D., Jones, J. J., & Skiena, S. S. (2025). Statewise: Human Identity Investigator for the United States. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2465–2476. https://doi.org/10.1609/icwsm.v19i1.35949
Section
Dataset Papers