Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario

Authors

  • Karim Lasri (The World Bank; Ecole Normale Superieure)
  • Manuel Tonneau (The World Bank; University of Oxford)
  • Haaya Naushan (The World Bank)
  • Niyati Malhotra (The World Bank)
  • Ibrahim Farouq (The World Bank; Universiti Sultan Zainal Abidin)
  • Víctor Orozco-Olvera (The World Bank)
  • Samuel Fraiberger (The World Bank; New York University; Massachusetts Institute of Technology)

DOI:

https://doi.org/10.1609/icwsm.v17i1.22165

Keywords:

Text categorization; topic recognition; demographic/gender/age identification, Social network analysis; communities identification; expertise and authority discovery

Abstract

Characterizing the demographics of social media users enables a diversity of applications, from better targeting of policy interventions to the derivation of representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging, as labeled data is often scarce. Alternatively, rule-based matching strategies provide well-grounded information but only offer partial coverage over users. It is unclear, therefore, what features and models are best suited to maximize coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference by relying on minimal labeling efforts. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, namely Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches based purely on user content that lack graph information. Notably, we find that LP achieves comparable performance to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population.
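
As a rough illustration of the graph-based idea described in the abstract, the sketch below propagates a handful of seed labels (standing in for labels obtained by name matching) to unlabeled neighbors in a small undirected graph. It is a generic, minimal form of label propagation written for this page, not the authors' implementation: the function names, the toy graph, the placeholder labels "A" and "B", and the fixed iteration count are all assumptions.

```python
from collections import defaultdict


def label_propagation(edges, seeds, n_iter=20):
    """Spread seed labels over an undirected graph (hypothetical helper).

    edges: iterable of (u, v) node pairs
    seeds: dict mapping node -> label; seed labels stay fixed (clamped)
    Returns a dict mapping node -> {label: score}.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = sorted(set(seeds.values()))
    # Unlabeled nodes start with all-zero scores; seeds start one-hot.
    dist = {n: {l: 0.0 for l in labels} for n in neighbors}
    for n, l in seeds.items():
        dist.setdefault(n, {k: 0.0 for k in labels})[l] = 1.0

    for _ in range(n_iter):
        new = {}
        for n in dist:
            nb = neighbors.get(n, set())
            if n in seeds or not nb:
                new[n] = dist[n]  # keep seeds and isolated nodes unchanged
                continue
            # Average the neighbors' scores, then renormalize to sum to 1.
            avg = {l: sum(dist[m][l] for m in nb) / len(nb) for l in labels}
            total = sum(avg.values())
            new[n] = {l: (avg[l] / total if total else 0.0) for l in labels}
        dist = new
    return dist


def predict(dist):
    """Pick the highest-scoring label for every node that received any signal."""
    return {n: max(d, key=d.get) for n, d in dist.items() if any(d.values())}


# Toy graph: two disconnected chains, one seed label in each.
toy_edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5"), ("u5", "u6")]
toy_seeds = {"u1": "A", "u6": "B"}
toy_pred = predict(label_propagation(toy_edges, toy_seeds))
```

In this toy setup every node in a chain ends up with the label of the single seed it is connected to. On a real follower or interaction graph the scores would be mixed, and a confidence threshold on the top score is one plausible way to trade coverage against accuracy.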

Published

2023-06-02

How to Cite

Lasri, K., Tonneau, M., Naushan, H., Malhotra, N., Farouq, I., Orozco-Olvera, V., & Fraiberger, S. (2023). Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario. Proceedings of the International AAAI Conference on Web and Social Media, 17(1), 519-529. https://doi.org/10.1609/icwsm.v17i1.22165