Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario

Karim Lasri; Manuel Tonneau; Haaya Naushan; Niyati Malhotra; Ibrahim Farouq; Víctor Orozco-Olvera; Samuel Fraiberger

doi:10.1609/icwsm.v17i1.22165

Authors

Karim Lasri (The World Bank; Ecole Normale Superieure)
Manuel Tonneau (The World Bank; University of Oxford)
Haaya Naushan (The World Bank)
Niyati Malhotra (The World Bank)
Ibrahim Farouq (The World Bank; Universiti Sultan Zainal Abidin)
Víctor Orozco-Olvera (The World Bank)
Samuel Fraiberger (The World Bank; New York University; Massachusetts Institute of Technology)

Keywords:

Text categorization; topic recognition; demographic/gender/age identification, Social network analysis; communities identification; expertise and authority discovery

Abstract

Characterizing the demographics of social media users enables a diversity of applications, from better targeting of policy interventions to the derivation of representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging as labeled data is often scarce. Alternatively, rule-based matching strategies provide well-grounded information but only offer partial coverage over users. It is unclear, therefore, which features and models are best suited to maximize coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference by relying on minimal labeling efforts. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, namely Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches that rely purely on user content and lack graph information. Notably, we find that LP achieves comparable performance to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population.
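
To make the idea of propagating a small set of seed labels over a user graph concrete, the following is a minimal sketch of a generic clamped label propagation scheme on a toy graph. It is not the authors' implementation: the abstract does not specify the graph construction, the LP variant, or any hyperparameters, so the function name `label_propagation`, the six-node graph, the two classes, and the iteration count are all hypothetical. The two seed nodes stand in for users whose label would come from a rule such as name matching.

```python
import numpy as np


def label_propagation(adj, seed_labels, n_classes, n_iter=50):
    """Propagate seed labels over an undirected graph (illustrative sketch).

    adj: (n, n) symmetric 0/1 adjacency matrix.
    seed_labels: length-n integer array; class index for seeds, -1 if unlabeled.
    Returns an (n, n_classes) array of class scores per node.
    """
    adj = np.asarray(adj, dtype=float)
    seed_labels = np.asarray(seed_labels)
    n = adj.shape[0]

    # Row-normalize so that each node averages its neighbors' scores.
    degree = adj.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # isolated nodes simply keep zero scores
    transition = adj / degree

    # One-hot scores for seed nodes, zeros for everyone else.
    is_seed = seed_labels >= 0
    seeds = np.zeros((n, n_classes))
    seeds[is_seed, seed_labels[is_seed]] = 1.0

    scores = seeds.copy()
    for _ in range(n_iter):
        scores = transition @ scores      # spread scores along edges
        scores[is_seed] = seeds[is_seed]  # clamp seeds to their known label
    return scores


# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = np.zeros((6, 6))
for a, b in edges:
    adj[a, b] = adj[b, a] = 1.0

# Only nodes 0 and 5 carry a seed label (classes 0 and 1 respectively).
seed_labels = np.array([0, -1, -1, -1, -1, 1])

scores = label_propagation(adj, seed_labels, n_classes=2)
predictions = scores.argmax(axis=1)
print(predictions)
```

In this toy case each unlabeled node ends up with the label of the seed in its own triangle, which illustrates how a purely graph-based method can extend partial rule-based coverage to the remaining users without using any content features. A GCN would additionally take per-node feature vectors as input, which this sketch deliberately leaves out.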
