A Comparative Study of Demographic Attribute Inference in Twitter
Social media platforms have become a major gateway for receiving and analyzing public opinion. Understanding users can provide invaluable context for their social media posts and significantly improve traditional opinion analysis models. Demographic attributes, such as ethnicity, gender, and age, among others, have been extensively applied to characterize social media users. While studies have shown that user groups formed by demographic attributes can hold coherent opinions on political issues, these attributes are often not explicitly stated by users in their profiles. Previous work has demonstrated the effectiveness of different user signals, such as users’ posts and names, in determining demographic attributes. Yet these efforts mostly evaluate linguistic signals from users’ posts and train models on artificially balanced datasets. In this paper, we propose a comprehensive list of user signals: self-descriptions and posts aggregated from users’ friends and followers, users’ profile images, and users’ names. We provide a side-by-side comparative study of these signals on the tasks of inferring three major demographic attributes, namely ethnicity, gender, and age. We use realistic unbalanced datasets, whose demographic makeup is similar to that of Twitter, for model training and evaluation experiments. Our experiments indicate that self-descriptions provide the strongest signal for ethnicity and age inference and clearly improve overall performance when combined with tweets. For gender inference, profile images achieve the highest precision score, with an overall score close to the best result in our setting. This suggests that signals in self-descriptions and profile images have the potential to facilitate demographic attribute inference in Twitter and are promising for future investigation.