Inferring Gender from the Content of Tweets: A Region-Specific Example
Keywords: social media, Twitter, gender, demography
There is growing interest in using social networking sites such as Twitter to gather real-time data on the reactions and opinions of a region's population, including in parts of the developing world where social media has played an important role in recent events such as the 2011 Arab Spring. However, opinions and reactions may differ significantly within a given region depending on the demographics of the subpopulation, including such categories as gender and ethnicity. Unfortunately, the demographic characteristics of social media users are often unknown because such categories are not always captured in user metadata. Twitter, for example, does not capture a user’s gender in their profile, and inferring gender from first names is difficult since Twitter users are not required to give their real names. There is thus a need for automated methods that can infer such hidden attributes of users from other data sources. In this paper we describe a method to infer the gender of Twitter users from only the content of their tweets. Looking at Twitter users from the West African nation of Nigeria, we applied supervised machine learning, training a classifier on features derived from the content of user tweets. Using unigram features alone, we obtained an accuracy of 80% for predicting gender, suggesting that content alone can be a good predictor of gender. An analysis of the highest-weighted features shows some interesting topical and emotional distinctions between men and women. We argue that approaches such as the one described here can give us a clearer picture of who is using social media when certain user attributes are unreliable or unavailable.
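To make the general idea concrete, the following is a minimal sketch of a linear classifier trained on unigram features, followed by an inspection of its highest-weighted features. It is an illustration under stated assumptions, not the method used in this paper: the abstract does not name the learning algorithm, so a simple perceptron is swapped in here, the whitespace tokenizer is a stand-in for real tweet preprocessing, and the training texts and the labels +1 and -1 are hypothetical placeholders rather than real data.

```python
from collections import defaultdict


def unigrams(text):
    """Lowercase whitespace tokenization (an assumed stand-in for tweet tokenization)."""
    return text.lower().split()


def train_perceptron(examples, epochs=10):
    """Train a perceptron on (text, label) pairs, where label is +1 or -1.

    Returns a dict mapping each unigram to its learned weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            tokens = unigrams(text)
            score = sum(weights[tok] for tok in tokens)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Move the weight of every unigram in the text toward the true label.
                for tok in tokens:
                    weights[tok] += label
    return weights


def predict(weights, text):
    """Return +1 if the summed unigram weights are positive, otherwise -1."""
    score = sum(weights.get(tok, 0.0) for tok in unigrams(text))
    return 1 if score > 0 else -1


def accuracy(weights, examples):
    """Fraction of (text, label) pairs that the classifier labels correctly."""
    correct = sum(1 for text, label in examples if predict(weights, text) == label)
    return correct / len(examples)


def top_features(weights, k=2):
    """Return the k most negative and the k most positive unigrams by weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    return ranked[:k], ranked[-k:]


# Hypothetical placeholder data: the tokens carry no meaning, and +1 / -1
# simply stand for the two classes a labeled training set would provide.
TOY_EXAMPLES = [
    ("alpha beta gamma", 1),
    ("alpha delta", 1),
    ("epsilon zeta gamma", -1),
    ("epsilon eta", -1),
]

weights = train_perceptron(TOY_EXAMPLES)
most_negative, most_positive = top_features(weights)
```

In practice, accuracy would be measured on held-out users rather than on the training examples, and the ranked weights from `top_features` are what one would read to see which unigrams most strongly indicate each class.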