Distinguishing the Wood from the Trees: Contrasting Collection Methods to Understand Bias in a Longitudinal Brexit Twitter Dataset
Various methods can be used for searching or streaming Twitter data to gather a sample on a specific topic. All of these methods introduce a bias into the resulting datasets. Here we examine, and try to define, the bias that the different strategies introduce. Understanding the bias means that we can extrapolate wider meaning from the data in a more precise manner. We use datasets collected on topics from the UK-EU Brexit referendum conducted in 2016. Each dataset discussed draws data from Twitter over a twelve-month period, from 1st September 2015 until 31st August 2016. Three data collection strategies are considered: collecting on human defined topic specific hashtags; collecting using a semi-automated technique to identify topic terms which are then used to collect tweets; and collecting from predefined users known to be tweeting on the topic. To investigate bias in the data we look at, and find wide variation in: group level metadata attributes such as size of the dataset; number of users in each set; average numbers of friends and followers; likely re-tweet status; and levels of inclusion of various add-ons such as hashtags, URLs and media. We also find that relevance to the topic differs between the sets; being far higher in the known users set. We investigate how readability of tweets within each set varies, particularly between known users and topic term sets. We also find that there is a surprising lack of overlap in the data obtained using different collection methods.