TwitchChat: A Dataset for Exploring Livestream Chat
Most natural language processing research focuses on modelling and understanding text formed of complete sentences with correct spelling and grammar. Livestream chat is drastically different. Viewers typically write short messages in response to in-stream events, often with incorrect grammar and many repeated tokens. Additionally, tokens that are commonly used in livestream chat are unknown to traditional language understanding efforts that focus on prosaic text. To advance and encourage further research on livestream chat understanding, we present a large-scale dataset of video game livestream chat consisting of over 60 million tokens. As livestreaming becomes more popular, it is also increasingly pertinent to study, through chat analysis, the way in which the audience engages with the stream. However, this is not a straightforward task: livestream chat is a rich and complex domain, far removed from the often-studied prosaic text. Additionally, we provide a case study analysis of word vector methods applied to the dataset, showing that the vector space is strangely shaped but clusterable and that the resulting clusters correlate with features such as streamer popularity. Furthermore, human relatedness tests highlight the difference that this domain poses with respect to prosaic text. It is hoped that the livestream chat dataset, the discussion of its unique features, and the challenges highlighted for future work will invigorate the research community into further study of livestream chat.