Efficient Clustering of Short Messages into General Domains

Authors

  • Oren Tsur The Hebrew University
  • Adi Littman The Hebrew University
  • Ari Rappoport The Hebrew University

DOI:

https://doi.org/10.1609/icwsm.v7i1.14420

Keywords:

Twitter, clustering, micro messages, big data, scale

Abstract

The ever-increasing activity in social networks is mainly manifested by a growing stream of status updates, or microblogging. This massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Traditional clustering techniques are both inaccurate and inefficient on such messages because of their sparseness. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinct stages: (1) batch clustering of user-annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of 'tagging', common in social media streams (e.g. hashtags), so that the algorithm can bootstrap on the tags to cluster a large pool of hashtagged tweets. The stable clusters achieved in the first stage lend themselves to online clustering of a stream of (mostly) tagless messages.
We evaluate our results against gold-standard classification and validate the results by employing multiple clustering evaluation measures (information theoretic, paired, F and greedy). We compare our algorithm to a number of other clustering algorithms and various types of feature sets. Results show that the algorithm presented is both accurate and efficient and can be easily used for large scale clustering of sparse messages as the heavy lifting is achieved on a sublinear number of documents.
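
The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify features, similarity measure, or clustering procedure, so the bag-of-words vectors, cosine similarity, greedy threshold clustering, the threshold value of 0.3, and all function names (batch_stage, online_stage, and the helpers) are assumptions made for illustration. The sketch pools hashtagged tweets into one "virtual document" per hashtag, clusters those few documents in a batch step, and then assigns untagged tweets to the nearest resulting cluster.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercased word tokens with hashtags removed (assumed feature set)."""
    return [w for w in re.findall(r"#?\w+", text.lower()) if not w.startswith("#")]


def hashtags(text):
    """All hashtags in a tweet, lowercased."""
    return re.findall(r"#\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def batch_stage(tagged_tweets, threshold=0.3):
    """Stage 1 (assumed form): cluster hashtags, not individual tweets."""
    # Pool every tweet carrying a hashtag into that hashtag's virtual document.
    tag_docs = defaultdict(Counter)
    for tweet in tagged_tweets:
        for tag in hashtags(tweet):
            tag_docs[tag].update(tokens(tweet))

    # Greedy clustering: join the most similar cluster or open a new one.
    clusters = []
    for tag, vec in sorted(tag_docs.items()):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["tags"].add(tag)
            best["centroid"].update(vec)
        else:
            clusters.append({"tags": {tag}, "centroid": Counter(vec)})
    return clusters


def online_stage(tweet, clusters):
    """Stage 2 (assumed form): index of the nearest cluster, or None."""
    if not clusters:
        return None
    vec = Counter(tokens(tweet))
    scores = [cosine(vec, c["centroid"]) for c in clusters]
    best = max(range(len(clusters)), key=scores.__getitem__)
    return best if scores[best] > 0 else None


# Toy data, invented for this sketch.
tagged = [
    "playoff game tonight #nba",
    "playoff game win #basketball",
    "new phone review #tech",
    "phone battery review #gadgets",
]
clusters = batch_stage(tagged)
print([sorted(c["tags"]) for c in clusters])
print(online_stage("who won the game", clusters))
```

In this sketch the batch step only ever compares hashtag-level documents, whose number is typically far smaller than the number of tweets, while each streamed tweet needs just one comparison per cluster. That is one plausible reading of how the expensive work can be confined to a small set of documents; the paper itself should be consulted for the actual method.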

Published

2021-08-03

How to Cite

Tsur, O., Littman, A., & Rappoport, A. (2021). Efficient Clustering of Short Messages into General Domains. Proceedings of the International AAAI Conference on Web and Social Media, 7(1), 621-630. https://doi.org/10.1609/icwsm.v7i1.14420