Clustering - What Both Theoreticians and Practitioners Are Doing Wrong
Keywords:clustering, theory, practice, bias, challenges
Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, is very rudimentary. This note focuses on clustering. The first challenge I address is model selection---how should a user pick an appropriate clustering tool for a given clustering problem, and how should the parameters of such an algorithmic tool be tuned? In contrast with other common computational tasks, for clustering, different algorithms often yield drastically different outcomes. Therefore, the choice of a clustering algorithm may play a crucial role in the usefulness of an output clustering solution. However, currently there exists no methodical guidance for clustering tool selection for a given clustering task. I argue the severity of this problem and describe some recent proposals aiming to address this crucial lacuna.