Approximate K-Means++ in Sublinear Time

Authors

  • Olivier Bachem Eidgenössische Technische Hochschule Zürich (ETH Zurich)
  • Mario Lucic Eidgenössische Technische Hochschule Zürich (ETH Zurich)
  • S. Hamed Hassani Eidgenössische Technische Hochschule Zürich (ETH Zurich)
  • Andreas Krause Eidgenössische Technische Hochschule Zürich (ETH Zurich)

DOI:

https://doi.org/10.1609/aaai.v30i1.10259

Keywords:

Clustering, K-Means, Large-scale machine learning, Markov Chain Monte Carlo, approximate sampling

Abstract

The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude.
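To make the idea concrete, the following is a minimal Python sketch of MCMC-based approximate D²-sampling: a Metropolis-Hastings chain with a uniform proposal whose stationary distribution is proportional to the squared distance to the current centers, so each draw touches only a short chain of points rather than the whole dataset. The function names, chain length, and overall structure are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def d2_mcmc_sample(X, centers, chain_length=200, rng=None):
    """Approximate one D^2-sampling draw via Metropolis-Hastings.

    Instead of computing d(x, C)^2 for every point (a full pass over
    the data), run a short Markov chain with a uniform proposal whose
    stationary distribution is proportional to d(x, C)^2.
    """
    rng = np.random.default_rng(rng)
    n = len(X)

    def d2(i):
        # squared distance from X[i] to its closest current center
        return np.min(((centers - X[i]) ** 2).sum(axis=1))

    x = rng.integers(n)          # start the chain at a uniform random point
    dx = d2(x)
    for _ in range(chain_length):
        y = rng.integers(n)      # uniform proposal
        dy = d2(y)
        # Metropolis acceptance ratio for target p(i) proportional to d(i, C)^2
        if dx == 0 or rng.random() < dy / dx:
            x, dx = y, dy
    return x

def approximate_kmeanspp(X, k, chain_length=200, seed=None):
    """Seed k centers: the first uniformly at random, the rest via
    MCMC-approximated D^2-sampling (instead of exact k-means++)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        idx = d2_mcmc_sample(X, np.array(centers), chain_length, rng)
        centers.append(X[idx])
    return np.array(centers)
```

Because each of the k seeding steps evaluates distances for only `chain_length` points, the total cost is independent of the dataset size n once the data is loaded, which is the source of the sublinear behaviour described in the abstract.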

Published

2016-02-21

How to Cite

Bachem, O., Lucic, M., Hassani, S. H., & Krause, A. (2016). Approximate K-Means++ in Sublinear Time. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10259

Section

Technical Papers: Machine Learning Methods