Corelab Seminar

Loukas Kavouras (NTUA)
Algorithms for the k-means Clustering Problem

Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters. It is a main task of data mining, machine learning and computational geometry. In this presentation, we discuss famous clustering problems and we emphasize on the k-means clustering problem, where one seeks to partition n observations into k clusters so as to minimize the within-cluster sum of squares. We present Lloyd's algorithm for the k-means problem, which was identified as one of the top 10 algorithms in data mining. Although Lloyd's algorithm has an exponential running time in the worst case, it usually runs fast in many practical applications. However, the algorithm gives no guarantees and there are natural examples where it may produce arbitrarily bad clusterings. k-means++ algorithm addresses this problem by augmenting Lloyd's algorithm with a simple and intuitive seeding technique. A formal proof shows that k-means++ algorithm is O(log k) competitive. Next, we consider cases where the entire input is not available from the beginning. That is, we examine an algorithm for k-means in the streaming model, where the data is too large to be stored in main memory and must be accessed sequentially.