This is a faster implementation of the k-means algorithm, achieved by vectorizing its core computations. The algorithm proceeds as follows:
1. Decide on a value for k.
2. Initialize the k cluster centers.
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise, go to step 3.
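As a rough sketch (not the actual code of this project), the loop above can be vectorized with NumPy; the function name `kmeans` and the convergence cap `max_iter` are illustrative choices:

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's iteration. X: (N, d) data, centers: (k, d) initial centroids."""
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Pairwise squared Euclidean distances, shape (N, k), in one broadcast.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Exit when no object changed membership in the last iteration.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-estimate each cluster center as the mean of its members.
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels
```

The assignment step is a single broadcasted distance computation rather than a Python loop over points, which is where most of the speed-up comes from.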
The cluster centers are initialized using the k-means++ scheme:

1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point is directly proportional to its distance from the nearest previously chosen centroid (i.e. the point with the maximum distance from the nearest centroid is the most likely to be selected next).
4. Repeat steps 2 and 3 until k centroids have been sampled.
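A minimal sketch of this initialization, again hypothetical rather than the project's actual code (note that the canonical k-means++ recipe weights by *squared* distance, which is what is used below):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Sample k initial centroids from the rows of X, k-means++ style."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: squared distance of each point to its nearest chosen centroid.
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next centroid with probability proportional
        # to that distance (already-chosen points have probability 0).
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Because an already-chosen centroid has distance 0 to itself, it can never be drawn again, so the k sampled centroids are distinct as long as the data points are.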
scikit-learn's implementation takes approximately 4.5 seconds (on my machine).
Our implementation takes approximately 1.7 seconds (on my machine).
The speed difference arises because our implementation is specialized to Euclidean distance, unlike the existing implementation.
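One reason specializing to Euclidean distance can be fast (an assumption about this implementation, shown here only as an illustration) is the algebraic expansion ||x − c||² = ||x||² − 2·x·c + ||c||², which turns the whole distance matrix into a single matrix multiplication:

```python
import numpy as np

def pairwise_sq_dists(X, C):
    """Squared Euclidean distances between rows of X (N, d) and C (k, d)."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed with one matmul
    # instead of materializing an (N, k, d) broadcasted intermediate.
    x2 = (X ** 2).sum(axis=1)[:, None]   # (N, 1)
    c2 = (C ** 2).sum(axis=1)[None, :]   # (1, k)
    return x2 - 2.0 * (X @ C.T) + c2     # (N, k)
```

This trick only works for Euclidean (and related) metrics, which is exactly the kind of specialization a general-purpose library cannot always exploit.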