Skip to content

Implementation of the kmeans-clustering algorithm using multithreading, both in C++ and Python

Notifications You must be signed in to change notification settings

lucacarniato/k-means-clustering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Multi process K-Means clustering

This project implements the KMeans clustering algorithm using multiple processes. The following steps are the following:

  • Initialize the centroids for the predefined number of clusters.
  • Calculate the distances of each point from the centroids and assign each point to the closest centroid. All points assigned to the same centroid represent a cluster. This part of the algorithm is the most computing intensive and was parallelized.
  • Updated the coordinates of the centroids with the mean x and y values of all points in the cluster.
  • Iterate from step 2 to 4 until the position of the centroids does not change or the maximum number of iterations is reached.

Centroid initialization follows the k-means++ algorithm (http://ilpubs.stanford.edu:8090/778/), where the centroids positions are initialized as far as possible.

The multiprocess implementation does not scale well, because large arrays between Python processes needs to be serialized into strings of bytes using pickle.

alt text

Acceleration using an external C++ library for distance calculation

To accelerate the algorithm, the distances from the centroids can be calculated in the C++ function "distance_calculator". This function takes the addresses of the numpy arrays and performs the calculations of the distances. The use of a C++ method to calculate the distance gives a considerable acceleration (10 times faster).

License

This project is licensed under the MIT License - see the LICENSE.md file for details

About

Implementation of the kmeans-clustering algorithm using multithreading, both in C++ and Python

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published