Kmeans-using-MapReduce

Implemented the K-means clustering algorithm using the MapReduce framework, coordinating mappers for parallel processing and reducers for result aggregation. Integrated iterative detection to halt processing upon convergence and ensured fault tolerance by reassigning tasks during failures, achieving a 99% task completion rate.

Tech Stack

Implementation Details

Master
- Responsible for running and communicating with other components in the system.
- Requires the following parameters as input:
  - Number of mappers (M)
  - Number of reducers (R)
  - Number of centroids (K)
  - Number of iterations for K-Means (program stops if the algorithm converges before)
  - Other necessary information
Input Split (invoked by master)
- The master divides the input data into smaller chunks for parallel processing by mappers.
- Each mapper processes a different chunk, with the master attempting to split the data equally based on the number of mappers.
Map (invoked by mapper)
- Applies the Map function to each input split to generate intermediate key-value pairs.
- Each mapper reads its assigned input split independently.
- Inputs to the Map function:
  - Input split assigned by the master
  - List of centroids from the previous iteration
  - Other necessary information
- Output from the Map function:
  - Key: Index of the nearest centroid to the data point
  - Value: The data point itself
- The output is passed to the partition function and written to a partition file in the mapper’s directory.
- Mappers run in parallel as separate processes.
Partition (invoked by mapper)
- Partitions the output of the Map function into smaller partitions.
- Ensures:
  - All key-value pairs for the same key go to the same partition file.
  - Different keys are distributed equally among partitions (e.g., using key % num_reducers).
- Each mapper creates M*R partitions in total.
Shuffle and Sort (invoked by reducer)
- Sorts intermediate key-value pairs by key and groups values belonging to the same key.
- Sends key-value pairs to reducers based on the key.
Reduce (invoked by reducer)
- Receives intermediate key-value pairs, applies shuffle & sort, and produces final key-value pairs.
- Input to the reduce function:
  - Key: Centroid ID
  - Value: List of all data points belonging to this centroid ID
  - Other necessary information
- Output of the reduce function:
  - Key: Centroid ID
  - Value: Updated Centroid
- Each reducer runs as a separate process and writes output to its directory.
Centroid Compilation (invoked by master)
- The master compiles the final list of K centroids from reducer outputs and stores it in a file.
- Before the first iteration, centroids should be randomly selected from input data points.
- The master uses gRPC calls to read output files from reducers, ensuring efficient data handling.
Fault Tolerance and RPCs
- If a mapper or reducer fails, the master re-runs that task to ensure computations finish.
- gRPC communication between components includes:
  - Master ⇔ Mapper (master sends parameters to mapper)
  - Master ⇔ Reducer (master invokes reducers after all mappers return)
  - Reducer ⇔ Mapper (reducers communicate with mappers for input before shuffle, sort, and reduce)
  - Master ⇔ Reducer (master contacts reducers to read output data files after completion)
Print Statements & Log Generation
- During the execution of the master program for each iteration, key data is printed and logged, including:
  - Iteration number
  - Execution of gRPC calls to Mappers or Reducers
  - gRPC responses for each Mapper and Reducer function (SUCCESS/FAILURE)
  - Centroids generated after each iteration (including the randomly initialized centroids)

Installation

Clone the repo

  git clone https://github.com/aryanGupta-09/MapReduce.git

Go to the project directory

  cd MapReduce

Generate the Python code for gRPC

  python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. kmeans.proto

Update Input/points.txt
Initialize the number of mappers and run mapper.py
Initialize the number of reducers and run reducer.py
Initialise the number of centroids, mappers and reducers, and run master.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kmeans-using-MapReduce

Tech Stack

Implementation Details

Installation

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Input		Input
README.md		README.md
kmeans.proto		kmeans.proto
mapper.py		mapper.py
master.py		master.py
reducer.py		reducer.py

aryanGupta-09/Kmeans-using-MapReduce

Folders and files

Latest commit

History

Repository files navigation

Kmeans-using-MapReduce

Tech Stack

Implementation Details

Installation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages