Commit

GitBook: No commit message

derrickburns authored and gitbook-bot committed Jan 18, 2024
1 parent 08bc609 commit 043e285

Showing 23 changed files with 726 additions and 0 deletions.
12 changes: 12 additions & 0 deletions src/docs/README.md
# Introduction

The goal of K-Means clustering is to produce a model of the clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** (`trait KMeansModel`). It is fundamentally a set of cluster centers and a function that defines the distance from an arbitrary point to a cluster center.

The K-Means algorithm computes a K-Means model using an iterative procedure known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's\_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to reflect the new assignment of points to clusters.

The update of the cluster centers is a form of averaging: newly assigned points are averaged into the cluster, while (optionally) reassigned points are removed from their prior clusters.
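To make the iteration concrete, here is a minimal sketch of a single Lloyd iteration on in-memory data using squared Euclidean distance. It is illustrative only: the names are hypothetical and this is not this package's implementation, which operates on Spark RDDs.

```scala
// Hypothetical sketch of one Lloyd iteration; not this package's implementation.
type Point = Array[Double]

def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def lloydIteration(points: Seq[Point], centers: IndexedSeq[Point]): IndexedSeq[Point] = {
  // Assignment step: attach each point to the index of its closest center.
  val assigned: Map[Int, Seq[Point]] =
    points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
  // Update step: move each center to the mean of the points assigned to it.
  centers.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) => ps.transpose.map(_.sum / ps.size).toArray // coordinate-wise mean
      case None     => centers(i) // a cluster that received no points keeps its center
    }
  }
}
```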

A `KMeansModel` can be constructed from any set of cluster centers and any distance function. However, the more interesting models satisfy an optimality constraint: if we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost" of the model on that set.

A K-Means Model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points.

24 changes: 24 additions & 0 deletions src/docs/SUMMARY.md
# Table of contents

* [Introduction](README.md)
* [Relation to Spark K-Means Clusterer](readme/relation-to-spark-k-means-clusterer.md)
* [Algorithms Implemented](readme/algorithms-implemented.md)
* [Requirements](requirements.md)
* [Quick Start](quick-start.md)
* [Concepts](concepts/README.md)
* [Bregman Divergence](concepts/bregman-divergence.md)
* [WeightedVector](concepts/weightedvector.md)
* [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md)
* [KMeansModel](concepts/kmeansmodel.md)
* [MultiKMeansClusterer](concepts/multikmeansclusterer.md)
* [KMeansSelector](concepts/kmeansselector.md)
* [Usage](usage/README.md)
* [Selecting a Distance Function](usage/selecting-a-distance-function.md)
* [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md)
* [Embedding Data](usage/embedding-data.md)
* [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md)
* [Iterative Clustering](usage/iterative-clustering.md)
* [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md)
* [Customizing](usage/customizing/README.md)
* [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md)
* [Creating a Custom Embedding](usage/customizing/creating-a-custom-embedding.md)
2 changes: 2 additions & 0 deletions src/docs/concepts/README.md
# Concepts

14 changes: 14 additions & 0 deletions src/docs/concepts/bregman-divergence.md
# Bregman Divergence

While one can assign a point to a cluster using any distance function, Lloyd's algorithm only converges for a certain class of distance functions called [Bregman divergences](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf). A Bregman divergence must define two methods: `convex`, which evaluates the underlying convex function at a point, and `gradientOfConvex`, which evaluates the gradient of that function at a point.

```scala
package com.massivedatascience.divergence

import org.apache.spark.ml.linalg.Vector

trait BregmanDivergence {

  /** Evaluate the underlying convex function F at the point v. */
  def convex(v: Vector): Double

  /** Evaluate the gradient of F at the point v. */
  def gradientOfConvex(v: Vector): Vector
}
```

For example, by defining `convex` to be the squared vector norm (i.e. the sum of the squares of the coordinates), one gets a distance function that equals the square of the well-known Euclidean distance.
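As a concrete illustration, a minimal sketch of that divergence might look like the following; the object name is our own choosing here, not necessarily the one this package uses.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import com.massivedatascience.divergence.BregmanDivergence

// Sketch: F(v) = ||v||^2 with gradient 2v; the induced Bregman divergence
// D(x, y) = F(x) - F(y) - <gradF(y), x - y> works out to ||x - y||^2.
object SquaredEuclideanDivergence extends BregmanDivergence {
  def convex(v: Vector): Double = v.toArray.map(x => x * x).sum
  def gradientOfConvex(v: Vector): Vector = Vectors.dense(v.toArray.map(_ * 2.0))
}
```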
42 changes: 42 additions & 0 deletions src/docs/concepts/bregmanpoint-bregmancenter-bregmanpointops.md
# BregmanPoint, BregmanCenter, BregmanPointOps

For efficient repeated computation of distances between a fixed set of points and varying cluster centers, it is convenient to pre-compute certain information and associate that information with each point or cluster center. The class that represents an enriched point is `BregmanPoint`. The class that represents an enriched cluster center is `BregmanCenter`. Users of this package do not construct instances of these objects directly.

```scala
package com.massivedatascience.divergence

trait BregmanPoint

trait BregmanCenter
```

We enrich a Bregman divergence with a set of commonly used operations, including factory methods `toPoint` and `toCenter` to construct instances of the aforementioned `BregmanPoint` and `BregmanCenter`.

The enriched trait is `BregmanPointOps`.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.rdd.RDD
import com.massivedatascience.divergence.BregmanDivergence
import com.massivedatascience.linalg.WeightedVector

trait BregmanPointOps {
  type P = BregmanPoint
  type C = BregmanCenter

  val divergence: BregmanDivergence

  /** Enrich a weighted vector into a point suitable for fast distance computation. */
  def toPoint(v: WeightedVector): P

  /** Enrich a weighted vector into a cluster center. */
  def toCenter(v: WeightedVector): C

  /** Has the center w moved appreciably away from the point v? */
  def centerMoved(v: P, w: C): Boolean

  /** Index of the closest center and the distance to it. */
  def findClosest(centers: IndexedSeq[C], point: P): (Int, Double)

  /** Index of the closest center. */
  def findClosestCluster(centers: IndexedSeq[C], point: P): Int

  /** Total cost of the data with respect to the centers. */
  def distortion(data: RDD[P], centers: IndexedSeq[C]): Double

  /** Cost of a single point with respect to the centers. */
  def pointCost(centers: IndexedSeq[C], point: P): Double

  def distance(p: BregmanPoint, c: BregmanCenter): Double
}
```
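As a usage sketch, assigning a raw vector to its closest center might look like this; the `ops` and `centers` values are assumed to exist (they would come from the package's factories and a trained model, respectively).

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.linalg.WeightedVector

// Hypothetical usage sketch: enrich a raw vector, then find its closest center.
val ops: BregmanPointOps = ???                 // assumed: some BregmanPointOps instance
val centers: IndexedSeq[BregmanCenter] = ???   // assumed: centers from a trained model
val point = ops.toPoint(WeightedVector(Vectors.dense(1.0, 2.0)))
val (closestIndex, cost) = ops.findClosest(centers, point)
```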
43 changes: 43 additions & 0 deletions src/docs/concepts/kmeansmodel.md
# KMeansModel

A K-Means model is a set of cluster centers. We abstract it with the `KMeansModel` trait, which provides methods to map an arbitrary point (viz. `Vector`, `WeightedVector`, or `BregmanPoint`) to the nearest cluster center and to compute the cost/distance to that center.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.api.java.JavaRDD

trait KMeansModel {

val pointOps: BregmanPointOps

def centers: IndexedSeq[BregmanCenter]


def predict(point: Vector): Int

def predictClusterAndDistance(point: Vector): (Int, Double)

def predict(points: RDD[Vector]): RDD[Int]

def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer]

def computeCost(data: RDD[Vector]): Double


def predictWeighted(point: WeightedVector): Int

def predictClusterAndDistanceWeighted(point: WeightedVector): (Int, Double)

def predictWeighted(points: RDD[WeightedVector]): RDD[Int]

def computeCostWeighted(data: RDD[WeightedVector]): Double


def predictBregman(point: BregmanPoint): Int

def predictClusterAndDistanceBregman(point: BregmanPoint): (Int, Double)

def predictBregman(points: RDD[BregmanPoint]): RDD[Int]

def computeCostBregman(data: RDD[BregmanPoint]): Double
}
```
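For instance, a hedged usage sketch (assuming `model` and `data` already exist):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch of the KMeansModel methods above.
val model: KMeansModel = ???   // assumed: e.g. produced by KMeans.train (see Quick Start)
val data: RDD[Vector] = ???    // assumed: the points to score
val assignments: RDD[Int] = model.predict(data)  // closest cluster index per point
val cost: Double = model.computeCost(data)       // total distortion of data under the model
```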
17 changes: 17 additions & 0 deletions src/docs/concepts/kmeansselector.md
# KMeansSelector

The initial selection of cluster centers is called the initialization step. We abstract implementations of the initialization step with the `KMeansSelector` trait.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.rdd.RDD

trait KMeansSelector extends Serializable {
def init(
ops: BregmanPointOps,
d: RDD[BregmanPoint],
numClusters: Int,
initialInfo: Option[(Seq[IndexedSeq[BregmanCenter]], Seq[RDD[Double]])] = None,
runs: Int,
seed: Long): Seq[IndexedSeq[BregmanCenter]]
}
```
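A hedged sketch of calling a selector (the `selector`, `ops`, and `data` values are assumed):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch: seed several runs' worth of initial centers.
val selector: KMeansSelector = ???   // assumed: e.g. a K-Means|| selector
val ops: BregmanPointOps = ???       // assumed: the distance function in use
val data: RDD[BregmanPoint] = ???    // assumed: enriched input points
val seeds: Seq[IndexedSeq[BregmanCenter]] =
  selector.init(ops, data, numClusters = 10, initialInfo = None, runs = 2, seed = 42L)
```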
21 changes: 21 additions & 0 deletions src/docs/concepts/multikmeansclusterer.md
# MultiKMeansClusterer

Lloyd's algorithm is simple to describe, but in practice different implementations are possible, and they can yield dramatically different running times depending on the data being clustered. We abstract the clusterer with the `MultiKMeansClusterer` trait.

```scala
import org.apache.spark.rdd.RDD

trait MultiKMeansClusterer extends Serializable with Logging {

  /** Cluster the data, returning a (distortion, centers) pair per initial seeding. */
  def cluster(
    maxIterations: Int,
    pointOps: BregmanPointOps,
    data: RDD[BregmanPoint],
    centers: Seq[IndexedSeq[BregmanCenter]]): Seq[(Double, IndexedSeq[BregmanCenter])]

  /** Convenience method: run `cluster` and keep the result with the lowest distortion. */
  def best(
    maxIterations: Int,
    pointOps: BregmanPointOps,
    data: RDD[BregmanPoint],
    centers: Seq[IndexedSeq[BregmanCenter]]): (Double, IndexedSeq[BregmanCenter]) = {
    cluster(maxIterations, pointOps, data, centers).minBy(_._1)
  }
}
```
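A hedged sketch of running a clusterer on pre-seeded centers (all values assumed):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch: run Lloyd's algorithm and keep the lowest-distortion result.
val clusterer: MultiKMeansClusterer = ???        // assumed: e.g. the COLUMN_TRACKING clusterer
val ops: BregmanPointOps = ???                   // assumed
val data: RDD[BregmanPoint] = ???                // assumed
val seeds: Seq[IndexedSeq[BregmanCenter]] = ???  // assumed: e.g. from a KMeansSelector
val (distortion, bestCenters) = clusterer.best(30, ops, data, seeds) // 30 = maxIterations
```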
32 changes: 32 additions & 0 deletions src/docs/concepts/weightedvector.md
# WeightedVector

Often, the data points being clustered have varying significance, i.e. they are weighted. This clusterer operates on weighted vectors. Use the `WeightedVector` companion object to construct weighted vectors.

```scala
package com.massivedatascience.linalg

import org.apache.spark.ml.linalg.Vector

trait WeightedVector extends Serializable {
  def weight: Double

  /** The vector in ordinary (inhomogeneous) coordinates. */
  def inhomogeneous: Vector

  /** The vector in homogeneous coordinates, i.e. scaled by its weight. */
  def homogeneous: Vector

  def size: Int = homogeneous.size
}

object WeightedVector {
  // Factory methods; implementations are elided in this excerpt.

  def apply(v: Vector): WeightedVector = ???

  def apply(v: Array[Double]): WeightedVector = ???

  def apply(v: Vector, weight: Double): WeightedVector = ???

  def apply(v: Array[Double], weight: Double): WeightedVector = ???

  def fromInhomogeneousWeighted(v: Array[Double], weight: Double): WeightedVector = ???

  def fromInhomogeneousWeighted(v: Vector, weight: Double): WeightedVector = ???
}
```
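A hedged construction sketch (assuming, as is conventional, that a plain vector receives unit weight):

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.linalg.WeightedVector

// Hypothetical usage sketch for the factory methods above.
val unweighted = WeightedVector(Vectors.dense(1.0, 2.0, 3.0))               // assumed weight 1.0
val weighted = WeightedVector(Vectors.dense(1.0, 2.0, 3.0), weight = 2.5)   // explicit weight
```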
109 changes: 109 additions & 0 deletions src/docs/quick-start.md
# Quick Start

The simplest way to train a `KMeansModel` on a fixed set of points is to use the `KMeans.train` method. This method is most similar in style to the one provided by the Spark 1.2.0 K-Means clusterer.

For dense data in a low-dimensional space using the squared Euclidean distance function, one may simply call `KMeans.train` with the data and the desired number of clusters:

```scala
import com.massivedatascience.clusterer.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ??? // the points to cluster
val model: KMeansModel = KMeans.train(data, k = 10)
```

The full signature of the `KMeans.train` method is:

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * Train a K-Means model using Lloyd's algorithm.
   *
   * @param data input data
   * @param k number of clusters desired
   * @param maxIterations maximum number of iterations of Lloyd's algorithm
   * @param runs number of parallel clusterings to run
   * @param mode initialization algorithm to use
   * @param distanceFunctionNames the distance functions to use
   * @param clustererName which k-means implementation to use
   * @param embeddingNames sequence of embeddings to use, from lowest dimension to greatest
   * @return K-Means model
   */
  def train(
    data: RDD[Vector],
    k: Int,
    maxIterations: Int = KMeans.defaultMaxIterations,
    runs: Int = KMeans.defaultNumRuns,
    mode: String = KMeansSelector.K_MEANS_PARALLEL,
    distanceFunctionNames: Seq[String] = Seq(BregmanPointOps.EUCLIDEAN),
    clustererName: String = MultiKMeansClusterer.COLUMN_TRACKING,
    embeddingNames: List[String] = List(Embedding.IDENTITY_EMBEDDING)): KMeansModel = ???
}
```

Many of these parameters will be familiar to users of the Spark 1.1 clusterer.

Similar to the Spark clusterer, we support data provided as `Vectors`, the desired number `k` of clusters, a limit `maxIterations` on the number of iterations of Lloyd's algorithm, and the number of parallel `runs` of the clusterer.

We also offer different initialization `mode`s, but unlike the Spark clusterer, we do not support setting the number of initialization steps for the mode at this level of the interface.

The `KMeans.train` helper method allows one to name a sequence of embeddings. Several embeddings are provided and may be constructed using the `apply` method of the companion object `Embedding`.

Different distance functions may be used for each embedding; there must be exactly one distance function per embedding provided, as in the sketch that follows.
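For example, a hedged sketch pairing two embeddings with two distance functions; the `Embedding.HAAR_EMBEDDING` constant is assumed here for illustration, while `Embedding.IDENTITY_EMBEDDING` appears in the signature above.

```scala
// Hypothetical sketch: one distance function per embedding, lowest dimension first.
val model: KMeansModel = KMeans.train(
  data, // assumed: RDD[Vector]
  k = 10,
  distanceFunctionNames = Seq(BregmanPointOps.EUCLIDEAN, BregmanPointOps.EUCLIDEAN),
  embeddingNames = List(Embedding.HAAR_EMBEDDING, Embedding.IDENTITY_EMBEDDING))
```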

Indeed, the `KMeans.train` helper translates the parameters into a call to the underlying `KMeans.trainWeighted` method.

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * Train a K-Means model using Lloyd's algorithm on WeightedVectors.
   *
   * @param runConfig run configuration
   * @param data input data
   * @param initializer initialization algorithm to use
   * @param pointOps the distance functions to use
   * @param embeddings sequence of embeddings to use, from lowest dimension to greatest
   * @param clusterer which k-means implementation to use
   * @return K-Means model
   */
  def trainWeighted(
    runConfig: RunConfig,
    data: RDD[WeightedVector],
    initializer: KMeansSelector,
    pointOps: Seq[BregmanPointOps],
    embeddings: Seq[Embedding],
    clusterer: MultiKMeansClusterer): KMeansModel = ???
}
```

The `KMeans.trainWeighted` method ultimately makes various calls to the underlying `KMeans.simpleTrain` method, which clusters the provided `BregmanPoint`s using the provided `BregmanPointOps` and the provided `KMeansSelector` with the provided `MultiKMeansClusterer`.

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * @param runConfig run configuration
   * @param data input data
   * @param pointOps the distance functions to use
   * @param initializer initialization algorithm to use
   * @param clusterer which k-means implementation to use
   * @return K-Means model
   */
  def simpleTrain(
    runConfig: RunConfig,
    data: RDD[BregmanPoint],
    pointOps: BregmanPointOps,
    initializer: KMeansSelector,
    clusterer: MultiKMeansClusterer): KMeansModel = ???
}
```
14 changes: 14 additions & 0 deletions src/docs/readme/algorithms-implemented.md
# Algorithms Implemented

Most practical variants of K-means clustering are implemented or can be implemented with this package.

* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf)
* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351)
* [clustering high dimensional Euclidean data](http://www.ida.liu.se/\~arnjo/papers/pakdd-ws-11.pdf)
* [clustering high dimensional time series data](http://www.cs.gmu.edu/\~jessica/publications/ikmeans\_sdm\_workshop03.pdf)
* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf)
* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01\_05.pdf)
* [clustering with near-optimality](http://theory.stanford.edu/\~sergei/papers/vldb12-kmpar.pdf)
* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf)

If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using this package and send a pull request along with the paper analyzing the variant!
26 changes: 26 additions & 0 deletions src/docs/readme/relation-to-spark-k-means-clusterer.md
# Relation to Spark K-Means Clusterer

This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer. 

This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
3 changes: 3 additions & 0 deletions src/docs/requirements.md
# Requirements

The massivedatascience-clusterer project is built for Spark 3.4, Scala 2.12, and Java 17.
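A hedged sbt sketch matching those versions; the artifact coordinates and versions below are assumptions, so check the project's own build for the authoritative ones.

```scala
// build.sbt — sketch only; coordinates and versions are assumptions.
scalaVersion := "2.12.18"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.4.0" % Provided,
  "com.massivedatascience" %% "massivedatascience-clusterer" % "x.y.z"
)
```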
2 changes: 2 additions & 0 deletions src/docs/usage/README.md
# Usage
