Commit

GitBook: No commit message

derrickburns authored and gitbook-bot committed Jan 18, 2024
1 parent 08bc609 commit 043e285

Showing 23 changed files with 726 additions and 0 deletions.
12 changes: 12 additions & 0 deletions src/docs/README.md
# Introduction

The goal of K-Means clustering is to produce a model of the clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** (`trait KMeansModel`). It is fundamentally a set of cluster centers and a function that defines the distance from an arbitrary point to a cluster center.

The K-Means algorithm computes a K-Means model using an iterative procedure known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's\_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to reflect the new assignment of points to clusters.

The update of the cluster centers is a form of averaging: newly assigned points are averaged into the cluster, while (optionally) reassigned points are removed from their prior clusters.
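To make the iteration concrete, here is a minimal sketch of a single Lloyd iteration on in-memory data using squared Euclidean distance. It is illustrative only: the names are hypothetical and this is not this package's implementation, which operates on Spark RDDs.

```scala
// Hypothetical sketch of one Lloyd iteration; not this package's implementation.
type Point = Array[Double]

def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def lloydIteration(points: Seq[Point], centers: IndexedSeq[Point]): IndexedSeq[Point] = {
  // Assignment step: attach each point to the index of its closest center.
  val assigned: Map[Int, Seq[Point]] =
    points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
  // Update step: move each center to the mean of the points assigned to it.
  centers.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) => ps.transpose.map(_.sum / ps.size).toArray // coordinate-wise mean
      case None     => centers(i) // a cluster that received no points keeps its center
    }
  }
}
```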

A `KMeansModel` can be constructed from any set of cluster centers and any distance function. However, the more interesting models satisfy an optimality constraint: if we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost" of the model on that set.

A K-Means Model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points.

24 changes: 24 additions & 0 deletions src/docs/SUMMARY.md
# Table of contents

* [Introduction](README.md)
* [Relation to Spark K-Means Clusterer](readme/relation-to-spark-k-means-clusterer.md)
* [Algorithms Implemented](readme/algorithms-implemented.md)
* [Requirements](requirements.md)
* [Quick Start](quick-start.md)
* [Concepts](concepts/README.md)
* [Bregman Divergence](concepts/bregman-divergence.md)
* [WeightedVector](concepts/weightedvector.md)
* [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md)
* [KMeansModel](concepts/kmeansmodel.md)
* [MultiKMeansClusterer](concepts/multikmeansclusterer.md)
* [KMeansSelector](concepts/kmeansselector.md)
* [Usage](usage/README.md)
* [Selecting a Distance Function](usage/selecting-a-distance-function.md)
* [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md)
* [Embedding Data](usage/embedding-data.md)
* [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md)
* [Iterative Clustering](usage/iterative-clustering.md)
* [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md)
* [Customizing](usage/customizing/README.md)
* [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md)
* [Creating a Custom Embedding](usage/customizing/creating-a-custom-embedding.md)
2 changes: 2 additions & 0 deletions src/docs/concepts/README.md
# Concepts

14 changes: 14 additions & 0 deletions src/docs/concepts/bregman-divergence.md
# Bregman Divergence

While one can assign a point to a cluster using any distance function, Lloyd's algorithm only converges for a certain class of distance functions called [Bregman divergences](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf). A Bregman divergence must define two methods: `convex`, which evaluates the underlying convex function at a point, and `gradientOfConvex`, which evaluates the gradient of that function at a point.

```scala
package com.massivedatascience.divergence

import org.apache.spark.ml.linalg.Vector

trait BregmanDivergence {

  /** Evaluate the underlying convex function F at the point v. */
  def convex(v: Vector): Double

  /** Evaluate the gradient of F at the point v. */
  def gradientOfConvex(v: Vector): Vector
}
```

For example, by defining `convex` to be the squared vector norm (i.e. the sum of the squares of the coordinates), one gets a distance function that equals the square of the well-known Euclidean distance.
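As a concrete illustration, a minimal sketch of that divergence might look like the following; the object name is our own choosing here, not necessarily the one this package uses.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import com.massivedatascience.divergence.BregmanDivergence

// Sketch: F(v) = ||v||^2 with gradient 2v; the induced Bregman divergence
// D(x, y) = F(x) - F(y) - <gradF(y), x - y> works out to ||x - y||^2.
object SquaredEuclideanDivergence extends BregmanDivergence {
  def convex(v: Vector): Double = v.toArray.map(x => x * x).sum
  def gradientOfConvex(v: Vector): Vector = Vectors.dense(v.toArray.map(_ * 2.0))
}
```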
42 changes: 42 additions & 0 deletions src/docs/concepts/bregmanpoint-bregmancenter-bregmanpointops.md
# BregmanPoint, BregmanCenter, BregmanPointOps

For efficient repeated computation of distances between a fixed set of points and varying cluster centers, it is convenient to pre-compute certain information and associate that information with each point or cluster center. The class that represents an enriched point is `BregmanPoint`. The class that represents an enriched cluster center is `BregmanCenter`. Users of this package do not construct instances of these objects directly.

```scala
package com.massivedatascience.divergence

trait BregmanPoint

trait BregmanCenter
```

We enrich a Bregman divergence with a set of commonly used operations, including factory methods `toPoint` and `toCenter` to construct instances of the aforementioned `BregmanPoint` and `BregmanCenter`.

The enriched trait is `BregmanPointOps`.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.rdd.RDD
import com.massivedatascience.divergence.BregmanDivergence
import com.massivedatascience.linalg.WeightedVector

trait BregmanPointOps {
  type P = BregmanPoint
  type C = BregmanCenter

  val divergence: BregmanDivergence

  /** Enrich a weighted vector into a point suitable for fast distance computation. */
  def toPoint(v: WeightedVector): P

  /** Enrich a weighted vector into a cluster center. */
  def toCenter(v: WeightedVector): C

  /** Has the center w moved appreciably away from the point v? */
  def centerMoved(v: P, w: C): Boolean

  /** Index of the closest center and the distance to it. */
  def findClosest(centers: IndexedSeq[C], point: P): (Int, Double)

  /** Index of the closest center. */
  def findClosestCluster(centers: IndexedSeq[C], point: P): Int

  /** Total cost of the data with respect to the centers. */
  def distortion(data: RDD[P], centers: IndexedSeq[C]): Double

  /** Cost of a single point with respect to the centers. */
  def pointCost(centers: IndexedSeq[C], point: P): Double

  def distance(p: BregmanPoint, c: BregmanCenter): Double
}
```
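As a usage sketch, assigning a raw vector to its closest center might look like this; the `ops` and `centers` values are assumed to exist (they would come from the package's factories and a trained model, respectively).

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.linalg.WeightedVector

// Hypothetical usage sketch: enrich a raw vector, then find its closest center.
val ops: BregmanPointOps = ???                 // assumed: some BregmanPointOps instance
val centers: IndexedSeq[BregmanCenter] = ???   // assumed: centers from a trained model
val point = ops.toPoint(WeightedVector(Vectors.dense(1.0, 2.0)))
val (closestIndex, cost) = ops.findClosest(centers, point)
```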
43 changes: 43 additions & 0 deletions src/docs/concepts/kmeansmodel.md
# KMeansModel

A K-Means model is a set of cluster centers. We abstract it with the `KMeansModel` trait, which provides methods to map an arbitrary point (viz. `Vector`, `WeightedVector`, or `BregmanPoint`) to the nearest cluster center and to compute the cost/distance to that center.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.api.java.JavaRDD

trait KMeansModel {

val pointOps: BregmanPointOps

def centers: IndexedSeq[BregmanCenter]


def predict(point: Vector): Int

def predictClusterAndDistance(point: Vector): (Int, Double)

def predict(points: RDD[Vector]): RDD[Int]

def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer]

def computeCost(data: RDD[Vector]): Double


def predictWeighted(point: WeightedVector): Int

def predictClusterAndDistanceWeighted(point: WeightedVector): (Int, Double)

def predictWeighted(points: RDD[WeightedVector]): RDD[Int]

def computeCostWeighted(data: RDD[WeightedVector]): Double


def predictBregman(point: BregmanPoint): Int

def predictClusterAndDistanceBregman(point: BregmanPoint): (Int, Double)

def predictBregman(points: RDD[BregmanPoint]): RDD[Int]

def computeCostBregman(data: RDD[BregmanPoint]): Double
}
```
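For instance, a hedged usage sketch (assuming `model` and `data` already exist):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch of the KMeansModel methods above.
val model: KMeansModel = ???   // assumed: e.g. produced by KMeans.train (see Quick Start)
val data: RDD[Vector] = ???    // assumed: the points to score
val assignments: RDD[Int] = model.predict(data)  // closest cluster index per point
val cost: Double = model.computeCost(data)       // total distortion of data under the model
```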
17 changes: 17 additions & 0 deletions src/docs/concepts/kmeansselector.md
# KMeansSelector

The initial selection of cluster centers is called the initialization step. We abstract implementations of the initialization step with the `KMeansSelector` trait.

```scala
package com.massivedatascience.clusterer

import org.apache.spark.rdd.RDD

trait KMeansSelector extends Serializable {
def init(
ops: BregmanPointOps,
d: RDD[BregmanPoint],
numClusters: Int,
initialInfo: Option[(Seq[IndexedSeq[BregmanCenter]], Seq[RDD[Double]])] = None,
runs: Int,
seed: Long): Seq[IndexedSeq[BregmanCenter]]
}
```
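A hedged sketch of calling a selector (the `selector`, `ops`, and `data` values are assumed):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch: seed several runs' worth of initial centers.
val selector: KMeansSelector = ???   // assumed: e.g. a K-Means|| selector
val ops: BregmanPointOps = ???       // assumed: the distance function in use
val data: RDD[BregmanPoint] = ???    // assumed: enriched input points
val seeds: Seq[IndexedSeq[BregmanCenter]] =
  selector.init(ops, data, numClusters = 10, initialInfo = None, runs = 2, seed = 42L)
```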
21 changes: 21 additions & 0 deletions src/docs/concepts/multikmeansclusterer.md
# MultiKMeansClusterer

Lloyd's algorithm is simple to describe, but in practice different implementations are possible, and they can yield dramatically different running times depending on the data being clustered. We abstract the clusterer with the `MultiKMeansClusterer` trait.

```scala
import org.apache.spark.rdd.RDD

trait MultiKMeansClusterer extends Serializable with Logging {

  /** Cluster the data, returning a (distortion, centers) pair per initial seeding. */
  def cluster(
    maxIterations: Int,
    pointOps: BregmanPointOps,
    data: RDD[BregmanPoint],
    centers: Seq[IndexedSeq[BregmanCenter]]): Seq[(Double, IndexedSeq[BregmanCenter])]

  /** Convenience method: run `cluster` and keep the result with the lowest distortion. */
  def best(
    maxIterations: Int,
    pointOps: BregmanPointOps,
    data: RDD[BregmanPoint],
    centers: Seq[IndexedSeq[BregmanCenter]]): (Double, IndexedSeq[BregmanCenter]) = {
    cluster(maxIterations, pointOps, data, centers).minBy(_._1)
  }
}
```
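A hedged sketch of running a clusterer on pre-seeded centers (all values assumed):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch: run Lloyd's algorithm and keep the lowest-distortion result.
val clusterer: MultiKMeansClusterer = ???        // assumed: e.g. the COLUMN_TRACKING clusterer
val ops: BregmanPointOps = ???                   // assumed
val data: RDD[BregmanPoint] = ???                // assumed
val seeds: Seq[IndexedSeq[BregmanCenter]] = ???  // assumed: e.g. from a KMeansSelector
val (distortion, bestCenters) = clusterer.best(30, ops, data, seeds) // 30 = maxIterations
```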
32 changes: 32 additions & 0 deletions src/docs/concepts/weightedvector.md
# WeightedVector

Often, the data points being clustered have varying significance, i.e. they are weighted. This clusterer operates on weighted vectors. Use the `WeightedVector` companion object to construct weighted vectors.

```scala
package com.massivedatascience.linalg

import org.apache.spark.ml.linalg.Vector

trait WeightedVector extends Serializable {
  def weight: Double

  /** The vector in ordinary (inhomogeneous) coordinates. */
  def inhomogeneous: Vector

  /** The vector in homogeneous coordinates, i.e. scaled by its weight. */
  def homogeneous: Vector

  def size: Int = homogeneous.size
}

object WeightedVector {
  // Factory methods; implementations are elided in this excerpt.

  def apply(v: Vector): WeightedVector = ???

  def apply(v: Array[Double]): WeightedVector = ???

  def apply(v: Vector, weight: Double): WeightedVector = ???

  def apply(v: Array[Double], weight: Double): WeightedVector = ???

  def fromInhomogeneousWeighted(v: Array[Double], weight: Double): WeightedVector = ???

  def fromInhomogeneousWeighted(v: Vector, weight: Double): WeightedVector = ???
}
```
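A hedged construction sketch (assuming, as is conventional, that a plain vector receives unit weight):

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.linalg.WeightedVector

// Hypothetical usage sketch for the factory methods above.
val unweighted = WeightedVector(Vectors.dense(1.0, 2.0, 3.0))               // assumed weight 1.0
val weighted = WeightedVector(Vectors.dense(1.0, 2.0, 3.0), weight = 2.5)   // explicit weight
```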
109 changes: 109 additions & 0 deletions src/docs/quick-start.md
# Quick Start

The simplest way to train a `KMeansModel` on a fixed set of points is to use the `KMeans.train` method. This method is most similar in style to the one provided by the Spark 1.2.0 K-Means clusterer.

For dense data in a low-dimensional space using the squared Euclidean distance function, one may simply call `KMeans.train` with the data and the desired number of clusters:

```scala
import com.massivedatascience.clusterer.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ??? // the points to cluster
val model: KMeansModel = KMeans.train(data, k = 10)
```

The full signature of the `KMeans.train` method is:

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * Train a K-Means model using Lloyd's algorithm.
   *
   * @param data input data
   * @param k number of clusters desired
   * @param maxIterations maximum number of iterations of Lloyd's algorithm
   * @param runs number of parallel clusterings to run
   * @param mode initialization algorithm to use
   * @param distanceFunctionNames the distance functions to use
   * @param clustererName which k-means implementation to use
   * @param embeddingNames sequence of embeddings to use, from lowest dimension to greatest
   * @return K-Means model
   */
  def train(
    data: RDD[Vector],
    k: Int,
    maxIterations: Int = KMeans.defaultMaxIterations,
    runs: Int = KMeans.defaultNumRuns,
    mode: String = KMeansSelector.K_MEANS_PARALLEL,
    distanceFunctionNames: Seq[String] = Seq(BregmanPointOps.EUCLIDEAN),
    clustererName: String = MultiKMeansClusterer.COLUMN_TRACKING,
    embeddingNames: List[String] = List(Embedding.IDENTITY_EMBEDDING)): KMeansModel = ???
}
```

Many of these parameters will be familiar to users of the Spark 1.1 clusterer.

Similar to the Spark clusterer, we support data provided as `Vectors`, the desired number `k` of clusters, a limit `maxIterations` on the number of iterations of Lloyd's algorithm, and the number of parallel `runs` of the clusterer.

We also offer different initialization `mode`s, but unlike the Spark clusterer, we do not support setting the number of initialization steps for the mode at this level of the interface.

The `KMeans.train` helper method allows one to name a sequence of embeddings. Several embeddings are provided and may be constructed using the `apply` method of the companion object `Embedding`.

Different distance functions may be used for each embedding; there must be exactly one distance function per embedding provided, as in the sketch that follows.
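For example, a hedged sketch pairing two embeddings with two distance functions; the `Embedding.HAAR_EMBEDDING` constant is assumed here for illustration, while `Embedding.IDENTITY_EMBEDDING` appears in the signature above.

```scala
// Hypothetical sketch: one distance function per embedding, lowest dimension first.
val model: KMeansModel = KMeans.train(
  data, // assumed: RDD[Vector]
  k = 10,
  distanceFunctionNames = Seq(BregmanPointOps.EUCLIDEAN, BregmanPointOps.EUCLIDEAN),
  embeddingNames = List(Embedding.HAAR_EMBEDDING, Embedding.IDENTITY_EMBEDDING))
```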

Indeed, the `KMeans.train` helper translates the parameters into a call to the underlying `KMeans.trainWeighted` method.

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * Train a K-Means model using Lloyd's algorithm on WeightedVectors.
   *
   * @param runConfig run configuration
   * @param data input data
   * @param initializer initialization algorithm to use
   * @param pointOps the distance functions to use
   * @param embeddings sequence of embeddings to use, from lowest dimension to greatest
   * @param clusterer which k-means implementation to use
   * @return K-Means model
   */
  def trainWeighted(
    runConfig: RunConfig,
    data: RDD[WeightedVector],
    initializer: KMeansSelector,
    pointOps: Seq[BregmanPointOps],
    embeddings: Seq[Embedding],
    clusterer: MultiKMeansClusterer): KMeansModel = ???
}
```

The `KMeans.trainWeighted` method ultimately makes various calls to the underlying `KMeans.simpleTrain` method, which clusters the provided `BregmanPoint`s using the provided `BregmanPointOps` and the provided `KMeansSelector` with the provided `MultiKMeansClusterer`.

```scala
package com.massivedatascience.clusterer

object KMeans {

  /**
   * @param runConfig run configuration
   * @param data input data
   * @param pointOps the distance functions to use
   * @param initializer initialization algorithm to use
   * @param clusterer which k-means implementation to use
   * @return K-Means model
   */
  def simpleTrain(
    runConfig: RunConfig,
    data: RDD[BregmanPoint],
    pointOps: BregmanPointOps,
    initializer: KMeansSelector,
    clusterer: MultiKMeansClusterer): KMeansModel = ???
}
```
14 changes: 14 additions & 0 deletions src/docs/readme/algorithms-implemented.md
# Algorithms Implemented

Most practical variants of K-means clustering are implemented or can be implemented with this package.

* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf)
* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351)
* [clustering high dimensional Euclidean data](http://www.ida.liu.se/\~arnjo/papers/pakdd-ws-11.pdf)
* [clustering high dimensional time series data](http://www.cs.gmu.edu/\~jessica/publications/ikmeans\_sdm\_workshop03.pdf)
* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf)
* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01\_05.pdf)
* [clustering with near-optimality](http://theory.stanford.edu/\~sergei/papers/vldb12-kmpar.pdf)
* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf)

If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using this package and send a pull request along with the paper analyzing the variant!
26 changes: 26 additions & 0 deletions src/docs/readme/relation-to-spark-k-means-clusterer.md
# Relation to Spark K-Means Clusterer

This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer. 

This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
3 changes: 3 additions & 0 deletions src/docs/requirements.md
# Requirements

The massivedatascience-clusterer project is built for Spark 3.4, Scala 2.12, and Java 17.
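A hedged sbt sketch matching those versions; the artifact coordinates and versions below are assumptions, so check the project's own build for the authoritative ones.

```scala
// build.sbt — sketch only; coordinates and versions are assumptions.
scalaVersion := "2.12.18"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.4.0" % Provided,
  "com.massivedatascience" %% "massivedatascience-clusterer" % "x.y.z"
)
```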
2 changes: 2 additions & 0 deletions src/docs/usage/README.md
# Usage
