Sunday, 10 July 2016

K-means Program with Spark and Python

K-means
K-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called k-means||. The implementation in spark.mllib has the following parameters:
  • k is the number of desired clusters.
  • maxIterations is the maximum number of iterations to run.
  • initializationMode specifies either random initialization or initialization via k-means||.
  • runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
  • initializationSteps determines the number of steps in the k-means|| algorithm.
  • epsilon determines the distance threshold within which we consider k-means to have converged.
  • initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.


The following shows an implementation of the k-means algorithm with Python and Spark.

Below is a screenshot of the k-means program run in Spark using Python.

Snapshot of the text file used for k-means clustering, saved in the home path.
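The snapshot itself is not reproduced here. For reference, such a file holds one point per line as space-separated coordinates; Spark's bundled sample data/mllib/kmeans_data.txt has exactly this shape:

```
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
```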


Data Source

Used the k-means code with K = 3 and WSS = 30, and saved it as kmeans0.py in the home path.


Then, from pyspark, run the following command.
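The command is shown only as a screenshot; typical ways to run such a script (the file name and its location on PATH are assumptions) are:

```shell
# From the OS shell, via spark-submit:
spark-submit kmeans0.py

# Or from inside the pyspark interactive shell (Python 2 era, as in 2016):
#   >>> execfile("kmeans0.py")
```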


Kmeans.out is saved in the home path as follows:


The final 3 clusters are as follows:
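The resulting clusters themselves are in the screenshot. For intuition, the assignment/update loop that produces such cluster centres can be sketched in plain Python, with toy 2-D data that is not the post's actual numbers:

```python
# Plain-Python sketch of Lloyd's k-means iteration (toy data, K = 2;
# the post's real clusters come from the Spark run, not from this code).
def kmeans(points, centers, iterations=20):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        groups = {i: [] for i in range(len(centers))}
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its old centre).
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*pts)) if pts else centers[i]
            for i, pts in groups.items()
        ]
    return centers

pts = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
       (9.0, 9.0), (9.1, 8.9), (8.8, 9.2)]
centers = kmeans(pts, centers=[pts[0], pts[3]])
print(centers)
```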
