Sunday, 10 July 2016

Word Count Program with Spark and Python

Spark






Programming Spark applications is similar to programming in other data flow languages that have previously been implemented on Hadoop. Code is written in a driver program which is lazily evaluated, and upon an action, the driver code is distributed across the cluster to be executed by workers on their partitions of the RDD. Results are then sent back to the driver for aggregation or compilation. Essentially, the driver program creates one or more RDDs, applies operations to transform them, then invokes some action on the transformed RDD.
These steps are outlined as follows:
  1. Define one or more RDDs, either by accessing data stored on disk (HDFS, Cassandra, HBase, local disk), parallelizing some collection in memory, transforming an existing RDD, or by caching or saving.
  2. Invoke operations on the RDD by passing closures (functions) to each element of the RDD. Spark offers over 80 high level operators beyond Map and Reduce.
  3. Use the resulting RDDs with actions (e.g. count, collect, save, etc.). Actions kick off the computing on the cluster.
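As a minimal sketch of these three steps (the file name, its contents, and the length threshold are illustrative assumptions), a driver program might look like this:

from pyspark import SparkContext

sc = SparkContext("local", "rdd-steps")

# 1. Define an RDD, here by reading a text file from disk (hypothetical path)
lines = sc.textFile("data.txt")

# 2. Transform the RDD by passing a closure over each element; nothing runs yet
long_lines = lines.filter(lambda line: len(line) > 20)

# 3. Invoke an action, which kicks off the computation on the cluster
print(long_lines.count())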
When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure. Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion. Broadcast variables are distributed to all workers, but are read-only. Broadcast variables can be used as lookup tables or stopword lists. Accumulators are variables that workers can "add" to using associative operations and are typically used as counters.
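A small illustrative sketch of both kinds of shared variables, reusing the SparkContext sc from the sketch above (the stopword list and sample data are made up):

# Broadcast variable: a read-only stopword list shipped once to every worker
stopwords = sc.broadcast(set(["a", "an", "the"]))

words = sc.parallelize(["the", "spark", "driver", "a", "cluster"])
print(words.filter(lambda w: w not in stopwords.value).collect())

# Accumulator: workers can only add to it; the driver reads the final value
short_words = sc.accumulator(0)
words.foreach(lambda w: short_words.add(1) if len(w) <= 3 else None)
print(short_words.value)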

The following is the code for a word count program with Python and Spark.
Let us do a word count of the following text:
In the terminal, write the following code:
Next, proceed as follows:
Continue by writing the map code:
And the reduce code as follows:
Next, write the following:
Finally, we get the result:
The counted text file is also saved in the “wc” folder as follows:
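Putting the steps above together, a minimal PySpark word count that writes its output to a “wc” folder might look like this (the input file name is an assumption):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

# Read the input text (hypothetical file name) into an RDD of lines
text = sc.textFile("input.txt")

# Map: split each line into words and pair each word with a count of 1
pairs = text.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Reduce: sum the counts for each word
counts = pairs.reduceByKey(lambda a, b: a + b)

# Save the result; Spark writes a directory named "wc" containing part files
counts.saveAsTextFile("wc")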












K-means Program with Spark and Python

K-means
K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||. The implementation in spark.mllib has the following parameters:
  • k is the number of desired clusters.
  • maxIterations is the maximum number of iterations to run.
  • initializationMode specifies either random initialization or initialization via k-means||.
  • runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).
  • initializationSteps determines the number of steps in the k-means|| algorithm.
  • epsilon determines the distance threshold within which we consider k-means to have converged.
  • initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
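As a rough sketch of how these parameters are passed from Python, assuming a SparkContext sc (available automatically in the pyspark shell) and a whitespace-separated text file of numeric vectors whose name is made up here:

from numpy import array
from pyspark.mllib.clustering import KMeans

# Parse the text file into an RDD of NumPy arrays (hypothetical file name)
data = sc.textFile("kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split()]))

# Train a model using the parameters listed above
clusters = KMeans.train(parsed, k=3, maxIterations=20, runs=5,
                        initializationMode="k-means||",
                        initializationSteps=5, epsilon=1e-4)
print(clusters.clusterCenters)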


The following shows an implementation of the k-means algorithm with Python and Spark.

Following is the screenshot of the k-means program run in Spark using Python.

Snapshot of the text file used for k-means clustering, saved in the home path.


Data Source

Used the k-means code with K=3, WSS=30, and saved it as kmeans0.py in the home path.
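For reference, a sketch of what a script like kmeans0.py could contain, following the spark.mllib example with K = 3 and evaluating the clustering by its within-set sum of squared errors (WSSSE); the input file name and the parameters other than K are assumptions:

from math import sqrt
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans0")

# Load and parse the data points (hypothetical file in the home path)
data = sc.textFile("kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split()]))

# Cluster the data into K = 3 clusters
clusters = KMeans.train(parsed, 3, maxIterations=10, runs=10,
                        initializationMode="random")

# Distance from a point to its assigned cluster centre
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x ** 2 for x in (point - center)]))

# Within Set Sum of Squared Errors, as in the MLlib documentation example
wssse = parsed.map(lambda point: error(point)).reduce(lambda x, y: x + y)

# Write the cluster centres and WSSSE to Kmeans.out in the home path
with open("Kmeans.out", "w") as out:
    for c in clusters.centers:
        out.write(str(c) + "\n")
    out.write("WSSSE = " + str(wssse) + "\n")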


Then, from pyspark, run the following:


The Kmeans.out file is saved in the home path as follows:


The final 3 clusters are as follows:



























Installing Spark on Ubuntu using VirtualBox

Installing Spark on Ubuntu


This article describes the step-by-step approach to building and running Apache Spark 1.6.2 on Ubuntu. I’ve used Ubuntu 16.04 on VirtualBox 5.0.24 for the purpose of this blog post.

Below are the detailed steps to set it up.
Installation Steps:

  1. Install VirtualBox
  2. Install Ubuntu on VirtualBox
  3. Install Java
  4. Set up Spark on Ubuntu
Step 1: Install VirtualBox on the Windows machine



Step 2: Install Ubuntu on VirtualBox

  • First download Ubuntu 16.04 Xenial


  • Install Ubuntu on VirtualBox


Step 3: Install Java
To run Spark, Java must be installed on the Ubuntu machine. It can be installed easily with a couple of terminal commands.
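A common choice, assumed here, is OpenJDK from the Ubuntu repositories:

sudo apt-get update
sudo apt-get install default-jdk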


To check that the Java installation was successful, verify that Java runs from the terminal.
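For example, printing the installed version:

java -version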

Step 4: Setting up Spark on Ubuntu

Download Spark
I) Go to the Spark downloads page and choose the following options:
  • Choose a Spark release: pick the latest
  • Choose a package type: Source code (can be built against several Hadoop versions)
  • Choose a download type: select Direct Download


II) Unzip the Spark folder and rename it as spark.



III) Edit your BASH profile to add Spark to your PATH and to set the SPARK_HOME environment variable. These helpers will assist you on the command line. On Ubuntu, simply edit the ~/.bash_profile or ~/.profile files and add the following:
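For example, assuming the unzipped folder was renamed to spark and placed in your home directory, the added lines might look like this:

export SPARK_HOME=$HOME/spark
export PATH=$SPARK_HOME/bin:$PATH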



Type pyspark to run Spark




While loading the pyspark module into IPython, the following error may come up:

No module named py4j.java_gateway


To resolve this, make the py4j.java_gateway module visible to Python by adding the PySpark libraries to your PYTHONPATH:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

IV) After you source your profile (or simply restart your terminal), you should now be able to run a pyspark interpreter locally. Execute the pyspark command, and you should see a result as follows:


V) To check that the Spark installation was successful:
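For example, a quick sanity check from inside the pyspark shell is to print the Spark version and run a small job:

>>> sc.version                             # should print the Spark version, e.g. 1.6.2
>>> sc.parallelize(range(100)).count()     # should return 100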