Sunday, 10 July 2016

Installing Spark on Ubuntu using VirtualBox

Installing Spark on Ubuntu


This article describes the step-by-step approach to build and run Apache Spark 1.6.2 on Ubuntu. I’ve used Ubuntu 16 on VirtualBox 5.0.24 for the purpose of this blog post.

Below are the detailed steps for the setup.
Installation Steps:

  1. Install VirtualBox
  2. Install Ubuntu on VirtualBox
  3. Install Java
  4. Set up Spark on Ubuntu
Step 1: Install VirtualBox on a Windows machine



Step 2: Install Ubuntu on VirtualBox

  • First download Ubuntu 16.04 Xenial


  • Install Ubuntu on VirtualBox


Step 3: Install Java
Spark runs on the JVM, so Java must be installed on the Ubuntu machine before Spark. Java can be installed from the terminal using the package manager.
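
The exact commands used for this post are not preserved here. As a minimal sketch, assuming the default OpenJDK package from the standard Ubuntu repositories is acceptable (an Oracle JDK from a PPA would be installed differently), the installation could look like this:

```shell
# Assumption: the default-jdk package (OpenJDK) from the Ubuntu repositories is used.
# Refresh the package index, then install the JDK without an interactive prompt.
sudo apt-get update
sudo apt-get install -y default-jdk
```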


To check that the Java installation was successful, run java -version in a terminal; it should print the installed Java version.

Step 4: Setting up Spark on Ubuntu

Download Spark
I) Go to the Apache Spark downloads page and choose the following options:
  • Choose a Spark release: pick the latest
  • Choose a package type: Source code [can build several Hadoop versions]
  • Choose a download type: Select Direct Download


II) Unzip the downloaded Spark archive and rename the extracted folder to spark.
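
A minimal sketch of this step from the terminal, assuming the source archive for Spark 1.6.2 was saved to ~/Downloads as spark-1.6.2.tgz and that the renamed folder should live in the home directory (both locations are assumptions). Because the source package was chosen, Spark also has to be built before it can run; the last command uses the Maven wrapper script shipped in the source tree, and the build can take a long time on a virtual machine.

```shell
# Assumption: the archive was downloaded to ~/Downloads as spark-1.6.2.tgz.
cd ~/Downloads
tar -xzf spark-1.6.2.tgz

# Rename the extracted folder to "spark" and place it in the home directory.
mv spark-1.6.2 ~/spark

# Build Spark from source, skipping the test suite to save time.
cd ~/spark
build/mvn -DskipTests clean package
```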



III) Edit your BASH profile to add Spark to your PATH and to set the SPARK_HOME environment variable. These helpers will assist you on the command line. On Ubuntu, simply edit the ~/.bash_profile or ~/.profile files and add the following:
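
The lines to add are sketched below, assuming the Spark folder from step II sits at ~/spark; adjust the path if it was placed elsewhere.

```shell
# Assumption: Spark was extracted and renamed to ~/spark in step II.
export SPARK_HOME="$HOME/spark"

# Put the Spark launcher scripts (pyspark, spark-shell, spark-submit) on the PATH.
export PATH="$SPARK_HOME/bin:$PATH"
```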



Type pyspark to run Spark




While loading the pyspark module into IPython, the following error may come up:

No module named py4j.java_gateway


To resolve this, Python needs to be able to find py4j.java_gateway. Add Spark's Python directory and the py4j archive bundled with Spark to PYTHONPATH by adding the following lines to your profile:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
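
The archive name above matches the py4j release bundled with Spark 1.6; other Spark releases ship a different py4j version, so the file name under $SPARK_HOME/python/lib should be checked. A minimal sketch that avoids hard-coding the version is shown here. The PY4J_ZIP variable and the ~/spark fallback are names introduced for illustration, not part of Spark.

```shell
# Assumption: fall back to ~/spark when SPARK_HOME is not already set.
: "${SPARK_HOME:=$HOME/spark}"

# Pick the first py4j source archive that ships with this Spark build, if any.
PY4J_ZIP=""
for zip in "$SPARK_HOME"/python/lib/py4j-*-src.zip; do
  if [ -e "$zip" ]; then PY4J_ZIP="$zip"; break; fi
done

# Prepend Spark's Python sources and the py4j archive to the module search path.
export PYTHONPATH="$SPARK_HOME/python:$PY4J_ZIP:${PYTHONPATH:-}"
```

If no matching archive is found, PY4J_ZIP stays empty and the import error will persist, which usually means SPARK_HOME points at the wrong folder.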

IV) After you source your profile (or simply restart your terminal), you should now be able to run a pyspark interpreter locally. Execute the pyspark command, and you should see a result as follows:


V) To check that the Spark installation was successful, run a small job and confirm that it completes without errors.
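
One way to do this, sketched here under the assumption that SPARK_HOME is set as in step III and that the build has finished, is to run the SparkPi example that ships with Spark. The argument 10 is the number of partitions and is an arbitrary choice.

```shell
# Run the bundled SparkPi example locally with 10 partitions.
# Among the log output there should be a line starting with "Pi is roughly".
"$SPARK_HOME/bin/run-example" SparkPi 10
```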















