SparkR is becoming a popular choice for scalable, in-memory data science and machine learning.

For some background, please go through the blog posts below:

  1. https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
  2. https://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent-warmerdam/
  3. http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

In this tutorial I will explain, step-by-step, the process of installing Spark on Windows machines and configuring Spark with R.

Step 1

Download the latest version of Spark from the Download Spark page. I downloaded version 1.6.1 (the March 2016 release), in the package pre-built for Hadoop, which is the recommended choice for working with the Hadoop ecosystem and MLlib (Spark’s machine learning library).

pic1

The picture above shows the options I chose for my download.

Step 2

Next, extract the downloaded archive. I extracted it to the folder C:\spark-1.6.1-bin-hadoop2.6 on my Windows machine (you can choose any folder you like).

Next, add the Spark home directory and its bin and sbin directories to the environment variables. The picture below shows the paths to note (marked with red arrows).

pic2

To do that, search for “Environment” using the “Everywhere” search option; this brings up the windows shown below:

pic3pic4pic5

pic6

Now add the following paths to the environment variable (e.g. under user variables) and click OK to close the window:

C:\spark-1.6.1-bin-hadoop2.6;C:\spark-1.6.1-bin-hadoop2.6\bin;C:\spark-1.6.1-bin-hadoop2.6\sbin

Similarly, enter the same paths in the “Path” option (scroll down a bit to find it) under “System Variables”.

Click OK in each dialog to close all the open windows.
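To confirm that the change took effect, a quick check can be run from a fresh R session. This is only a minimal sketch in base R: `path_contains` is a hypothetical helper written for this tutorial (it is not part of R or Spark), and the folder name assumes the install location used above.

```r
# Hypothetical helper (not part of R or Spark): TRUE if `dir` is one of the
# entries in a PATH-style string. `sep` defaults to the platform separator
# (";" on Windows).
path_contains <- function(dir, path = Sys.getenv("PATH"),
                          sep = .Platform$path.sep) {
  entries <- strsplit(path, sep, fixed = TRUE)[[1]]
  # Normalise so "C:\spark\bin\" and "c:/spark/bin" compare as equal
  norm <- function(p) tolower(sub("/+$", "", gsub("\\\\", "/", trimws(p))))
  norm(dir) %in% norm(entries)
}

# Assumed install location from Step 2; adjust to your own folder.
spark_home <- "C:/spark-1.6.1-bin-hadoop2.6"
print(path_contains(spark_home))
print(path_contains(file.path(spark_home, "bin")))
```

If either line prints FALSE, the corresponding entry is probably missing from Path, or the R session was started before the variables were saved.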

Step 3

Install the latest version of R (3.3.0 in my case) and RStudio. Then add the R installation path to the PATH variable, following the same steps as in Step 2.

I added this a little bit differently because of my other data science work:

variable name : R_HOME

variable value : C:\Program Files\R\R-3.3.0;C:\Program Files\R\R-3.3.0\bin\;C:\Program Files\R\R-3.3.0\bin\x64\;

pic7pic8

Click OK in each dialog to close all the open windows.
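For comparison with the values entered above, it can help to see what R itself reports. The snippet below is a small sketch in base R; it only reads settings and is not specific to Spark.

```r
# R's own idea of its installation directory
print(R.home())

# Directory holding the R executables for the running session
print(R.home("bin"))

# Full path of Rscript as found via PATH; an empty string means it was not found
print(Sys.which("Rscript"))
```

If the last line prints an empty string, the R bin directory is most likely not on PATH yet.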

Step 4

Open Command Prompt as an administrator and run the command “SparkR”.

If everything is configured correctly and the command runs successfully, you should see the message “Spark context is available …”, as shown in the picture below:

pic9

Step 5

Finally, we will configure SparkR inside RStudio to connect to Spark.

Open RStudio and create a new script.

Enter the code below and execute the lines one by one (please don’t forget to change the paths to match your own setup):

# Set the "SPARK_HOME" environment variable
Sys.setenv(SPARK_HOME = "C:/spark-1.6.1-bin-hadoop2.6")

# Add SparkR's library directory to the library search path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# Load the SparkR library
library(SparkR)

If run successfully, you should see the following message in RStudio’s console…

pic10
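Once the library loads, a short smoke test can confirm that RStudio really talks to Spark. The sketch below assumes the SparkR 1.6.x API that ships with the version used in this tutorial (`sparkR.init`, `sparkRSQL.init`, `createDataFrame`), a local master, and the install path from Step 2. `sparkr_smoke_test` is a hypothetical name chosen for illustration, and `faithful` is a small data set bundled with R.

```r
# Hypothetical helper: start a local Spark context, copy a small R data set
# into a Spark DataFrame, print its first rows, and shut Spark down again.
# Assumes the SparkR 1.6.x API and the install location used in this tutorial.
sparkr_smoke_test <- function(spark_home = "C:/spark-1.6.1-bin-hadoop2.6") {
  Sys.setenv(SPARK_HOME = spark_home)
  .libPaths(c(file.path(spark_home, "R", "lib"), .libPaths()))
  library(SparkR)

  # Local mode, using all available cores
  sc <- sparkR.init(master = "local[*]", appName = "sparkr-smoke-test")
  # Make sure the Spark context is stopped even if a later step fails
  on.exit(sparkR.stop(), add = TRUE)

  sqlContext <- sparkRSQL.init(sc)

  # Copy the built-in 'faithful' data frame into a Spark DataFrame
  df <- createDataFrame(sqlContext, faithful)
  print(head(df))

  # Return the row count as computed by Spark
  count(df)
}
```

Calling `sparkr_smoke_test()` from the RStudio console should print the first rows of `faithful` and return the number of rows Spark counted.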

That’s it!!

Everything is now set up, and you are ready to explore SparkR on your Windows machine.

 

 
