SparkR config with Hadoop 2.0

I am engaged on a gig where SparkR will be used to run R jobs, and I am currently working on the configuration. Once I work through all the issues, I will post the steps for getting a Spark cluster working.

This is a follow-up to my initial frustration with SparkR. I was pretty close to giving up on figuring out why SparkR would not work on the cluster I was working on. After much back and forth with Shivaram (SparkR package author), we were finally able to get SparkR running as a cluster job.

SparkR can be downloaded from https://github.com/amplab-extras/SparkR-pkg

SparkR configuration

Install R

The instructions below are for Ubuntu.

root@host:~$ nano /etc/apt/sources.list    # add the CRAN Ubuntu repository line
root@host:~$ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
root@host:~$ apt-get install python-software-properties
root@host:~$ add-apt-repository ppa:marutter/rdev
root@host:~$ apt-get update
root@host:~$ apt-get install r-base
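As an optional sanity check, start an R session and confirm the install before going further:

# Inside an R session (start with: R)
R.version.string   # version of R installed from the repository
.libPaths()        # library paths where packages such as rJava will be installed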

Configure Java for R

wget http://cran.cnr.berkeley.edu/src/contrib/rJava_0.9-6.tar.gz

sudo R CMD INSTALL rJava_0.9-6.tar.gz
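To confirm that rJava installed correctly and can start a JVM, try loading it from an R session. If this fails with a Java error, running sudo R CMD javareconf and reinstalling the package often fixes the Java configuration:

# Inside an R session
library(rJava)
.jinit()                                          # start the JVM from within R
s <- .jnew("java/lang/String", "rJava is working")
print(.jstrVal(s))                                # should echo the string back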

Modify spark-env.sh
#!/usr/bin/env bash

export STANDALONE_SPARK_MASTER_HOST=hostname.domain.com

export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

export SPARK_LOCAL_IP=xxx.xxx.xxx.xxx

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark

if [ -n "$HADOOP_HOME" ]; then
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

### Comment out the two lines above and uncomment the following if
### you want to run with the Scala version that is included with the package
#export SCALA_HOME=${SCALA_HOME:-/usr/lib/spark/scala}
#export PATH=$PATH:$SCALA_HOME/bin

Note: This will need to be done on the worker nodes as well.

Switch user to hdfs

su - hdfs
Git Clone

git clone https://github.com/amplab-extras/SparkR-pkg

Building SparkR

SPARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./install-dev.sh
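Before copying anything to the worker nodes, it is worth checking that the freshly built package loads and runs in local mode. This is a minimal sketch assuming install-dev.sh placed the built package under SparkR-pkg/lib (as in the upstream README) and using the 0.1-era SparkR API; names may differ in later versions:

# From the SparkR-pkg directory, inside an R session
library(SparkR, lib.loc = "lib")        # load the locally built package
sc <- sparkR.init(master = "local")     # local mode, no cluster required
rdd <- parallelize(sc, 1:10, 2L)        # tiny RDD split across 2 partitions
count(rdd)                              # should return 10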

Copy SparkR-pkg to the worker nodes

Example: scp -r SparkR-pkg hdfs@worker1:

Execute Test Job

cd SparkR-pkg/

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

source /etc/spark/conf/spark-env.sh

./sparkR examples/pi.R spark://hostname.domain.com:7077

Sample results
hdfs@xxxx:~/SparkR-pkg$ ./sparkR examples/pi.R spark://xxxx.xxxxx.com:7077
./sparkR: line 13: /tmp/sparkR.profile: Permission denied
Loading required package: SparkR
Loading required package: methods
Loading required package: rJava
[SparkR] Initializing with classpath /var/lib/hadoop-hdfs/SparkR-pkg/lib/SparkR/sparkr-assembly-0.1.jar

14/02/27 16:29:09 INFO Slf4jLogger: Slf4jLogger started
Pi is roughly 3.14018
Num elements in RDD 200000
hdfs@xxxx:~/SparkR-pkg$
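For reference, the test job is essentially a Monte Carlo estimate of pi over an RDD. The sketch below is not examples/pi.R verbatim, just an approximation using the same 0.1-era SparkR API (sparkR.init, parallelize, lapply, reduce, count), with the master URL taken from the command line as in the run above:

library(SparkR)

args <- commandArgs(trailingOnly = TRUE)
sc <- sparkR.init(args[[1]], "PiR")      # e.g. spark://hostname.domain.com:7077

n <- 200000
rdd <- parallelize(sc, 1:n, 2L)

# 1 if a random point lands inside the unit circle, 0 otherwise
hits <- lapply(rdd, function(i) {
  p <- runif(2, min = -1, max = 1)
  as.numeric(p[1]^2 + p[2]^2 < 1)
})

inside <- reduce(hits, function(x, y) x + y)
cat("Pi is roughly", 4 * inside / n, "\n")
cat("Num elements in RDD", count(rdd), "\n")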