Big Data SQL in action

Recently I have been engaged in implementing the Oracle Big Data SQL connector for a customer we are helping.

Here is a preview of the Big Data SQL connector, which benefits from both Exadata smart scans and Hadoop's massive parallelization… I will publish an article in the future with the steps to implement Big Data SQL with Exadata and the BDA appliance.
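
For orientation while that article is pending: ORA_FLIGHTS below is an external table that Big Data SQL resolves against Hadoop. Here is a minimal sketch of how such a table is typically declared with the ORACLE_HIVE access driver; the connect string, column list, directory, cluster, and Hive table names are all placeholders, not the actual definition.

# Hypothetical sketch only; connect string, columns, and names are placeholders.
sqlplus scott/tiger <<'SQL'
CREATE TABLE ora_flights (
  year    NUMBER,
  carrier VARCHAR2(8),
  origin  VARCHAR2(5)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=bda1
    com.oracle.bigdata.tablename=default.flights
  )
)
PARALLEL 2
REJECT LIMIT UNLIMITED;
SQL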

BDSQL> select /*+ MONITOR */ /* TESTAHK_YR */ count(*) FROM ORA_FLIGHTS group by YEAR;

COUNT(*)
----------
5411843
5967780
5683047
5270893
5327435
7129270
5180048
5271359
5076925
22
7140596
5070501
7141922
5527884
5384721
1311826
5351983
7453215
5041200
5202096
6488540
5092157
7009728
Elapsed: 00:00:15.34

Execution Plan
----------------------------------------------------------
Plan hash value: 3679660899

--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 22 | 88 | 204K (2)| 00:00:08 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | PCWP | |
| 4 | PX RECEIVE | | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | PCWP | |
| 5 | PX SEND HASH | :TQ10000 | 22 | 88 | 204K (2)| 00:00:08 | Q1,00 | P->P | HASH |
| 6 | HASH GROUP BY | | 22 | 88 | 204K (2)| 00:00:08 | Q1,00 | PCWP | |
| 7 | PX BLOCK ITERATOR | | 123M| 471M| 202K (1)| 00:00:08 | Q1,00 | PCWC | |
| 8 | EXTERNAL TABLE ACCESS STORAGE FULL| ORA_FLIGHTS | 123M| 471M| 202K (1)| 00:00:08 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 2 because of table property

Statistics
----------------------------------------------------------
293 recursive calls
100 db block gets
302 consistent gets
8 physical reads
0 redo size
995 bytes sent via SQL*Net to client
563 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
23 rows processed

BDSQL>
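
The optimizer note in the plan says the degree of parallelism is 2 because of a table property, i.e. the PARALLEL setting on the external table itself. If the scan should fan out wider, that property can be raised; a quick sketch (connect string and degree are placeholders):

# Raise the DOP stored on the table definition; later scans of
# ORA_FLIGHTS can then run with up to 8 PX slaves instead of 2.
sqlplus scott/tiger <<'SQL'
ALTER TABLE ora_flights PARALLEL 8;
SQL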


SparkR config with Hadoop 2.0

I am engaged on a gig where SparkR will be used to run R jobs, and I am currently working on the configuration. Once I have troubleshot all the issues, I will post the steps to get a Spark cluster working.

This is a follow-up to my initial frustration with SparkR. I was pretty close to giving up on figuring out why SparkR would not work on the cluster I was working on. After much back and forth with Shivaram (the SparkR package author), we were finally able to get SparkR working as a cluster job.

SparkR can be downloaded from https://github.com/amplab-extras/SparkR-pkg

SparkR configuration

Install R

The instructions below are for Ubuntu:

root@host:~$ nano /etc/apt/sources.list
root@host:~$ apt-get install python-software-properties
root@host:~$ add-apt-repository ppa:marutter/rdev
root@host:~$ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
root@host:~$ apt-get update
root@host:~$ apt-get install r-base
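
A quick sanity check that R actually landed on the box before building rJava:

# Confirm R is installed and on the PATH.
R --version
Rscript -e 'print(R.version.string)'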

Configure Java for R

wget http://cran.cnr.berkeley.edu/src/contrib/rJava_0.9-6.tar.gz

sudo R CMD INSTALL rJava_0.9-6.tar.gz
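
If the rJava build complains that it cannot find a JVM, re-running R's Java configuration usually sorts it out; a sketch, with a JDK path you would adjust for your host:

# Point R at the JDK (path is an assumption for this host), rebuild
# R's Java configuration, then verify rJava can start a JVM.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
sudo -E R CMD javareconf
Rscript -e 'library(rJava); .jinit(); cat("rJava OK\n")'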

Modify spark-env.sh
#!/usr/bin/env bash

export STANDALONE_SPARK_MASTER_HOST=hostname.domain.com

export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

export SPARK_LOCAL_IP=xxx.xxx.xxx.xxx

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark

if [ -n "$HADOOP_HOME" ]; then
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

### Comment out the two lines above and uncomment the following if
### you want to run with the Scala version that is included with the package
#export SCALA_HOME=${SCALA_HOME:-/usr/lib/spark/scala}
#export PATH=$PATH:$SCALA_HOME/bin

Note: This will need to be done on the worker nodes as well.
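
A quick way to push the same file out, assuming passwordless SSH and placeholder worker hostnames:

# Copy the edited spark-env.sh to each worker (hostnames are examples).
for w in worker1 worker2 worker3; do
  scp /etc/spark/conf/spark-env.sh root@$w:/etc/spark/conf/
done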

Switch user to hdfs
su - hdfs
Git Clone

git clone https://github.com/amplab-extras/SparkR-pkg

Building SparkR

SPARK_HADOOP_VERSION=2.2.0-cdh5.0.0-beta-2 ./install-dev.sh
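
The build should leave behind the assembly jar that appears in the classpath at job launch, so it is worth confirming before copying anything to the workers:

# install-dev.sh builds the SparkR assembly jar under lib/SparkR.
ls -lh lib/SparkR/sparkr-assembly-0.1.jar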

Copy SparkR-pkg to the worker nodes

Example: scp -r SparkR-pkg hdfs@worker1:
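
With more than one or two workers, a loop saves typing (hostnames are examples):

# Copy the built package to every worker's hdfs home directory.
for w in worker1 worker2 worker3; do
  scp -r SparkR-pkg hdfs@$w:
done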

Execute Test Job

cd SparkR-pkg/

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

source /etc/spark/conf/spark-env.sh

./sparkR examples/pi.R spark://hostname.domain.com:7077

Sample results
hdfs@xxxx:~/SparkR-pkg$ ./sparkR examples/pi.R spark://xxxx.xxxxx.com:7077
./sparkR: line 13: /tmp/sparkR.profile: Permission denied
Loading required package: SparkR
Loading required package: methods
Loading required package: rJava
[SparkR] Initializing with classpath /var/lib/hadoop-hdfs/SparkR-pkg/lib/SparkR/sparkr-assembly-0.1.jar

14/02/27 16:29:09 INFO Slf4jLogger: Slf4jLogger started
Pi is roughly 3.14018
Num elements in RDD 200000
hdfs@xxxx:~/SparkR-pkg$
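
Beyond the bundled pi.R, a minimal job of your own follows the same pattern. A sketch against the SparkR-pkg API (the file name and element count are arbitrary):

# Write a tiny SparkR job and submit it the same way as pi.R.
cat > /tmp/sum.R <<'EOF'
library(SparkR)
args <- commandArgs(trailingOnly = TRUE)
sc   <- sparkR.init(args[[1]], "SumR")  # master URL from the command line
rdd  <- parallelize(sc, 1:100000, 2)    # distribute the vector over 2 slices
cat("Sum is", reduce(rdd, "+"), "\n")   # aggregate on the cluster
EOF
./sparkR /tmp/sum.R spark://hostname.domain.com:7077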

Hadoop cluster deployment

I have successfully created multiple Hadoop clusters; the biggest hurdle I have run into is documentation.

The documentation is either missing key steps, or it does not apply due to environment differences. The following is a list of the clusters I have created:

Hadoop Cloudera single-node Master/Datanode
Apache Hadoop manual install by downloading packages
Apache Hadoop CDH4 3-node cluster
Apache Hadoop CDH4 7-node cluster

I’d like to hear what other people have to say about their experience.

Hadoop HDFS database

I have recently started to mess with the Hadoop HDFS database system.

Here is what I plan to do:

  1. Create a standalone HDFS instance
  2. Create a 3-node HDFS cluster
  3. Load test both the single node and the cluster (see the sketch after this list)
  4. Load test the database workload against Oracle, MySQL, and HDFS
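
For the HDFS side, the stock TestDFSIO benchmark that ships with Hadoop is the obvious starting point; a sketch (the test jar's location varies by distribution, so the path is an assumption):

# Write then read back 10 x 1GB files; throughput numbers land in
# TestDFSIO_results.log in the working directory.
JAR=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
hadoop jar $JAR TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $JAR TestDFSIO -read  -nrFiles 10 -fileSize 1000
hadoop jar $JAR TestDFSIO -clean   # clean up the benchmark files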

I should have identical hardware to perform the load tests… keep checking back for the setup and results.

Also, I am curious to know how many people are interested in Hadoop.

 

Thanks,