Getting started with AWS Athena – Part 4

In the previous blog (Part-3), I compared basic workloads on Athena with other query engines, both on-prem and cloud-based. In this post, we will dive a bit deeper to understand how the service works and how Amazon built the Athena service.

First, let's understand the service flow. The figure below shows how the AWS Athena workflow operates and how you can take cold data and run analytics on the dataset.

Athena flow

Let’s decouple the entire flow –

  • When you create a table, the table metadata is stored in the metadata store (indicated by the red arrow).
  • The table definition holds a reference to where the data resides in the S3 bucket (indicated by the blue pointers); a minimal sketch of this appears right after the list.
  • Athena will also create an S3 bucket to store service logs (indicated by the dotted line).
  • AWS Athena relies on the Presto in-memory query engine for fast query analytics.
  • The results can either be displayed on the Athena console or pushed to AWS QuickSight for slicing and dicing the data.
  • AWS QuickSight is a great way to understand the data, slice and dice it, and publish dashboards.
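
To make the flow concrete, here is a minimal sketch in Athena-style SQL. The table, columns, and bucket name are hypothetical (not from the posts below): the DDL only registers metadata pointing at S3, and the query runs through Presto, with the output shown on the console and also written to the service output bucket.

-- Hypothetical table: only metadata is registered; the data itself stays in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS default.flow_demo (
 `event_id` string,
 `event_time` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://my-cold-data-bucket/events/';

-- The query executes on Presto; results appear on the Athena console and are
-- also written to the Athena-managed S3 output/log bucket.
SELECT count(*) FROM default.flow_demo;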

There are some limitations with AWS Athena, shown in the table below:

Athena service limitations

Action                      Limit
Parallel query submits      1
Parallel query executions   5
Number of databases         100
Tables per database         100
Partitions per table        20K
S3 log bucket               1 log bucket for service outputs

Conclusion

Again, AWS Athena is a good way to start learning about your data quality and data trends, and to convert raw data into dashboards with a few clicks.

In Part-5, I will touch more on AWS Athena + QuickSight and how data can be quickly converted into dashboards.

Hope this post helps you understand the AWS Athena workflow. Comments and questions are welcome!

Thanks!

Getting started with AWS Athena – Part 3

In the previous blog (Part-2), I created two tables using JSON and CSV formats. In this post (Part-3), I will talk about how one can explore a dataset and query large data with predicate filtering and some basic inner joins using Athena. I will also compare the performance with an on-prem Hadoop cluster and AWS EMR.

For this benchmark I am comparing between following platforms:

  • AWS EMR (1 master, 4 cores [m3.xlarge])
  • On-Prem Hadoop cluster (4 nodes)
    • Hive
    • Impala
    • Hive+Spark
  • AWS Athena

First, I need to set up my tables, again using a similar method to the previous blog; I simply generated my DDL and created an “external” table on top of my S3 dataset.

Before I can create tables, I should give readers some context about the dataset. I downloaded the data from data.gov and I am using the “Consumer Complaint” dataset. For easy access, here is the direct link to the dataset:

https://catalog.data.gov/dataset/consumer-complaint-database

Data.gov provides a quite extensive amount of open data which can be used for benchmarks and data discovery. I downloaded the CSV-formatted file and converted it to a JSON file. I did some testing with the JSON file, but the numbers don't seem to be accurate, so I will not include them in this post for now.

DDL for Text Table


 CREATE EXTERNAL TABLE IF NOT EXISTS default.Consumer_Complaint_csv (
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 `complaint_id` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://athena1s3bucket/csv/';

Great, I have my table created; now let's execute some basic queries to make sure I can access the data in the S3 bucket.

select count(*) from Consumer_Complaint_csv;
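
Besides the row count, a quick preview helps sanity-check that the CSV is being parsed into the right columns; this is just a sketch using columns from the DDL above:

select date_received, product, state, company
from Consumer_Complaint_csv
limit 10;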

I created similar tables on AWS EMR and the on-prem Hadoop cluster, using the DDL below:

create external Table default.Consumer_Complaint_csv
(
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 complaint_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION '/tmp/Consumer/csv' 
;
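
On the Hadoop side, the data has to exist under the HDFS location referenced above (for example, copied into /tmp/Consumer/csv with hdfs dfs -put). Assuming that has been done, the same sanity check works in Hive and Impala; the REFRESH below is only needed on the Impala side when files are added after the table is created:

-- Verify the external table can see the files under /tmp/Consumer/csv
select count(*) from default.Consumer_Complaint_csv;

-- Impala only: pick up newly added data files for this table
REFRESH default.Consumer_Complaint_csv;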

HDFS Cluster config:

Now that I have my tables created in all environments, let's start executing some queries for benchmarking purposes. I used the queries below to extract data; for consistency, the structure is the same across the board.

Q1  – Simple count

select count(*) from consumer_complaint_csv;

Q2 – Simple count with predicate filter

select count(*), zip_code from consumer_complaint_csv where state = 'TX'
group by zip_code having count(*) > 20;

Q3 – Self inner join with predicate filter

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
group by a.product, a.company , a.issue 
having count(*) > 50;

Q4 – Self inner join with predicate filter and in list

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
and a.product in ('Debt collection','Credit card')
group by a.product, a.company , a.issue 
having count(*) > 50;

Looking at the queries, there is nothing fancy, just simple SQL. My goal here is to measure performance and see whether AWS Athena holds up to its promise and is performant enough. I'm sure I could get better performance with Parquet or ORC files, but the goal here is to see whether the service works. I can say that I am impressed; not having to worry about what is under the hood or about infrastructure makes it a good tool.
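
For reference, one way to test the columnar-format idea mentioned above would be a Hive CTAS that writes a Parquet copy of the CSV table; this is only a sketch under that assumption (the table name is hypothetical and the column list is trimmed to the hyphen-free columns), not something I benchmarked here:

-- Hive sketch: materialize a Parquet copy of a few columns from the CSV table.
CREATE TABLE default.Consumer_Complaint_parquet
STORED AS PARQUET
AS
SELECT date_received, product, issue, company, state, zip_code, complaint_id
FROM default.Consumer_Complaint_csv;

-- An external table pointing at these Parquet files (in S3 or HDFS) could then
-- serve the same benchmark queries, typically scanning far less data.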

Now let’s look at the numbers:

Benchmark results chart

Note: All the timings above are in seconds.

One thing to note is that the dataset size is 500 MB for Q3 & Q4 due to the self join. For Q1 & Q2, the dataset size is 263 MB.

Conclusion

On the performance front, it is not bad but not great. Keeping in mind that I don't have to pay for the underlying infrastructure, only for the queries I execute, that's bang for the buck!

Overall I like the performance and I will certainly leverage Athena for my future designs.

I am not saying one should ditch the Hadoop cluster or EMR and start using Athena for ongoing operations. I think Athena has its place in the toolkit; it can be a good starting point for data discovery and understanding data when one does not know its quality.

Hope this post helps you understand a bit more about AWS Athena as a service. Do give it a try yourself and let me know your thoughts and comments.

Thanks!

Getting started with AWS Athena – Part 2

In the previous blog, I talked about how to get going with Athena as a service. In this post, we will create two types of tables:

  • A table on existing text data in an S3 bucket
  • A table on an S3 bucket with JSON-formatted data

From looking at the structure, AWS is exposing S3 storage as HDFS-style external storage via HiveServer2 and the Hive metastore. So, suppose you have a table in Hadoop and need to migrate the data to S3, or you are landing data in S3 directly; features like Hive SerDes can transform your dataset on the fly when it is accessed from the S3 bucket. Interesting feature; now let's see it in action.

I assume you already have some data in an S3 bucket or can copy data from a source to an S3 bucket. I might do another post explaining how to extend your HDFS cluster with S3.

The first table that I will create is based on a comma-separated text file:

DDL for Text Table

CREATE EXTERNAL TABLE IF NOT EXISTS default.kinesis_stream_test (
 `trans_id` string,
 `trans_num` string,
 `source_ip` string,
 `kinesis_stream_event` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://myS3bucket/kinesis_stream/text/';

An interesting thing to notice here is that I am not using the “create table” wizard; I can simply generate my DDL and create an external table on top of my data. All I need is my S3 location and my table is ready. I used a similar method for both the text dataset and the JSON-formatted table.

athena-text-table

Sample dataset staged in S3 bucket:

ff6d3ee7-cd2b-4d62-8a5c-3e50ff96f120,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
92abd107-333b-4749-8b4c-64fa78a6f8d3,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
a67deb57-36c6-48cd-99b3-0cb89814c1bf,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
96dafd35-b8d1-4334-bb01-73796e823be5,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
2eec7e53-e696-4a93-a541-1f3f2757e804,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
b16e4fa4-8971-4dd9-9701-20a60f603618,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
1e8be437-66d8-4262-9409-5638f7305c33,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
b7345fde-73e5-4fb2-945d-7189af290ea9,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
ac324869-aa5f-4c7b-865d-210ec79a14d8,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
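
With the table created, a quick query (just a sketch; adjust the table name and bucket to your own) confirms that rows from the staged file come back as expected:

select trans_id, source_ip, kinesis_stream_event
from default.kinesis_stream_test
limit 10;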

DDL for JSON Table

CREATE EXTERNAL TABLE IF NOT EXISTS default.kinesis_stock_stream (
 `change` double,
 `price` double,
 `sector` string,
 `ticker_symbol` string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = '1'
) LOCATION 's3://aeg-awsfirehose/2017/01/';
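
As a quick check that the JsonSerDe maps the attributes correctly, a small aggregate over the columns defined above could look like this (a sketch; results depend on what is actually in the bucket):

select ticker_symbol, sector,
       avg(price) as avg_price,
       sum(change) as total_change
from default.kinesis_stock_stream
group by ticker_symbol, sector
order by avg_price desc
limit 10;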

Getting Started with AWS Athena – Part 1

At the last AWS re:Invent, AWS announced a new service called “Athena” (named after the Greek virgin goddess of reason). There has been a lot of fuss about the AWS S3 service, and as I am getting more and more comfortable with the AWS platform, I thought I would put Athena to the test.

In this five-part series, we will take a deep dive into how the service works out of the box and with some customizations in the mix.

So, let's get our hands in the mix:

  • Login to your AWS console
  • Search for Athena service
  • Once on the landing page, you get the option to create an ELB table
  • You will get to a tutorial page; let's use that for our initial table creation.

  • The wizard will ask you to create a new table

  • To keep the structure simple, I created a new metadata database for all my Athena tables:
    • DB Name = my_athena_db
    • Table name = my_first_athena_table
    • S3 example location = s3://athena-examples-us-east-2/elb/plaintext

  • Next, since we are going to mine web logs, AWS provides the RegEx for us

  • Next, pre-populate the columns using the prebuilt method

  • Verify and validate that everything looks good

  • Execute “Run Query”

So, I created an external table using the S3 bucket; the DDL is below:

CREATE EXTERNAL TABLE IF NOT EXISTS my_athena_db.my_first_athena_table (
  `request_timestamp` string,
  `elb_name` string,
  `request_ip` string,
  `request_port` int,
  `backend_ip` string,
  `backend_port` int,
  `request_processing_time` double,
  `backend_processing_time` double,
  `client_response_time` double,
  `elb_response_code` string,
  `backend_response_code` string,
  `received_bytes` bigint,
  `sent_bytes` bigint,
  `request_verb` string,
  `url` string,
  `protocol` string,
  `user_agent` string,
  `ssl_cipher` string,
  `ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'
) LOCATION 's3://athena-examples-us-west-2/elb/plaintext/';

Notice that the format above looks quite similar to a Hive SerDe table definition.

Now that I have my table created, let's take a look at what we can do with it:

select count(*) from my_athena_db.my_first_athena_table;

Pretty cool! I just extracted data from an S3 bucket without having any database or Hadoop cluster. I don't think performance will be that great with Athena, but depending on the use case this service might be great and cost effective.
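
Beyond a simple count, the columns defined in the DDL above lend themselves to a slightly richer sketch, for example breaking requests down by ELB response code (an illustrative query, not part of the tutorial):

select elb_response_code, count(*) as requests
from my_athena_db.my_first_athena_table
group by elb_response_code
order by requests desc;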

Now that we are on the subject of cost: with Athena, you only pay for the queries that you execute. There is no separate charge for compute or storage (standard S3 storage costs still apply).

Another thing I noticed is that the Athena front-end/UI looks quite similar to Cloudera HUE, but I can't find any documentation referencing that.

But from an offering standpoint, I think Athena will do great! Comments and feedback are welcome!

Note: Athena is currently only available in Northern Virginia, Ohio, and Oregon.

Resuming blogging!

Back in Action!

It has been a while since I posted anything; I have been busy learning new things. I will share my journey in a multi-part series.

I will first start with how the tech industry is changing and how individuals need to evolve and learn new skill sets.

My days have typically been very tedious, running at 100 MPH trying to get things done. Sometimes I feel exhausted and want to just give up, but my love for technology and solving problems always pulls me back.

Last summer, I completed my AWS Solutions Architect certification, and I am hoping to build on my Amazon skill set. I am amazed by the AWS platform and enjoy learning new things every day.

I will start posting about my learning and the challenges I have gone through. I would also like to start a two-way conversation and understand what readers would like me to focus on a bit more.

Hope to see you soon!

Thanks,

Big data SQL in action

Recently I have been engaged in implementing the Oracle Big Data SQL connector for a customer we are helping.

Here is a preview of the Big Data SQL connector, which can benefit from both Exadata smart scans and Hadoop's massive parallelization… I will publish an article in the future with steps to implement Big Data SQL with Exadata and the BDA appliance.
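
To give a flavor of what that setup looks like, below is a hedged sketch of an Oracle external table mapped to a Hive table through the ORACLE_HIVE access driver; the table name, column list, and access parameters are illustrative only, not the actual DDL used for this customer:

-- Illustrative Big Data SQL external table over a Hive table (names are hypothetical).
CREATE TABLE ora_flights_demo (
  flight_year  NUMBER,
  flight_month NUMBER,
  carrier      VARCHAR2(10),
  origin       VARCHAR2(5),
  dest         VARCHAR2(5)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=default.flights
  )
)
REJECT LIMIT UNLIMITED;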

BDSQL> select /*+ MONITOR */ /* TESTAHK_YR */ count(*) FROM ORA_FLIGHTS group by YEAR;

COUNT(*)
———-
5411843
5967780
5683047
5270893
5327435
7129270
5180048
5271359
5076925
22
7140596
5070501
7141922
5527884
5384721
1311826
5351983
7453215
5041200
5202096
6488540
5092157
7009728
Elapsed: 00:00:15.34

Execution Plan
———————————————————-
Plan hash value: 3679660899

————————————————————————————————————————————–
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
————————————————————————————————————————————–
| 0 | SELECT STATEMENT | | 22 | 88 | 204K (2)| 00:00:08 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | PCWP | |
| 4 | PX RECEIVE | | 22 | 88 | 204K (2)| 00:00:08 | Q1,01 | PCWP | |
| 5 | PX SEND HASH | :TQ10000 | 22 | 88 | 204K (2)| 00:00:08 | Q1,00 | P->P | HASH |
| 6 | HASH GROUP BY | | 22 | 88 | 204K (2)| 00:00:08 | Q1,00 | PCWP | |
| 7 | PX BLOCK ITERATOR | | 123M| 471M| 202K (1)| 00:00:08 | Q1,00 | PCWC | |
| 8 | EXTERNAL TABLE ACCESS STORAGE FULL| ORA_FLIGHTS | 123M| 471M| 202K (1)| 00:00:08 | Q1,00 | PCWP | |
————————————————————————————————————————————–

Note
—–
– Degree of Parallelism is 2 because of table property
Statistics
———————————————————-
293 recursive calls
100 db block gets
302 consistent gets
8 physical reads
0 redo size
995 bytes sent via SQL*Net to client
563 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
23 rows processed

BDSQL>
