Getting started with AWS Athena – Part 3

In the previous blog (Part 2), I created two tables using JSON and CSV formats. In this post (Part 3) I will talk about how one can explore a dataset, query larger data with predicate filtering, and run some basic inner joins using Athena. I will also compare the performance against a Hadoop cluster and AWS EMR.

For this benchmark I am comparing the following platforms:

  • AWS EMR (1 master, 4 core nodes [m3.xlarge])
  • On-Prem Hadoop cluster (4 nodes)
    • Hive
    • Impala
    • Hive+Spark
  • AWS Athena

First I need to set up my tables, again using a similar method to the previous blog; I simply generated my DDL and created an “external” table on top of my S3 dataset.

Before I create the tables, I should give readers some context about the dataset. I downloaded the data from data.gov and am using the “Consumer Complaint” dataset. For accessibility, here is the direct link to the dataset:

https://catalog.data.gov/dataset/consumer-complaint-database

Data.gov provides quite an extensive amount of open data which can be used for benchmarks and data discovery. I downloaded the CSV-formatted file and converted it to a JSON file. I did some testing with the JSON file, but the numbers don’t seem to be accurate, so I will not include them in this post for now.

DDL for Text Table


 CREATE EXTERNAL TABLE IF NOT EXISTS default.Consumer_Complaint_csv (
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 `complaint_id` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://athena1s3bucket/csv/';

Great, I have my table created. Now let’s execute a basic query to make sure I can access the data from the S3 bucket.

select count(*) from Consumer_Complaint_csv;
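
To get a quick feel for the columns themselves, a simple preview query works as well (just a sanity check, not part of the benchmark; the column list below is only an example):

select product, company, state, zip_code
from Consumer_Complaint_csv
limit 10;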

I’ve created similar tables on AWS EMR and the on-prem Hadoop cluster, using the DDL below:

CREATE EXTERNAL TABLE default.Consumer_Complaint_csv
(
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 `complaint_id` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION '/tmp/Consumer/csv' 
;
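
Since this is an external table, the CSV file just needs to land under the LOCATION path. As a minimal sketch on the Hive side (the local path below is hypothetical), the file can be loaded like this:

LOAD DATA LOCAL INPATH '/home/hadoop/Consumer_Complaints.csv'
INTO TABLE default.Consumer_Complaint_csv;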

HDFS Cluster config:

Now that I have my tables created in all environments, let’s start executing some queries for benchmarking purposes. I used the queries below to extract data, and for consistency the structure is the same across the board.

Q1  – Simple count

select count(*) from consumer_complaint_csv;

Q2 – Count with predicate filter and grouping

select count(*), zip_code from consumer_complaint_csv where state = 'TX' 
group by zip_code having count(*) > 20 ;

Q3 – Self inner join with predicate filter

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
group by a.product, a.company , a.issue 
having count(*) > 50;

Q4 – Self inner join with predicate filter and in list

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
and a.product in ('Debt collection','Credit card')
group by a.product, a.company , a.issue 
having count(*) > 50;

Looking at the queries, there is nothing fancy here, just simple SQL. My goal is to measure performance and see whether AWS Athena holds up to its promise and is performant enough. I’m sure I could get better performance with Parquet or ORC files, but the goal here is to see whether the service works. I can say that I am impressed; not having to worry about what is under the hood or about infrastructure makes it a good tool.
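
For reference, converting the CSV table into Parquet on the Hive/EMR side could look something like the sketch below (the Parquet table name is just an example, and I have not benchmarked this path here); Athena could then point an external table at the resulting files:

CREATE TABLE default.Consumer_Complaint_parquet
STORED AS PARQUET
AS
SELECT * FROM default.Consumer_Complaint_csv;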

Now let’s look at the numbers:

(Benchmark results chart: query timings for Athena, EMR, and the on-prem Hadoop cluster)

Note: All the timings above are in seconds.

One thing to note is that the dataset size is 500 MB for Q3 and Q4 due to the self join; for Q1 and Q2 the dataset size is 263 MB.

Conclusion

On the performance front, it is not bad but not great. Keeping in mind that I don’t have to pay for the underlying infrastructure, only for the queries I execute, that’s bang for the buck!

Overall I like the performance and I will certainly leverage Athena for my future designs.

I am not saying one should ditch the Hadoop cluster or EMR and start using Athena for ongoing operations. I think Athena has its place in the toolkit; it can be a good starting point for data discovery and for understanding data when one does not yet know its quality.

I hope this post helps you understand a bit more about AWS Athena as a service. Do give it a try yourself and let me know your thoughts and comments.

Thanks!

Author: Abdul H Khan

Trying to be cloudy!
