Getting started with AWS Athena – Part 2

In previous blog, I talked about how to get going with Athena as service.  In this post, we will create two type of tables:

  • Table on existing text data under S3 bucket
  • Table on S3 bucket, formatted by json

From looking at the structure AWS is converting their S3 storage as HDFS external storage via HiveServer2 and HiveMeta store. So, suppose  you have table in Hadoop and you need to migrate the data to S3 or you are landing the data into S3 directly; features like Hive SerDe can transform your data-set on fly when accessing from S3 bucket. Interesting feature, Now let see it in action:

I assume you already have some data in S3 bucket or copy data from from source to S3 bucket.  I might do another post explaining how to extend your HDFS cluster with S3.

First table that I will create is based on comma separated textfile:

DDL for Text Table

CREATE EXTERNAL TABLE IF NOT EXISTS default.kinesis_stream_test (
 `trans_id` string,
 `trans_num` string,
 `source_ip` string,
 `kinesis_stream_event` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://myS3bucket/kinesis_stream/text/'

Interesting thing to notice here is that I am not using “create table wizard”, I can simple generate my DDL and create external table on top of my data.  All I need is my S3 end-point and my table is ready.  I used similar method for both Text dataset and JSON formatted tables.

athena-text-table

Sample dataset staged in S3 bucket:

ff6d3ee7-cd2b-4d62-8a5c-3e50ff96f120,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
92abd107-333b-4749-8b4c-64fa78a6f8d3,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
a67deb57-36c6-48cd-99b3-0cb89814c1bf,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
96dafd35-b8d1-4334-bb01-73796e823be5,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
2eec7e53-e696-4a93-a541-1f3f2757e804,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
b16e4fa4-8971-4dd9-9701-20a60f603618,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
1e8be437-66d8-4262-9409-5638f7305c33,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
b7345fde-73e5-4fb2-945d-7189af290ea9,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006
ac324869-aa5f-4c7b-865d-210ec79a14d8,201701210601,ip-172-31-36-113,AWS Kinesis stream test # 2017012106011484980006

DDL for JSON Table

CREATE EXTERNAL TABLE IF NOT EXISTS default.kinesis_stock_stream (
 `change` double,
 `price` double,
 `sector` string,
 `ticker_symbol` string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = '1'
) LOCATION 's3://aeg-awsfirehose/2017/01/';
Advertisements

Author: Abdul H Khan

Trying to be cloudy!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s