Getting Started with AWS Athena – Part 1

At last AWS ReInvent, AWS announced new service called “Athena” (Greek virgin goddess of reason). There has been a lot fuss about AWS S3 service, as I am getting more and more comfortable with AWS platform, I thought let put Athena to test.

In this five part series, we will do deep-dive how the service works out-of-box and with some customizations in mix.

So, let’s get our hands in the mix

  • Login to your AWS console
  • Search for Athena service
  • Once on landing page, you get option to create ELB table
  • You will get to tutorial page, let’s use that for our initial table creation.

  • Wizard will ask you to create new table

  • To keep structure more simple, I created new metadata database for all my Athena tables
    • DB Name = my_athena_db
    • Table name = my_first_athena_table
    • S3 example location = s3://athena-examples-us-east-2/elb/plaintext

  • Next, since we are going to mine weblog – AWS provide RegEx for us

  • Next, pre-populate the columns using prebuilt method

  • Verify and validate everything looks good

  • Execute “Run Query”

So, I created external table using S3 bucket, DDL is below:

CREATE EXTERNAL TABLE IF NOT EXISTS my_athena_db.my_first_athena_table (
  `request_timestamp` string,
  `elb_name` string,
  `request_ip` string,
  `request_port` int,
  `backend_ip` string,
  `backend_port` int,
  `request_processing_time` double,
  `backend_processing_time` double,
  `client_response_time` double,
  `elb_response_code` string,
  `backend_response_code` string,
  `received_bytes` bigint,
  `sent_bytes` bigint,
  `request_verb` string,
  `url` string,
  `protocol` string,
  `user_agent` string,
  `ssl_cipher` string,
  `ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'
) LOCATION 's3://athena-examples-us-west-2/elb/plaintext/';

Notice that above format looks quite similar to Hive SerDe table.

Now that I have my table created, let take a look what we can do with it..

select count(*) from my_athena_db.my_first_athena_table;

Pretty cool! I just extracted data from S3 bucket without having any database/Hadoop cluster. Although, I don’t think performance will be that great with Athena, but I think depending on use-case this service might be great and cost effective.

Now that we are on subject of cost, with Athena, you only pay for Queries that you execute. there is no cost compute or storage(Storage cost still apply for S3).

Another feature that I noticed, Athena front-end/UI looks too similar to Cloudera HUE, but I can’t find any document that is referencing that.

But, from offering aspect, I think Athena will do great! comments or feedback are welcome!

Note: Athena is currently only available in North Virginia, Ohio and Oregon.

Advertisements

Author: Abdul H Khan

Trying to be cloudy!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s