Predictive Hacks

How to run SQL on S3 files with AWS Athena

aws athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in S3 using SQL. In this tutorial, we will show how to connect Athena with S3 and start using SQL for analyzing the files. Athena is serverless, and you pay only for the queries you run. It scales automatically which means that the queries run fast even with large datasets.

Scenario

Assume that you have a csv file at your computer and you want to create a table in Athena and start running queries on it. For simplicity, we will work with the iris.csv dataset. The steps that we are going to follow are:

  • Create an S3 Bucket
  • Upload the iris.csv dataset to the S3 Bucket
  • Set up a query location in S3 for the Athena queries
  • Create a Database in Athena
  • Create a table
  • Run SQL queries

Create an S3 Bucket

You go to services and search for the Amazon S3. then you click on the orange button “Create bucket

How to run SQL on S3 files with AWS Athena 1

I created a bucket called “gpipis-iris-dataset

Upload the iris.csv to the S3 Bucket

You click on the orange button “Upload” and you upload the iris.csv file from your local computer. As you can see, in the gpipis-iris-dataset bucket there is the iris.csv file. Notice that our iris.csv file does not contain headers

How to run SQL on S3 files with AWS Athena 2

Notice that we could have created an S3 bucket and uploaded the file using Boto3 or the AWS CLI.

Set up a Query Location

In the console search for the service “Athena”. Once you are in Athena, go to setting and defining a location for the queries.

How to run SQL on S3 files with AWS Athena 3

In the setting define the Query result location. I chose the “s3://gpipis-query-results-bucket/sql/“. Click “Save

How to run SQL on S3 files with AWS Athena 4

Create a Database

To create a database named my_iris_db, enter the following CREATE DATABASE statement.

CREATE DATABASE mydatabase

and choose Run Query. Finally confirm that the catalog display refreshes and mydatabase appears in the Database list in the navigation pane on the left.

How to run SQL on S3 files with AWS Athena 5

Notice that we could create a database alternative as we will show below.

Create a Table

We can create a table with an SQL statement or using the console. In our database, we click on the “Create table” and then “from S3 bucket data“.

How to run SQL on S3 files with AWS Athena 6

Then in the database, we choose “my_iris_db” that we created before, but we could also create a new one at this step. Then the table name can be “iris” and the location is our S3 bucket where is the iris.csv file, in my case is the “s3://gpipis-iris-dataset/

How to run SQL on S3 files with AWS Athena 7

Then we choose the format of the file. In our case is csv.

How to run SQL on S3 files with AWS Athena 8

Then we define the column names and their data type.

How to run SQL on S3 files with AWS Athena 9

Finally, we do not add a partition and we click on the Create table

At the same time, we can see the query that was generated and run under the hood.

How to run SQL on S3 files with AWS Athena 10
CREATE EXTERNAL TABLE IF NOT EXISTS my_iris_db.iris (
  `sepal.length` float,
  `sepal.width` float,
  `petal.length` float,
  `petal.width` float,
  `variety` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://gpipis-iris-dataset/'
TBLPROPERTIES ('has_encrypted_data'='false');
 

As we can see, the table iris appeared in the “my_iris_db

Run SQL queries

Now we are in a position to run any query that we want. Let’s say that we want to get the average of every sepal length and width for every variety.

select variety, avg("sepal.length") as sepal_length, avg("sepal.width") as sepal_width,
avg("petal.length") as petal_length, avg("petal.width") as petal_width
from my_iris_db.iris
group by variety
How to run SQL on S3 files with AWS Athena 11

Notice that when we created the table, it assumes that the csv file does not contain the headers. Otherwise, we need to define it in the SQL statement by adding the TBLPROPERTIES ('skip.header.line.count'='1')

Bonus Part: How to Convert CSV to Parquet

If you want to create a parquet table instead of csv you can run the following query.

CREATE TABLE my_iris_db.iris_parquet with (
  external_location = 's3://gpipis-test-bucket/parquet/',
  format = 'PARQUET'
)
as select * from iris 

For more information please have a look at AWS documentation

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

2 thoughts on “How to run SQL on S3 files with AWS Athena”

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

snowflake
Miscellaneous

How to Schedule Tasks in Snowflake

We have started a series of Snowflake tutorials, like How to Get Data from Snowflake using Python, How to Load