Amazon Athena is an interactive query service that makes it easy to analyze data directly in S3 using SQL. In this tutorial, we will show how to connect Athena with S3 and start using SQL to analyze the files. Athena is serverless, and you pay only for the queries you run. It also scales automatically, which means that queries run fast even on large datasets.
Assume that you have a CSV file on your computer and you want to create a table in Athena and start running queries on it. For simplicity, we will work with the iris.csv dataset. The steps that we are going to follow are:
- Create an S3 Bucket
- Upload the iris.csv dataset to the S3 Bucket
- Set up a query location in S3 for the Athena queries
- Create a Database in Athena
- Create a table
- Run SQL queries
Create an S3 Bucket
Go to Services, search for Amazon S3, and then click the orange “Create bucket“ button.
I created a bucket called “gpipis-iris-dataset“.
Upload the iris.csv to the S3 Bucket
Click the orange “Upload” button and upload the iris.csv file from your local computer. As you can see, the gpipis-iris-dataset bucket now contains the iris.csv file. Notice that our iris.csv file does not contain a header row.
Set up a Query Location
In the console, search for the “Athena” service. Once you are in Athena, go to Settings to define a location for the query results.
In the settings, set the Query result location. I chose “s3://gpipis-query-results-bucket/sql/“. Click “Save“.
Create a Database
To create a database named my_iris_db, enter the following CREATE DATABASE statement:
CREATE DATABASE my_iris_db
and choose Run Query. Finally, confirm that the catalog display refreshes and my_iris_db appears in the Database list in the navigation pane on the left.
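As a quick check from the query editor itself, you can also list the databases in the catalog; a minimal example (not part of the original steps):

```sql
-- List all databases in the Data Catalog; the new database should appear
SHOW DATABASES;

-- Or narrow the listing down with a regular expression
SHOW DATABASES LIKE 'my_iris.*';
```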
Notice that, alternatively, we could create the database while creating the table, as we will show below.
Create a Table
We can create a table with a SQL statement or through the console. In our database, we click on “Create table” and then “from S3 bucket data“.
Then, for the database, we choose the “my_iris_db” that we created before, although we could also create a new one at this step. The table name can be “iris”, and the location is the S3 bucket that contains the iris.csv file; in my case, “s3://gpipis-iris-dataset/“.
Then we choose the format of the file; in our case, it is CSV.
Then we define the column names and their data type.
Finally, we do not add a partition, and we click on “Create table”.
At the same time, we can see the query that was generated and run under the hood.
CREATE EXTERNAL TABLE IF NOT EXISTS my_iris_db.iris (
  `sepal.length` float,
  `sepal.width` float,
  `petal.length` float,
  `petal.width` float,
  `variety` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://gpipis-iris-dataset/'
TBLPROPERTIES ('has_encrypted_data'='false');
As we can see, the table iris appeared in “my_iris_db“.
Run SQL queries
Now we are in a position to run any query that we want. Let’s say that we want to get the average of each of the four measurements for every variety.
SELECT variety,
       avg("sepal.length") AS sepal_length,
       avg("sepal.width") AS sepal_width,
       avg("petal.length") AS petal_length,
       avg("petal.width") AS petal_width
FROM my_iris_db.iris
GROUP BY variety
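As one more example of querying the table (not from the original post), we can count the observations per variety; note that the dotted column names need to be wrapped in double quotes, as above:

```sql
-- Count the rows per variety and look at the largest sepal length in each group
SELECT variety,
       count(*) AS n,
       max("sepal.length") AS max_sepal_length
FROM my_iris_db.iris
GROUP BY variety
ORDER BY variety;
```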
Notice that when we created the table, we assumed that the csv file does not contain a header row. Otherwise, we need to tell Athena to skip the first line by adding the 'skip.header.line.count'='1' property to the TBLPROPERTIES clause of the CREATE TABLE statement.
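For instance, a header-aware version of the earlier CREATE TABLE statement could look like this (a sketch; only the last table property differs):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_iris_db.iris (
  `sepal.length` float,
  `sepal.width` float,
  `petal.length` float,
  `petal.width` float,
  `variety` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',')
LOCATION 's3://gpipis-iris-dataset/'
TBLPROPERTIES (
  'has_encrypted_data' = 'false',
  -- skip the first line of every file, i.e. the header row
  'skip.header.line.count' = '1'
);
```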
Bonus Part: How to Convert CSV to Parquet
If you want to create a Parquet table instead of a CSV one, you can run the following CTAS (CREATE TABLE AS SELECT) query.
CREATE TABLE my_iris_db.iris_parquet
WITH (
  external_location = 's3://gpipis-test-bucket/parquet/',
  format = 'PARQUET'
) AS
SELECT * FROM my_iris_db.iris
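To sanity-check the new table, you could run a simple query against it; for example (assuming the CTAS statement above has completed):

```sql
-- The row count should match the source CSV table
SELECT count(*) AS n FROM my_iris_db.iris_parquet;
```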
For more information, please have a look at the AWS documentation.