Introduction
I attended the Introduction to Designing Data Lakes in AWS course on Coursera, which included a lab about AWS Glue. I found it very useful, so I decided to share it here.
In this tutorial we will show how to:
- Define a database
- Configure a crawler to explore data in an Amazon S3 bucket
- Create tables
- Transform CSV files into Parquet and create a table for the result
- Query the data with Amazon Athena
Create an S3 Bucket
The first thing you need to do is create an S3 bucket. For this example I have created an S3 bucket called glue-aa60b120.
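If you prefer to script this step, here is a minimal boto3 sketch (this assumes you are running in us-east-1; in any other region you must also pass a CreateBucketConfiguration):

```python
import boto3

s3 = boto3.client("s3")

# Create the bucket that will later hold the Glue script, the
# temporary directory, the Parquet output, and the Athena results.
# Bucket names are globally unique, so substitute your own name.
s3.create_bucket(Bucket="glue-aa60b120")
```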
Discovering the Data
In this section we will create the Glue database, add a crawler and populate the database tables using a source CSV file.
a) Choose Services and search for AWS Glue. Choose Databases. Choose Add database. Paste/type in the following for the Database name:
nycitytaxi
Choose Create.
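If you would rather create the database programmatically, a minimal boto3 sketch of the same step:

```python
import boto3

glue = boto3.client("glue")

# Register an empty database in the Glue Data Catalog;
# the crawler will populate it with tables later.
glue.create_database(DatabaseInput={"Name": "nycitytaxi"})
```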
b) Choose Tables. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
c) Choose Add tables using a crawler. Paste in nytaxicrawler
for the Crawler name. Choose Next. Leave Data stores selected for Crawler source type. Choose Next.
d) Leave S3 selected for Choose a data store. Leave Specified path in my account selected under Crawl data in. For the Include path paste in the following:
s3://aws-tc-largeobjects/DEV-AWS-MO-Designing_DataLakes/week3/
This S3 bucket contains a data file with all the green taxi rides for January 2020.
Choose Next. Leave No selected under Add another data store. Choose Next again.
e) On the Choose an IAM role page, select Choose an existing IAM role and pick a role that has the required permissions (this tutorial uses AWSGlueServiceRoleDefault).
f) For Frequency leave Run on demand selected. Choose Next.
g) For Database choose the nycitytaxi
database. Choose Next. Choose Finish. Select the nytaxicrawler crawler and choose Run crawler. When the crawler has finished, one table has been added. You will see that Tables added now shows 1.
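Steps c) through g) can also be done with boto3; a minimal sketch, assuming the AWSGlueServiceRoleDefault role used later in this tutorial:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans the course's public S3 prefix and
# writes whatever tables it infers into the nycitytaxi database.
glue.create_crawler(
    Name="nytaxicrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="nycitytaxi",
    Targets={
        "S3Targets": [
            {"Path": "s3://aws-tc-largeobjects/DEV-AWS-MO-Designing_DataLakes/week3/"}
        ]
    },
)

# Crawlers with an on-demand schedule only run when started explicitly.
glue.start_crawler(Name="nytaxicrawler")
```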
h) After the crawler stops, choose Tables. Choose week3. This screen describes the table, including its schema, properties, and other valuable information. You can choose Edit schema if you want to take a look at the schema details.
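The inferred schema can also be inspected programmatically; a quick boto3 sketch:

```python
import boto3

glue = boto3.client("glue")

# Print the column names and types the crawler inferred.
table = glue.get_table(DatabaseName="nycitytaxi", Name="week3")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```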
Transform the data from CSV to Parquet
a) Under ETL at the left, choose Jobs. Choose Add job. Paste in the following for the Name:
nytaxi-csv-parquet
b) For the IAM role, choose your existing role from the drop-down. AWS Glue will auto-generate a script named nytaxi-csv-parquet.
c) For the S3 path where the script is stored, use your own S3 bucket instead of the auto-created one:
s3://glue-aa60b120/admin
Do the same for the Temporary directory:
s3://glue-aa60b120/temp
Choose Next.
d) Select week3 and choose Next. Leave Change schema selected and choose Next. Choose Create tables in your data target. Change the Data store to Amazon S3. Change the Format to Parquet. For Target path paste in your S3 bucket:
s3://glue-aa60b120/data
Choose Next.
e) Verify the schema mapping, and choose Save job and edit script. Close the Script editor tips window. Feel free to have a look at the script.
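The generated script is roughly of the following shape. This is a simplified sketch, not the exact code AWS Glue produces: the mappings list shows only a few illustrative columns, while the real script lists every column from the schema-mapping screen:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV-backed table the first crawler created.
source = glueContext.create_dynamic_frame.from_catalog(
    database="nycitytaxi", table_name="week3"
)

# Map source columns to target columns (illustrative subset).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("vendorid", "long", "vendorid", "long"),
        ("passenger_count", "long", "passenger_count", "long"),
        ("trip_distance", "double", "trip_distance", "double"),
        ("tip_amount", "double", "tip_amount", "double"),
    ],
)

# Write the result to your bucket as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://glue-aa60b120/data"},
    format="parquet",
)

job.commit()
```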
f) Choose Save and then Run job. On the Parameters pop-up, choose Run job. In the Logs tab you should only see null_fields[], and the Run job button will be highlighted again once the job finishes. If you are curious, you can find the Parquet file in your S3 bucket.
g) Choose Services and AWS Glue. Choose Crawlers.
h) Choose Add crawler. Paste in the following for the Crawler name:
nytaxiparquet
Choose Next. Leave Data stores selected for Crawler source type. Choose Next.
i) Leave S3 selected for Choose a data store. Leave Specified path in my account selected for Crawl data in. This time paste in your own bucket with the prefix where the Parquet file is located:
s3://glue-aa60b120/data
Choose Next.
j) Leave No selected on the Add another data store page. Choose Next.
k) On the Choose an IAM role page, select Choose an existing IAM role. Choose AWSGlueServiceRoleDefault from the drop-down. Choose Next.
l) Leave Run on demand selected for Frequency. Choose Next. For Database choose nycitytaxi from the drop down. Choose Next. Choose Finish.
Select the nytaxiparquet crawler and choose Run crawler. Once the crawler finishes, choose Databases and then Tables: there are now two tables, the original CSV table from the source bucket and the new Parquet table in your own S3 bucket.
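To confirm both tables programmatically, a quick boto3 sketch:

```python
import boto3

glue = boto3.client("glue")

# List every table in the database along with its storage location.
for table in glue.get_tables(DatabaseName="nycitytaxi")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```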
Feel free to look at the tables.
Analyze the data with Amazon Athena
a) Choose Services and search for Athena. You may need to choose Get Started. The Database drop-down will show that the nycitytaxi database is selected.
b) Choose the data table on the left, which is backed by the Parquet data.
c) Paste the following into the query editor:
SELECT * FROM nycitytaxi.data LIMIT 10;
Choose Save as.
For the Name and Description paste in the following:
taxidata
Choose Save.
d) Choose set up a query result location in Amazon S3 at the top. Again paste in your bucket:
s3://glue-aa60b120/sql
Choose Save.
e) Choose Run query.
You can now browse the results and see columns such as passenger_count, trip_distance, and tip_amount.
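The same query can also be run outside the console; a minimal boto3 sketch, reusing the result location configured above:

```python
import time
import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes the result set to the S3 location.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM nycitytaxi.data LIMIT 10;",
    QueryExecutionContext={"Database": "nycitytaxi"},
    ResultConfiguration={"OutputLocation": "s3://glue-aa60b120/sql"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# The first row of the result set holds the column headers.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```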