Predictive Hacks

How to create tables and query data in Redshift Spectrum from S3

aws

In this tutorial, we will show you how to create several tables in Redshift Spectrum from data stored in S3. Finally, we will perform queries on the tables that we have created. Note that Redshift Spectrum is similar to Athena, since both services are for running SQL queries on S3 data.

Redshift Service

The first thing that we need to do is to go to Amazon Redshift and create a cluster. In my case, the Redshift cluster is running.

S3 Bucket

We will need to get the data from S3. For this example, we have used the following bucket but we provide you the data which are in json format.

Create the Database and the Tables

What we need to do is to go to Redshift Cluster, and then go to the SQL Editor and then click on the “Connect to database“.

Then:

And voilà our database:

Create External Schema

We can run the following query in order to create an external schema called users_data.

create external schema users_data
from data catalog
database 'users'
iam_role 'arn:aws:iam::282924389374:role/SpectrumRole'
create external database if not exists;

You hit Run and then you will be able to see the schema called users_data which is empty since we have not created any tables yet.

Create External Table

Now let’s create a new external table called names under users_data schema by taking data from S3.

create external table users_data.names(
  id_name varchar(32),
  id_value varchar(64),
  gender varchar(16),
  name_title varchar(32),
  name_first varchar(64),
  name_last varchar(64)
)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';

Note that our data are in JSON format, that is why we use the SERDE serialization. Once we run the query, the external table “names” is available under our schema.

Let’s confirm that the data actually exists by running a “select *” statement.

We will follow the same logic by creating other tables, such as location, age, contact and picture.

create external table users_data.location(
    id_name varchar(32),
    id_value varchar(32),
    location_street_number int,
    location_street_name varchar(64),
    location_city varchar(32),
    location_state varchar(32),
    location_country varchar(32),
    location_postcode varchar(32),
    location_coordinates_latitude varchar(64),
    location_coordinates_longitude varchar(64),
    location_timezone_offset varchar(32),
    location_timezone_description varchar(32),
    nat varchar(16)
)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';

create external table users_data.age(
    id_name varchar(32),
    id_value varchar(32),
    dob_date varchar(32),
    dob_age int,
    registered_date varchar(32),
    registered_age int
)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';

create external table users_data.contact(
    id_name varchar(32),
    id_value varchar(32),
    email varchar(32),
    phone varchar(32),
    cell varchar(32)
)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';

create external table users_data.picture(
    id_name varchar(32),
    id_value varchar(32),
    picture_large varchar(64),
    picture_medium varchar(64),
    picture_thumbnail varchar(64)
)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://users-data-282924389374';

As we can see, all the tables are under our schema.

Run a Query on Multiple Tables

Now, we will run a query by joining all the tables.

select 
    names.name_first as first_name, 
    names.name_last as last_name, 
    location.location_state as state, 
    age.dob_age as age, 
    contact.cell as cell, 
    picture.picture_large as picture
from users_data.names
    join users_data.location on users_data.names.id_value = users_data.location.id_value 
    join users_data.age on users_data.names.id_value = users_data.age.id_value 
    join users_data.contact on users_data.names.id_value = users_data.contact.id_value
    join users_data.picture on users_data.names.id_value = users_data.picture.id_value
order by age
limit 10;

References

[1] A Cloud Guru

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s