In this tutorial, we will show you how to create several tables in Redshift Spectrum from data stored in S3. Finally, we will perform queries on the tables that we have created. Note that Redshift Spectrum is similar to Athena, since both services are for running SQL queries on S3 data.
Redshift Service
The first thing that we need to do is to go to Amazon Redshift and create a cluster. In my case, the Redshift cluster is running.
S3 Bucket
We will need to get the data from S3. For this example, we have used the following bucket but we provide you the data which are in json format.
Create the Database and the Tables
What we need to do is to go to Redshift Cluster, and then go to the SQL Editor and then click on the “Connect to database“.
Then:
And voilà our database:
Create External Schema
We can run the following query in order to create an external schema called users_data.
create external schema users_data from data catalog database 'users' iam_role 'arn:aws:iam::282924389374:role/SpectrumRole' create external database if not exists;
You hit Run and then you will be able to see the schema called users_data which is empty since we have not created any tables yet.
Create External Table
Now let’s create a new external table called names under users_data schema by taking data from S3.
create external table users_data.names( id_name varchar(32), id_value varchar(64), gender varchar(16), name_title varchar(32), name_first varchar(64), name_last varchar(64) ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://users-data-282924389374';
Note that our data are in JSON format, that is why we use the SERDE serialization. Once we run the query, the external table “names” is available under our schema.
Let’s confirm that the data actually exists by running a “select *” statement.
We will follow the same logic by creating other tables, such as location, age, contact and picture.
create external table users_data.location( id_name varchar(32), id_value varchar(32), location_street_number int, location_street_name varchar(64), location_city varchar(32), location_state varchar(32), location_country varchar(32), location_postcode varchar(32), location_coordinates_latitude varchar(64), location_coordinates_longitude varchar(64), location_timezone_offset varchar(32), location_timezone_description varchar(32), nat varchar(16) ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://users-data-282924389374'; create external table users_data.age( id_name varchar(32), id_value varchar(32), dob_date varchar(32), dob_age int, registered_date varchar(32), registered_age int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://users-data-282924389374'; create external table users_data.contact( id_name varchar(32), id_value varchar(32), email varchar(32), phone varchar(32), cell varchar(32) ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://users-data-282924389374'; create external table users_data.picture( id_name varchar(32), id_value varchar(32), picture_large varchar(64), picture_medium varchar(64), picture_thumbnail varchar(64) ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://users-data-282924389374';
As we can see, all the tables are under our schema.
Run a Query on Multiple Tables
Now, we will run a query by joining all the tables.
select names.name_first as first_name, names.name_last as last_name, location.location_state as state, age.dob_age as age, contact.cell as cell, picture.picture_large as picture from users_data.names join users_data.location on users_data.names.id_value = users_data.location.id_value join users_data.age on users_data.names.id_value = users_data.age.id_value join users_data.contact on users_data.names.id_value = users_data.contact.id_value join users_data.picture on users_data.names.id_value = users_data.picture.id_value order by age limit 10;
References
[1] A Cloud Guru