We will show you a typical example of an ETL task using bash scripting. The task includes the following steps:
- Download a gzip file using the wget command
- Unzip the file using the gunzip command
- Keep the first 4 columns of the file using the cut command
- Change the delimiter of the file using the tr command
- Load the data into a PostgreSQL table using the copy command
In this case, the bash script, called cp-access-log.sh, is the following:
# cp-access-log.sh
# This script downloads the file 'web-server-access-log.txt.gz'
# from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/".
# The script then extracts the .txt file using gunzip.
# The .txt file contains the timestamp, latitude, longitude
# and visitor id, among other fields.
# It transforms the field delimiter from "#" to "," and saves the result to a CSV file,
# then loads the data from the CSV file into the table 'access_log' in the PostgreSQL database.

# Download the access log file
wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz"

# Unzip the file to extract the .txt file.
gunzip -f web-server-access-log.txt.gz

# Extract phase
echo "Extracting data"

# Extract the columns 1 (timestamp), 2 (latitude), 3 (longitude) and
# 4 (visitorid)
cut -d"#" -f1-4 web-server-access-log.txt > extracted-data.txt

# Transform phase
echo "Transforming data"

# Read the extracted data and replace the "#" delimiters with commas.
tr "#" "," < extracted-data.txt > transformed-data.csv

# Load phase
echo "Loading data"

# Send the instructions to connect to 'template1' and
# copy the file to the table 'access_log' through a command pipeline.
echo "\c template1;\COPY access_log FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV HEADER;" | psql --username=postgres --host=localhost
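
Note that the \COPY step assumes the table access_log already exists in the database. A minimal sketch of creating it is shown below; the column names and types are assumptions based on the four extracted fields, so adjust them to match your actual data:

# Create the target table once before the first run.
# Column names and types below are assumptions, not part of the original script.
echo "CREATE TABLE access_log(
    timestamp TIMESTAMP,
    latitude FLOAT,
    longitude FLOAT,
    visitorid CHAR(37)
);" | psql --username=postgres --host=localhost template1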
Schedule your Script
Assume that you want to schedule this script to run every day using cron. First, make the script executable by changing its permissions:
chmod +x cp-access-log.sh
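
Before scheduling, it is a good idea to run the script once by hand to confirm the whole pipeline works end to end, for example:

# Run the ETL script manually and sanity-check its output.
./cp-access-log.sh
wc -l transformed-data.csv    # quick check that the CSV file was produced and is non-empty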
Then, open the crontab editor:
crontab -e
and enter the schedule. Let's say we want to run the script every day at 1 a.m.:
0 1 * * * path/cp-access-log.sh
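
Cron runs jobs with a minimal environment and no terminal attached, so it is common practice to use the script's absolute path and to redirect its output to a log file for troubleshooting. A variant of the same entry (the /home/project paths here are placeholders, use your own):

# Run every day at 1 a.m., appending stdout and stderr to a log file.
0 1 * * * /home/project/cp-access-log.sh >> /home/project/cp-access-log.log 2>&1

You can verify the scheduled entry afterwards with crontab -l.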
Finally, save the edits and close the editor. The new ETL job is now scheduled and will run in production!