Bash is a concise, superfast, and robust scripting language for data and file manipulation. It’s a vital skill for building analytics pipelines in the cloud, favored by Linux users to work with data stored across multiple files. In this post, we provide practical examples of how you can create your own Bash functions to automate your job as a Data Scientist.
This tutorial is based entirely on the DataCamp course, which I found really helpful and important for every Data Scientist.
For this tutorial, it may be helpful to have a look at some basic Unix Commands, at some Basic Unix Examples as well as at a Cron Job Example.
“Hello World” Bash Script Example
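Usually, a Bash script starts with a shebang such as #!/usr/bash
which tells the system to execute the file using the Bash shell. The path after #! must point to where Bash is installed; on many systems that is /bin/bash, and you can check with which bash. Let’s say that we want to run a Bash script that prints “Hello World” and “Good Night World”.
#!/usr/bash
echo "Hello World"
echo "Good Night World"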
The script file can be saved as myscript.sh, where the file extension .sh indicates that this is a Bash script. The extension is technically not needed if the first line has the shebang and the path to Bash. The script can be run in the terminal using bash myscript.sh or, if the first line contains the shebang and the file has been made executable (chmod +x myscript.sh), by simply typing ./myscript.sh. Getting back to our example, in the terminal we type:
bash myscript.sh
and we get:
Hello World
Good Night World
Let’s give another example, using the pipe commands. Assume that we have a file called employees.csv
with the UserID of the Employees and their Department. Like (top 5 lines):
UserID,Department
1234,Data_Science
1235,Project_Management
1236,Data_Science
1237,Human_Resources
and we want to create a Bash script that returns the number of employees by department. Let’s call our script group.sh and write it as follows:
#!/usr/bash
cat employees.csv | cut -d "," -f 2 | tail -n +2 | sort | uniq -c | sort -nr
We run it:
bash group.sh
and we get:
10 Data_Science
5 Project_Management
3 Human_Resources
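The pipeline above can be tried without the real employees.csv. The following is a minimal sketch that builds a small temporary CSV from the sample rows shown earlier and wraps the pipeline in a function; the function name count_by_department and the temporary file are assumptions made for illustration only.

```shell
#!/bin/bash
# Build a small sample file so the example is self-contained.
csv=$(mktemp)
cat > "$csv" <<'EOF'
UserID,Department
1234,Data_Science
1235,Project_Management
1236,Data_Science
1237,Human_Resources
EOF

# Hypothetical helper: count rows per value of the second column.
count_by_department() {
    # drop the header, keep column 2, then count and rank the values
    tail -n +2 "$1" | cut -d "," -f 2 | sort | uniq -c | sort -nr
}

count_by_department "$csv"

# The first line of the ranking is the largest department.
top_line=$(count_by_department "$csv" | head -n 1)
echo "$top_line"

rm -f "$csv"
```

Here the file name is passed as an argument instead of being hard-coded, which makes the same pipeline reusable for other files.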
STDIN-STDOUT-STDERR
In Bash scripting, there are 3 ‘streams’:
- STDIN (standard input). A stream of data into the program. It has file descriptor 0.
- STDOUT (standard output). A stream of data out of the program. It has file descriptor 1.
- STDERR (standard error). Errors in your program. It has file descriptor 2.
Separately from the streams, every command finishes with an exit code, available in $?: 0 for success and a non-zero value (commonly 1) for failure.
How to use exit codes in scripts
#!/bin/bash
cat file.txt
if [ $? -eq 0 ]
then
    echo "The script ran ok"
    exit 0
else
    echo "The script failed" >&2
    exit 1
fi
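To see how the streams and the exit code can be handled separately, here is a minimal sketch. The function noisy is a hypothetical helper invented for this illustration; it writes one line to each output stream and then fails with exit code 3.

```shell
#!/bin/bash
# Hypothetical helper: one line to STDOUT, one to STDERR, then failure.
noisy() {
    echo "result line"          # goes to STDOUT (file descriptor 1)
    echo "warning line" >&2     # goes to STDERR (file descriptor 2)
    return 3
}

# Capture STDOUT only, discard STDERR, and keep the exit code.
if out=$(noisy 2>/dev/null); then
    status=0
else
    status=$?
fi
echo "stdout: $out"
echo "exit code: $status"

# Capture STDERR only: send STDERR to the capture, then STDOUT to /dev/null.
err=$(noisy 2>&1 >/dev/null) || true
echo "stderr: $err"
```

The order of the redirections in the last call matters: 2>&1 first points STDERR at the place STDOUT currently goes (the capture), and only then is STDOUT sent to /dev/null.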
Arguments
Bash scripts take arguments specified when making the execution call of the script. ARGV is a term to describe all the arguments that are fed into the script. Each argument can be accessed via the $ notation: the first as $1, the second as $2 and so on. Finally, $@ and $* give all the arguments in ARGV, and $# gives the length (i.e. the number) of arguments.
Example. Let’s consider the script args.sh:
#!/usr/bash
# Echo the first and second ARGV arguments
echo $1
echo $2
# Echo out the entire ARGV array
echo $@
# Echo out the size of ARGV
echo "There are $# arguments"
And let’s run:
bash args.sh one two three four five
We get:
one
two
one two three four five
There are 5 arguments
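$@ and $* look interchangeable in the output above, but they behave differently once they are wrapped in double quotes. The sketch below illustrates this; count_args and forward are hypothetical helper names used only for this example.

```shell
#!/bin/bash
# Hypothetical helper: prints how many arguments it received.
count_args() {
    echo $#
}

# Hypothetical wrapper that forwards its arguments in two different ways.
forward() {
    local with_at with_star
    with_at=$(count_args "$@")    # each argument stays a separate word
    with_star=$(count_args "$*")  # all arguments are joined into one word
    echo "$with_at $with_star"
}

forward "New York" Athens         # prints: 2 1
```

In practice, "$@" is usually the safer choice when passing arguments along, because arguments containing spaces are kept intact.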
Basic Variables in Bash
Similar to other languages, you can assign variables with the equals notation like:
# do not put spaces around the equals sign
firstname='George'
lastname='Pipis'
# you can reference them with the $
echo "Hello" $firstname $lastname
Hello George Pipis
In Bash, using different quotation marks can mean different things, both when creating variables and when printing.
- Single quotes: the shell interprets what is between them literally.
- Double quotes: the shell interprets what is between them literally, except for $ and backticks (`). In other words, double quotes treat whatever follows a dollar sign as a variable.
- Backticks: create a shell within a shell. The shell runs the command and captures its STDOUT back into a variable. A good example is the date command. For example, you can type:
datenow="the date now is `date`"
echo $datenow
the date now is Sat Apr 4 20:16:43 DST 2020
An alternative way to call a shell within a shell, in our example the date command, is by typing:
datenow="the date now is $(date)"
echo $datenow
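The three quoting rules can be compared side by side. This is a minimal sketch; the variable names single, double and nested are chosen only for illustration.

```shell
#!/bin/bash
name='George'

single='Hello $name'            # single quotes: no expansion at all
double="Hello $name"            # double quotes: $name is expanded
nested="Year: $(date +%Y)"      # $(...) runs a command and captures its STDOUT

echo "$single"                  # prints: Hello $name
echo "$double"                  # prints: Hello George
echo "$nested"
```

Note that the command substitution only works inside double quotes (or unquoted); inside single quotes the text would be kept literally.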
Numeric Variables in Bash
We can add two integers in the following ways:
expr 4 + 6
echo $((4+6))
# only the bc method allows operations with decimal numbers
echo "4+6" | bc
Example: Build a Bash script which takes as an argument the temperature in Fahrenheit and converts it to Celsius using the formula:
C = (F – 32) x (5/9)
Let’s call our script script.sh and get the degrees in Celsius of 100 Fahrenheit:
# Get first ARGV into variable
temp_f=$1
# Subtract 32
temp_f2=$(echo "scale=2; $temp_f - 32" | bc)
# Multiply by 5/9 and print
temp_c=$(echo "scale=2; $temp_f2 * 5 / 9" | bc)
echo $temp_c
Call the script:
bash script.sh 100
And the output that we get:
37.77
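For comparison, it may help to see what happens when the same conversion is done with the built-in $(( )) arithmetic, which works on integers only. The sketch below is an illustration rather than a replacement for the bc version; f_to_c_int is a hypothetical function name.

```shell
#!/bin/bash
# Integer division discards the remainder.
echo $((7 / 2))          # prints: 3
echo $((7 % 2))          # prints: 1 (the remainder)

# Hypothetical helper: Fahrenheit to Celsius with integer arithmetic only.
f_to_c_int() {
    echo $(( ($1 - 32) * 5 / 9 ))
}

f_to_c_int 100           # prints: 37 (the bc version above gives 37.77)
f_to_c_int 212           # prints: 100
```

The multiplication is done before the division on purpose: dividing 5 by 9 first would give 0 in integer arithmetic.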
Arrays in Bash
There are two types of arrays in Bash:
- An array
  - ‘Normal’ numerical-indexed structure.
  - Called a ‘list’ in Python or ‘vector’ in R.
- An associative array
  - Similar to a normal array, but with key-value pairs, not numerical indexes
  - Similar to Python’s dictionary or R’s list
  - Note: This is available in Bash 4 onwards
You can create an array in two ways:
# declare without adding elements
declare -a my_array
# create and add elements at the same time
my_array=(1 2 3)
Remember: no spaces around the equals sign and no commas between elements!
You can return all array elements using array[@]. Note that Bash requires curly brackets around the array name when you want to access these properties. You can access an array element using square brackets. Notice that Bash uses zero-indexing for arrays. You can append to arrays using array+=(elements). Example:
my_array=(1 2 3 10)
echo ${my_array[@]}
# the length of the array
echo ${#my_array[@]}
# access the fourth element of the array
echo ${my_array[3]}
# change the first element of the array to 5
my_array[0]=5
echo ${my_array[0]}
# slice an array with ${my_array[@]:N:M}, where N is the starting index and M is how many elements to return
echo ${my_array[@]:1:2}
# append an element
my_array+=(20)
echo ${my_array[@]}
1 2 3 10
4
10
5
2 3
5 2 3 10 20
Examples with associative arrays:
# Create empty associative array
declare -A city_details
city_details=([city_name]="New York" [population]=10000000)
# index using key to return a value
echo ${city_details[city_name]}
# return all the keys
echo ${!city_details[@]}
New York
city_name population
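A common next step is to loop over the keys and look up each value. The following is a minimal sketch that reuses the city_details example; the extra country key is an assumption added for illustration. It needs Bash 4 or later.

```shell
#!/bin/bash
# Associative arrays need Bash 4 or later.
declare -A city_details=([city_name]="New York" [population]=10000000)

# Loop over the keys and look up each value.
# Note: the order in which the keys come back is not guaranteed.
for key in "${!city_details[@]}"; do
    echo "$key = ${city_details[$key]}"
done

# Add a new key and count the entries.
city_details[country]="USA"
echo "${#city_details[@]}"      # prints: 3
```

Quoting "${!city_details[@]}" keeps keys that contain spaces intact while looping.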
IF statement
A basic IF statement in Bash has the following structure:
if [ condition ]; then
# some code
else
# some code; the else statement is optional
fi
Notice that there are spaces between the square brackets and the conditional elements, and that there is a semi-colon after the closing bracket ];
The most common conditional flags are:
- = is equal (strings)
- != is not equal (strings)
- > is greater (strings)
- < is less (strings)
- -eq is equal (numbers)
- -ne is not equal
- -lt is less than
- -le is less than or equal to
- -gt is greater than
- -ge is greater than or equal to
- -e if the file exists
- -s if the file exists and has a size greater than zero
- -r if the file exists and is readable
- -w if the file exists and is writable
- && for AND
- || for OR
For more conditional flags you can have a look here.
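To make the flags above concrete, here is a minimal sketch that combines file tests with && and a numeric comparison. The function describe_file and the temporary file are assumptions made for illustration; the ! operator negates a test and elif adds a second branch.

```shell
#!/bin/bash
# Create a temporary file so the example is self-contained.
tmp_file=$(mktemp)
echo "some data" > "$tmp_file"

# Hypothetical helper: classify a file using the file-test flags.
describe_file() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ -s "$1" ] && [ -r "$1" ]; then
        echo "readable and non-empty"
    else
        echo "empty or unreadable"
    fi
}

describe_file "$tmp_file"            # prints: readable and non-empty
describe_file "/no/such/file"        # prints: missing

# Numeric comparison with -ge
count=5
if [ "$count" -ge 3 ]; then
    echo "count is at least 3"
fi

rm -f "$tmp_file"
```

Quoting "$1" and "$count" inside the brackets avoids syntax errors when a value is empty or contains spaces.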
Exercise
Assume that there is a folder called employees/ with a text file for every employee of the form:
Name: George Pipis
Hiring Date: 2010
Salary: 30000
Department: Data Science
You want to write a script which takes as input the txt file of an employee and moves it to a new folder according to the Hiring Date. If the Hiring Date is greater than or equal to 2018, the file goes to the new_employees/ folder, otherwise to the old_employees/ folder.
Run the script for George_Pipis.txt
Solution
The script.sh
# Extract Hiring Date from first ARGV element
hd=$(grep Date $1 | sed 's/[^0-9]*//g')
# Conditionally move into new_employees folder
if [ $hd -ge 2018 ]; then
    mv $1 new_employees/
fi
# Conditionally move into old_employees folder
if [ $hd -lt 2018 ]; then
    mv $1 old_employees/
fi
And we run it:
bash script.sh employees/George_Pipis.txt
FOR Loop in Bash
The basic structure in Bash is:
for x in 1 2 3
do
    echo $x
done
1
2
3
You can also use the brace expansion
which is {START..STOP..INCREMENT}
for x in {1..5..2}
do
    echo $x
done
1
3
5
There is also the three-expression syntax:
for ((x=2;x<=4;x+=2))
do
    echo $x
done
2
4
Glob expansions
for employee in employees/*
do
    echo $employee
done
Shell-within-a-shell to FOR loop
We could loop through the result of a call to shell-within-a-shell:
for employee in $(ls employees/ | grep -i 'georg')
do
    echo $employee
done
George_Pipis.txt
George_Papadopoulos.txt
John_Georgiades.txt
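One caveat with looping over the output of ls is that file names containing spaces are split into several words. A glob pattern avoids this. The sketch below is an illustration under assumed file names (created in a temporary folder), not part of the original exercise.

```shell
#!/bin/bash
# Build a throwaway folder so the example is self-contained.
demo_dir=$(mktemp -d)
touch "$demo_dir/George_Pipis.txt" "$demo_dir/John Georgiades.txt" "$demo_dir/Maria.txt"

# Glob expansion keeps each file name intact, even with spaces.
matches=0
for employee in "$demo_dir"/*[Gg]eorg*; do
    basename "$employee"
    matches=$((matches + 1))
done
echo "$matches"              # prints: 2

rm -r "$demo_dir"
```

The pattern [Gg]eorg plays the role of grep -i 'georg' for the first letter only; a fully case-insensitive match would need a different approach.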
Exercise
You want to write a script that prints out all the files ending with .py in the folder my_code/
Solution
for file in my_code/*.py
do
    echo $file
done
Exercise
You have many Python scripts in a folder called my_python_tests/
and some of them are about text mining tasks. You can assume that all Python scripts which contain the code import re
can be classified as text mining
tasks. Your task is to write a Bash script which will move all the text mining scripts to a folder called text_mining/
.
Solution
The script.sh
can be as follows:
# Create a FOR statement on files in directory
for file in my_python_tests/*.py
do
    # Create IF statement using grep
    if grep -q 'import re' $file ; then
        # Move wanted files to text_mining/ folder
        mv $file text_mining/
    fi
done
WHILE statement syntax
Iterations continue until the condition no longer holds.
- Use the word while instead of for
- Surround the condition in square brackets
- Use the same flags for numerical comparison as in IF statements
- Multiple conditions can be chained, or you can use double brackets just like in IF statements, along with && and ||
- Ensure your loop terminates!
x=1
while [ $x -le 3 ];
do
    echo $x
    ((x+=1))
done
1
2
3
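As an illustration of chaining two conditions with && inside double brackets, here is a minimal sketch. The variable names x and total are chosen for this example only.

```shell
#!/bin/bash
x=1
total=0

# Keep looping while BOTH conditions hold.
while [[ $x -le 10 && $total -lt 10 ]]; do
    total=$((total + x))
    x=$((x + 1))
done

echo "x=$x total=$total"     # prints: x=5 total=10
```

The loop stops as soon as the running total reaches 10, well before x reaches its own limit, which shows that either condition can end the loop.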
Build a CASE statement
The structure of the CASE statement is:
case 'STRING' in
    PATTERN1)
        COMMAND1;;
    PATTERN2)
        COMMAND2;;
    *)
        DEFAULT COMMAND;;
esac
Let’s write a Bash script script.sh which takes a string as input and reports whether it is a weekday, a weekend day or not a day.
# Create a CASE statement matching the first ARGV element
case $1 in
    # Match on all weekdays
    Monday|Tuesday|Wednesday|Thursday|Friday)
        echo "It is a Weekday!";;
    # Match on all weekend days
    Saturday|Sunday)
        echo "It is a Weekend!";;
    # Create a default
    *)
        echo "Not a day!";;
esac
bash script.sh Monday
It is a Weekday!
Exercise
The folder my_r_scripts/ contains R scripts for machine learning models. Our task is to move the tree-based models to a folder called tree_models/ and to remove the KNN and Logistic regression models.
Solution
We will use the CASE statement:
# Use a FOR loop for each file in 'my_r_scripts/'
for file in my_r_scripts/*
do
    # Create a CASE statement for each file's contents
    case $(cat $file) in
        # Match on tree and non-tree models
        *"Random Forest"*|*GBM*|*XGBoost*)
            mv $file tree_models/ ;;
        *KNN*|*Logistic*)
            rm $file ;;
        # Create a default
        *)
            echo "Unknown model in $file" ;;
    esac
done
Functions in Bash
The syntax of Bash functions is:
function name () {
    # your code
    return # optional: sets the exit status of the function
}
Or alternatively
function function_name {
    # your code
    return # optional: sets the exit status of the function
}
Example: Let’s write a function that iterates over python files in a folder and prints out their names.
# Create function
function print_file_names () {
    # Loop through files with glob expansion
    for file in myfolder/*.py
    do
        # Print the file name
        echo "I found this file: $file in the folder"
    done
}
# Call the function
print_file_names
Exercise
Write a Bash function in a bash script which returns the current day.
Solution
# Create function
what_day_is_it () {
    # Parse the results of date
    current_day=$(date | cut -d " " -f1)
    # Echo the result
    echo $current_day
}
# Call the function
what_day_is_it
bash script.sh
Sun
Passing arguments into Bash functions
You can pass arguments into functions using the $1, $2 notation as we saw earlier. You can also use $@, $* and $#.
Example
Create a function that takes two numbers as inputs and returns their ratio.
# Create a function
function return_ratio() {
    # Calculate the ratio using bc
    ratio=$(echo "scale=4; $1 / $2" | bc)
    # Return the calculated ratio
    echo $ratio
}
# Call the function with 456 and 632 and echo the result
return_test=$(return_ratio 456 632)
echo "456 out of 632 as a ratio is $return_test"
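One point worth making explicit: in Bash, return only sets a numeric exit status (0 to 255), so data is usually handed back by echoing it and capturing it with $(...), as the example above does. The sketch below separates the two roles. The function safe_ratio is hypothetical, and it uses integer arithmetic instead of bc to stay self-contained.

```shell
#!/bin/bash
# Hypothetical helper: data goes out via STDOUT, success/failure via return.
safe_ratio() {
    local numerator=$1 denominator=$2
    if [ "$denominator" -eq 0 ]; then
        echo "division by zero" >&2
        return 1                      # non-zero exit status signals failure
    fi
    echo $(( numerator * 100 / denominator ))   # integer percentage
}

if pct=$(safe_ratio 456 632); then
    echo "456 out of 632 is ${pct}%"  # prints: 456 out of 632 is 72%
fi

if ! safe_ratio 1 0 2>/dev/null; then
    echo "the call failed"
fi

# Variables declared with local do not leak out of the function.
echo "numerator outside: '${numerator:-}'"   # prints: numerator outside: ''
```

Using local inside functions is a sensible default, since variables assigned in a function are otherwise global to the script.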
Example
Write a function that sums up an array:
# Create a function with a local base variable
function sum_array () {
    local sum=0
    # Loop through, adding to base variable
    for number in "$@"
    do
        sum=$(echo "$sum + $number" | bc)
    done
    # Echo back the result
    echo $sum
}
# Call function with array
test_array=(14 12 23.5 16 19.34)
total=$(sum_array "${test_array[@]}")
echo "The sum of the test array is $total"
bash script.sh
The sum of the test array is 84.84