Predictive Hacks

Tutorial in Bash (Unix Shell) Scripting

Bash is a concise, superfast, and robust scripting language for data and file manipulation. It’s a vital skill for building analytics pipelines in the cloud, favored by Linux users to work with data stored across multiple files. In this post, we provide practical examples of how you can create your own Bash functions to automate your job as a Data Scientist.

This tutorial is based totally on the DataCamp Course which I found really helpful and important for every Data Scientist.

For this tutorial, it may be helpful to have a look at some basic Unix Commands, at some Basic Unix Examples as well as at a Cron Job Example.

“Hello World” Bash Script Example

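Usually, a Bash script starts with a shebang such as #!/bin/bash, which tells the system to execute the file with the Bash shell. The exact path depends on where Bash is installed on your machine (you can check with which bash); #!/usr/bin/env bash is a common portable alternative. Let’s say that we want to run a Bash script that prints “Hello World” and “Good Night World”.

#!/bin/bash
echo "Hello World"
echo "Good Night World"
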

The script file can be saved as myscript.sh, where the file extension .sh indicates that this is a shell script. The extension is technically not needed if the first line has the shebang and the path to Bash. You can run the script in the terminal with bash myscript.sh or, if the first line is the shebang and the file is executable (chmod +x myscript.sh), simply with ./myscript.sh. Getting back to our example, in the terminal we type:

bash myscript.sh

and we get:

Hello World
Good Night World

Let’s give another example, using pipes. Assume that we have a file called employees.csv with the UserID of the employees and their Department. The first 5 lines look like this:

UserID,Department
1234,Data_Science
1235,Project_Management
1236,Data_Science
1237,Human_Resources

We want to create a Bash script that returns the number of employees by department. Let’s call our script group.sh; it will be as follows:

#!/bin/bash
cat employees.csv | cut -d "," -f 2 | tail -n +2 | sort | uniq -c | sort -nr

We run it:

bash group.sh

and we get:

10 Data_Science
5 Project_Management
3 Human_Resources
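To see what each stage of the pipeline contributes, here is a minimal self-contained sketch. The sample rows and the temporary file are made up for illustration and are not the real employees.csv; the final awk step is an optional extra that only tidies the leading spaces printed by uniq -c.

```shell
# Hypothetical sample data written to a temporary file
csv=$(mktemp)
cat > "$csv" <<'EOF'
UserID,Department
1234,Data_Science
1235,Project_Management
1236,Data_Science
1237,Human_Resources
1238,Data_Science
EOF

# cut      : keep the 2nd comma-separated field (the department)
# tail     : skip the header row
# sort     : put identical departments next to each other so uniq can count them
# uniq -c  : prefix each distinct department with its count
# sort -nr : largest count first
# awk      : tidy the leading spaces that uniq -c adds
counts=$(cut -d "," -f 2 "$csv" | tail -n +2 | sort | uniq -c | sort -nr | awk '{print $1, $2}')
echo "$counts"

rm -f "$csv"
```

Passing the file name straight to cut avoids the extra cat process; the result is the same.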

STDIN-STDOUT-STDERR

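In Bash scripting, there are 3 ‘streams’:

  • STDIN (standard input). A stream of data into the program. It is file descriptor 0.
  • STDOUT (standard output). A stream of data out of the program. It is file descriptor 1.
  • STDERR (standard error). Errors in your program. It is file descriptor 2.

Separately, every command returns an exit code when it finishes: 0 for success and a non-zero value (often 1) for failure. The exit code of the last command is stored in $?.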
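As a minimal sketch of how the output streams can be told apart, the hypothetical function emit below writes one line to each of them, and redirections using the file descriptor numbers 1 (STDOUT) and 2 (STDERR) pick them apart.

```shell
# Hypothetical helper: one line to STDOUT, one line to STDERR
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Discard STDERR (descriptor 2) and capture only STDOUT
out_only=$(emit 2>/dev/null)

# Point STDERR at the capture first, then discard STDOUT (the order matters)
err_only=$(emit 2>&1 >/dev/null)

echo "captured from STDOUT: $out_only"
echo "captured from STDERR: $err_only"
```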

How to use exit codes in scripts

#!/bin/bash

cat file.txt 

if [ $? -eq 0 ]
then
  echo "The script ran ok"
  exit 0
else
  echo "The script failed" >&2
  exit 1
fi
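The same idea can be wrapped in a function, where return plays the role of exit. This is only a sketch: check_readable is an illustrative name, and it tests the command directly in the if instead of reading $? afterwards, which is a common shorthand for the same check.

```shell
# Illustrative helper: succeed (0) if the file can be read, fail (1) otherwise
check_readable() {
  if cat "$1" > /dev/null 2>&1; then
    echo "The file was read ok"
    return 0
  else
    echo "Reading the file failed" >&2
    return 1
  fi
}

tmpfile=$(mktemp)

# $? holds the exit code of the function that just ran
check_readable "$tmpfile" && echo "exit code: $?"
check_readable "$tmpfile.does_not_exist" || echo "exit code: $?"

rm -f "$tmpfile"
```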

Arguments

Bash scripts take arguments specified when making the execution call of the script. ARGV is a term for all the arguments that are fed into the script. Each argument can be accessed via the $ notation: the first as $1, the second as $2, and so on. Finally, $@ and $* give all the arguments in ARGV, and $# gives the length (i.e. the number) of arguments.

Example. Let’s consider the args.sh:

#!/bin/bash
# Echo the first and second ARGV arguments
echo $1
echo $2

# Echo out the entire ARGV array
echo $@

# Echo out the size of ARGV
echo "There are $# arguments"
 

And let’s run:

bash args.sh one two three four five

We get:

one
two
one two three four five
There are 5 arguments
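One detail worth a short sketch: quoting "$@" keeps an argument that contains spaces as a single item. The function show_args is a hypothetical stand-in for a script; functions read their arguments with the same $1, $@ and $# notation.

```shell
# Hypothetical stand-in for a script: report the argument count, then each argument
show_args() {
  echo "count: $#"
  for arg in "$@"; do
    echo "arg: $arg"
  done
}

# "two three" is quoted, so it arrives as one argument
result=$(show_args one "two three" four)
echo "$result"
```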

Basic Variables in Bash

Similar to other languages, you can assign variables with the equals notation:

# do not put spaces around the equals sign
firstname='George'
lastname='Pipis'
# you can reference them with the $
echo "Hello" $firstname $lastname
Hello George Pipis

In Bash, different quotation marks mean different things, both when creating variables and when printing.

  • Single quotes: The shell interprets what is between them literally.
  • Double quotes: The shell interprets what is between them literally, except for $ and backticks (`). In other words, double quotes treat whatever follows a dollar sign as a variable.
  • Backticks: Create a shell within a shell. The shell runs the command and captures its STDOUT back into a variable. A good example is the date command. For example, you can type:

datenow="the date now is `date`"

echo $datenow
the date now is Sat Apr  4 20:16:43 DST 2020

An alternative way to call a shell within a shell, in our example for the date, is the $( ) notation:

datenow="the date now is $(date)"

echo $datenow
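The three behaviours can be compared side by side. The variable names in this small sketch are made up for illustration.

```shell
name="world"

single='Hello $name'   # single quotes: kept literally, no expansion
double="Hello $name"   # double quotes: $name is expanded
# $( ) runs a command in a subshell and substitutes its STDOUT
shouted="Hello $(echo "$name" | tr '[:lower:]' '[:upper:]')"

echo "$single"
echo "$double"
echo "$shouted"
```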

Numeric Variables in Bash

We can add two integers in the following ways:

expr 4 + 6
echo $((4+6))
# only the bc method allows operations with decimal numbers
echo "4+6" | bc
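Because expr and $(( )) work on integers only, division silently drops the fractional part. A short sketch with made-up values:

```shell
a=7
b=2

sum=$((a + b))
quotient=$((a / b))     # integer division: the fractional part is dropped
remainder=$((a % b))    # modulo gives what is left over
expr_sum=$(expr "$a" + "$b")

echo "sum=$sum quotient=$quotient remainder=$remainder expr_sum=$expr_sum"
```

If you need 3.5 rather than 3 here, that is the case for bc.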

Example: Build a Bash script that takes the temperature in Fahrenheit as an argument and converts it to Celsius using the formula:

C = (F - 32) x (5/9)

Let’s call our script script.sh and get the degrees in Celsius of 100 Fahrenheit:

# Get first ARGV into variable
temp_f=$1

# Subtract 32
temp_f2=$(echo "scale=2; $temp_f - 32" | bc)

# Multiply by 5/9 and print
temp_c=$(echo "scale=2; $temp_f2 * 5 / 9" | bc)

echo $temp_c

Call the script:

bash script.sh 100

And the output that we get:

37.77
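The two bc calls can also be folded into one expression. The function name f_to_c is an illustrative choice, and the sketch assumes bc is installed.

```shell
# Illustrative one-step version of the conversion (assumes bc is available)
f_to_c() {
  # Multiply by 5 before dividing by 9 so that only the final division is cut to 2 decimals
  echo "scale=2; ($1 - 32) * 5 / 9" | bc
}

f_to_c 100
f_to_c 212
```

The order of operations matters: with scale=2, computing 5 / 9 first would cut it to .55 before the multiplication and skew the result.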

Arrays in Bash

There are two types of arrays in Bash:

  • An array
    • ‘Normal’ numerical-indexed structure.
    • Called a ‘list’ in Python or ‘vector’ in R.
  • An associative array
    • Similar to a normal array, but with key-value pairs, not numerical indexes
    • Similar to Python’s dictionary or R’s list
    • Note: This is available in Bash 4 onwards

You can create an array in two ways:

# declare without adding elements
declare -a my_array

# create and add elements at the same time
my_array=(1 2 3)

Remember: no spaces around the equals sign and no commas between elements!

You can return all array elements using array[@]. Note that Bash requires curly brackets around the array name when you want to access these properties, for example ${my_array[@]}. You can access an array element using square brackets. Notice that Bash uses zero-indexing for arrays. You can append to an array using array+=(elements). Example:

my_array=(1 2 3 10)
echo ${my_array[@]}

# the length of the array
echo ${#my_array[@]}

# access the fourth element of the array
echo ${my_array[3]}

# change the first element of the array to 5
my_array[0]=5
echo ${my_array[0]}

# slice an array with my_array[@]:N:M, where N is the starting index and M is how many elements to return
echo ${my_array[@]:1:2}

# append an element
my_array+=(20)
echo ${my_array[@]}

and we get:

1 2 3 10
4
10
5
2 3
5 2 3 10 20
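When array elements can contain spaces, quoting the expansion as "${array[@]}" keeps each element intact. The following sketch uses a made-up fruits array to show this together with length, last-element access and slicing.

```shell
# Made-up example data; the second element contains a space
fruits=("apple" "banana split" "cherry")
fruits+=("date")

count=${#fruits[@]}
last=${fruits[$((count - 1))]}

# Copy a slice (2 elements starting at index 1) into a new array
middle=("${fruits[@]:1:2}")

# Quoting "${fruits[@]}" yields one loop iteration per element
for fruit in "${fruits[@]}"; do
  echo "fruit: $fruit"
done

echo "count=$count last=$last middle has ${#middle[@]} elements"
```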

Examples with associative arrays:

# Declare an associative array, then add the elements
declare -A city_details
city_details=([city_name]="New York" [population]=10000000)

echo ${city_details[city_name]} # index using a key to return a value

# return all the keys (their order is not guaranteed)
echo ${!city_details[@]}

New York
city_name population
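A common pattern is to loop over the keys and look up each value. This is a sketch with a made-up capitals array; it needs Bash 4 or later, and the output is piped through sort because the key order is not guaranteed.

```shell
# Made-up example data (associative arrays need Bash 4+)
declare -A capitals=([France]="Paris" [Japan]="Tokyo")

# Add one more key-value pair
capitals[Greece]="Athens"

# ${!capitals[@]} expands to the keys; sort makes the order predictable
listing=$(for country in "${!capitals[@]}"; do
  echo "$country -> ${capitals[$country]}"
done | sort)

echo "$listing"
echo "number of entries: ${#capitals[@]}"
```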

IF statement

A basic IF statement in Bash has the following structure:

if [ condition ]; then
    # some code
else
    # some code (the else branch is optional)
fi

Notice that there are spaces between the square brackets and the conditional elements, and there is a semicolon after the closing bracket ]; The most common conditional operators are:

  • = is equal (for strings)
  • != is not equal (for strings)
  • > is greater (for numbers, use it inside double parentheses, e.g. (( x > 3 )))
  • < is less (for numbers, use it inside double parentheses, e.g. (( x < 3 )))
  • -eq is equal
  • -ne is not equal
  • -lt is less than
  • -le is less than or equal to
  • -gt is greater than
  • -ge is greater than or equal to
  • -e if the file exists
  • -s if the file exists and has a size greater than zero
  • -r if the file exists and is readable
  • -w if the file exists and is writable
  • && for AND
  • || for OR

For more conditional flags, you can have a look at the Bash Conditional Expressions section of the Bash manual.
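A few of these flags in action, as a minimal sketch. The function names describe_number and describe_file are made up for illustration, and elif is used to chain more than two branches.

```shell
# Numeric flags: -lt and -eq
describe_number() {
  if [ "$1" -lt 0 ]; then
    echo "negative"
  elif [ "$1" -eq 0 ]; then
    echo "zero"
  else
    echo "positive"
  fi
}

# File flags: -e (exists) and -s (size greater than zero), chained with &&
describe_file() {
  if [ -e "$1" ] && [ -s "$1" ]; then
    echo "non-empty"
  elif [ -e "$1" ]; then
    echo "empty"
  else
    echo "missing"
  fi
}

describe_number -5
describe_number 0
describe_number 42

scratch=$(mktemp)
describe_file "$scratch"
echo "some text" > "$scratch"
describe_file "$scratch"
rm -f "$scratch"
describe_file "$scratch"
```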

Exercise

Assume that there is a folder called employees/ with a text file for every employee of the form:

Name: George Pipis
Hiring Date: 2010
Salary: 30000
Department: Data Science

You want to write a script that takes the employee’s text file as input and moves it to a new folder according to the Hiring Date. If the Hiring Date is greater than or equal to 2018, the file goes to the new_employees/ folder; otherwise it goes to the old_employees/ folder.

Run the script for George_Pipis.txt.

Solution

The script.sh

# Extract Hiring Date from first ARGV element
hd=$(grep Date "$1" | sed 's/[^0-9]*//g')

# Conditionally move into new_employees folder
if [ "$hd" -ge 2018 ]; then
    mv "$1" new_employees/
fi

# Conditionally move into old_employees folder
if [ "$hd" -lt 2018 ]; then
    mv "$1" old_employees/
fi

And we run it:

bash script.sh employees/George_Pipis.txt

FOR Loop in Bash

The basic structure in Bash is:

for x in 1 2 3
do
    echo $x
done
1
2
3

You can also use brace expansion, which has the form {START..STOP..INCREMENT}:

for x in {1..5..2}
do
    echo $x
done
1
3
5

There is also the three-expression syntax:

for ((x=2;x<=4;x+=2))
do
    echo $x
done
2
4
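As a small worked sketch of the last two forms, the loops below accumulate a running total with brace expansion and collect the visited values with the three-expression syntax. The variable names are made up for illustration.

```shell
# Brace expansion: add up the numbers 1 to 5
total=0
for i in {1..5}; do
  total=$((total + i))
done
echo "Sum of 1..5: $total"

# Three-expression syntax: start at 2, stop after 10, step by 4
steps=""
for ((i = 2; i <= 10; i += 4)); do
  steps="$steps$i "
done
echo "Visited: $steps"
```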

Glob expansions

for employee in employees/*
do
    echo $employee
done

Shell-within-a-shell to FOR loop

We could loop through the result of a call to shell-within-a-shell:

for employee in $(ls employees/ | grep -i 'georg')
do
    echo $employee
done

George_Papadopoulos.txt
George_Pipis.txt
John_Georgiades.txt
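Looping over the output of ls splits file names on whitespace, so a name containing a space would be cut into pieces. A hedged alternative is to loop over a glob and filter inside the loop. In this sketch the temporary folder and the file names are made up, and the ${name,,} lower-casing used for the case-insensitive match assumes Bash 4 or later.

```shell
# Made-up demo folder; one file name contains a space
demo_dir=$(mktemp -d)
touch "$demo_dir/George_Pipis.txt" "$demo_dir/Georgia Brown.txt" \
      "$demo_dir/John_Georgiades.txt" "$demo_dir/Maria_Smith.txt"

matches=()
for path in "$demo_dir"/*; do
  name=${path##*/}                      # strip the folder part
  if [[ ${name,,} == *georg* ]]; then   # compare in lower case (Bash 4+)
    matches+=("$name")
  fi
done

printf '%s\n' "${matches[@]}"
rm -r "$demo_dir"
```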

Exercise

You want to write a script that prints out all the files ending with .py in the folder my_code/.

Solution

for file in my_code/*.py
do
    echo $file
done

Exercise

You have many Python scripts in a folder called my_python_tests/ and some of them are about text mining tasks. You can assume that all Python scripts which contain the code import re can be classified as text mining tasks. Your task is to write a Bash script which will move all the text mining scripts to a folder called text_mining/.

Solution

The script.sh can be as follows:

# Create a FOR statement on files in directory
for file in my_python_tests/*.py
do  
    # Create IF statement using grep
    if grep -q 'import re' $file ; then
        # Move wanted files to text_mining/ folder
        mv $file text_mining/
    fi
done

WHILE statement syntax

Iterations continue until the condition no longer holds.

  • Use the word while instead of for
  • Surround the condition in square brackets
    • Use the same flags for numerical comparison as in IF statements
  • Multiple conditions can be chained, or combined inside double brackets, just like in IF statements, with && and ||
  • Ensure your loop terminates!

x=1
while [ $x -le 3 ]
do
    echo $x
    ((x+=1))
done
1
2
3
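Chaining two conditions with && can be sketched as follows: the loop stops as soon as either the counter passes 10 or the running total reaches 10. The variable names are made up for illustration.

```shell
x=1
total=0

# Keep going only while BOTH conditions hold
while [ "$x" -le 10 ] && [ "$total" -lt 10 ]; do
  total=$((total + x))
  x=$((x + 1))
done

echo "stopped with x=$x and total=$total"
```

Here the total reaches 10 after adding 1, 2, 3 and 4, so the second condition ends the loop long before x reaches 10.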

Build a CASE statement

The structure of the CASE statement is:

case 'STRING' in
    PATTERN1)
    COMMAND1;;
    PATTERN2)
    COMMAND2;;
    *)
    DEFAULT COMMAND;;
esac

Let’s write a Bash script script.sh that takes a string as input and reports whether it is a weekday, a weekend day, or not a day.

# Create a CASE statement matching the first ARGV element
case $1 in
  # Match on all weekdays
  Monday|Tuesday|Wednesday|Thursday|Friday)
  echo "It is a Weekday!";;
  # Match on all weekend days
  Saturday|Sunday)
  echo "It is a Weekend!";;
  # Create a default
  *) 
  echo "Not a day!";;
esac

bash script.sh Monday
It is a Weekday!
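CASE patterns are glob patterns, so they can also match on parts of a string such as a file extension. A minimal sketch, with file_type as a made-up function name:

```shell
# Classify a file name by its extension using glob patterns
file_type() {
  case "$1" in
    *.py)
      echo "Python" ;;
    *.R|*.r)
      echo "R" ;;
    *.sh)
      echo "Bash" ;;
    *)
      echo "Unknown" ;;
  esac
}

file_type model.py
file_type analysis.R
file_type notes.txt
```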

Exercise

The folder my_r_scripts/ contains R scripts for machine learning models. Our task is to move the tree-based models to a folder called tree_models/ and to remove the KNN and Logistic regression models.

Solution

We will use the CASE statement

# Use a FOR loop for each file in 'my_r_scripts/'
for file in my_r_scripts/*
do
    # Create a CASE statement for each file's contents
    case $(cat $file) in
      # Match on tree and non-tree models
      *"Random Forest"*|*GBM*|*XGBoost*)
      mv $file tree_models/ ;;
      *KNN*|*Logistic*)
      rm $file ;;
      # Create a default
      *) 
      echo "Unknown model in $file" ;;
    esac
done


Functions in Bash

The syntax of Bash functions is:

function name () {
    # your code
    return # optional: sets the exit status of the function
}

Or alternatively

function function_name {
    # your code
    return # optional: sets the exit status of the function
}
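Unlike in many other languages, return in Bash does not hand back a value; it only sets the exit status of the function. To get data out of a function, echo it and capture it with $( ). A short sketch with made-up function names:

```shell
# Communicates through its exit status: 0 (success) if the number is even
is_even() {
  [ $(( $1 % 2 )) -eq 0 ]   # the status of the last command becomes the function's status
}

# Communicates through STDOUT: the caller captures the echoed value
double() {
  echo $(( $1 * 2 ))
}

if is_even 4; then
  echo "4 is even"
fi

result=$(double 21)
echo "double 21 = $result"
```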

Example: Let’s write a function that iterates over python files in a folder and prints out their names.

# Create function
function print_file_names () {
  # Loop through files with glob expansion
  for file in myfolder/*.py
  do
    # Print the file name
    echo "I found this file: $file in the folder"
  done
}

# Call the function
print_file_names 


Exercise

Write a Bash function in a bash script which returns the current day.

Solution

# Create function
what_day_is_it () {

  # Parse the results of date
  current_day=$(date | cut -d " " -f1)

  # Echo the result
  echo $current_day
}

# Call the function
what_day_is_it

bash script.sh
Sun

Passing arguments into Bash functions

You can pass arguments into functions using the $1, $2 notation as we saw earlier. You can also use $@, $* and $#.

Example

Create a function that takes two numbers as inputs and returns their ratio.

# Create a function 
function return_ratio() {

  # Calculate the ratio using bc
  ratio=$(echo "scale=4; $1 / $2" | bc)

  # Return the calculated ratio
  echo $ratio
}

# Call the function with 456 and 632 and echo the result
return_test=$(return_ratio 456 632)
echo "456 out of 632 as a ratio is $return_test"


Example

Write a function that sums up an array:

# Create a function with a local base variable
function sum_array () {
  local sum=0
  # Loop through, adding to base variable
  for number in "$@"
  do
    sum=$(echo "$sum + $number" | bc)
  done
  # Echo back the result
  echo $sum
}
# Call function with array
test_array=(14 12 23.5 16 19.34)
total=$(sum_array "${test_array[@]}")
echo "The sum of the test array is $total"

bash script.sh
The sum of the test array is 84.84
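The local keyword used for sum above keeps a variable private to the function. Without it, variables assigned inside a function are global and can overwrite values in the rest of the script. A minimal sketch with made-up names:

```shell
counter=0

# No 'local': this changes the global counter
bump_global() {
  counter=$((counter + 1))
}

# With 'local': this counter exists only inside the function
bump_local() {
  local counter=100
  counter=$((counter + 1))
  echo "inside bump_local: $counter"
}

bump_global
bump_local
echo "global counter: $counter"
```

After both calls the global counter is 1: bump_global changed it, while the counter inside bump_local was a separate variable.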
