Predictive Hacks

# Tutorial in Bash (Unix Shell) Scripting

Bash is a concise, superfast, and robust scripting language for data and file manipulation. It’s a vital skill for building analytics pipelines in the cloud, favored by Linux users to work with data stored across multiple files. In this post, we provide practical examples of how you can create your own Bash functions to automate your job as a Data Scientist.

This tutorial is based entirely on the DataCamp course, which I found really helpful and important for every Data Scientist.

For this tutorial, it may be helpful to have a look at some basic Unix Commands, at some Basic Unix Examples as well as at a Cron Job Example.

## “Hello World” Bash Script Example

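Usually, a Bash script starts with a shebang such as #!/usr/bash, which tells the system to execute the file with the Bash shell. The path after #! must point to the Bash binary; on many systems it is /bin/bash, and you can check yours with which bash. Let’s say that we want to run a Bash script that prints “Hello World” and “Good Night World”.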

#!/usr/bash
echo "Hello World"
echo "Good Night World"



The script file can be saved as myscript.sh, where the file extension .sh indicates that this is a shell script. The extension is technically not needed if the first line has the shebang and the path to Bash. The script can be run in the terminal with bash myscript.sh or, if the first line contains the shebang and the file has been made executable (for example with chmod +x myscript.sh), simply with ./myscript.sh. Getting back to our example, in the terminal we type:

bash myscript.sh


and we get:

Hello World
Good Night World


Let’s look at another example, this time using pipes. Assume that we have a file called employees.csv with the UserID of each employee and their department. The first 5 lines are:

UserID,Department
1234,Data_Science
1235,Project_Management
1236,Data_Science
1237,Human_Resources

and we want to create a Bash script that returns the number of employees by department. Let’s call our script group.sh; it looks as follows:

#!/usr/bash
cat employees.csv | cut -d "," -f 2| tail -n +2 | sort | uniq -c | sort -nr


bash group.sh

10 Data_Science
5 Project_Management
3 Human_Resources
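The same pipeline can be wrapped in a function that takes the file name as an argument, so that employees.csv is not hard-coded. This is only a minimal sketch: the function name count_by_department is hypothetical and not part of the original script, and it assumes a header line and the department in the second column.

```shell
# Hypothetical helper: count rows per department in a CSV file.
# Assumes the first line is a header and column 2 holds the department.
count_by_department () {
  # Skip the header, keep column 2, then count and rank the values
  tail -n +2 "$1" | cut -d "," -f 2 | sort | uniq -c | sort -nr
}
```

It would be called as count_by_department employees.csv. Reading the file with tail directly also avoids the extra cat process.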


## STDIN-STDOUT-STDERR

In Bash scripting, there are 3 ‘streams’, each identified by a file descriptor number:

• STDIN (standard input). A stream of data into the program. It has file descriptor 0.
• STDOUT (standard output). A stream of data out of the program. It has file descriptor 1.
• STDERR (standard error). Errors in your program. It has file descriptor 2.
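As a rough illustration, the two output streams can be redirected independently: > redirects STDOUT and 2> redirects STDERR. The function name report and the temporary files below are made up for this sketch.

```shell
# Hypothetical function that writes one line to each output stream
report () {
  echo "all good"             # goes to STDOUT (stream 1)
  echo "something broke" >&2  # goes to STDERR (stream 2)
}

# Send each stream to its own temporary file
out_file=$(mktemp)
err_file=$(mktemp)
report > "$out_file" 2> "$err_file"
```

After this runs, the first file holds only the regular message and the second only the error message.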

#### How to use exit codes in scripts

Every command finishes with an exit code, stored in $?: 0 means success and any non-zero value means failure. The script below checks the exit code of cat and reports the result:

#!/bin/bash

cat file.txt

if [ $? -eq 0 ]
then
  echo "The script ran ok"
  exit 0
else
  echo "The script failed" >&2
  exit 1
fi

## Arguments

Bash scripts take arguments specified when making the execution call of the script. ARGV is a term to describe all the arguments that are fed into the script. Each argument can be accessed via the $ notation: the first as $1, the second as $2 and so on. Finally, $@ and $* give all the arguments in ARGV, and $# gives the length (i.e. the number) of arguments.
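One difference worth knowing: inside double quotes, "$@" expands to one word per argument, while "$*" joins all arguments into a single word. A small sketch; count_args, with_at and with_star are hypothetical helper functions used only for illustration.

```shell
# Hypothetical helper: print how many arguments it received
count_args () {
  echo $#
}

# Forward the arguments in the two different ways
with_at ()   { count_args "$@"; }  # each argument stays a separate word
with_star () { count_args "$*"; }  # all arguments are joined into one word
```

Calling with_at "one two" three prints 2, whereas with_star "one two" three prints 1. For this reason "$@" is usually the safer choice when passing arguments on.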

Example. Let’s consider the script args.sh:

#!/usr/bash
# Echo the first and second ARGV arguments
echo $1
echo $2

# Echo out the entire ARGV array
echo $@

# Echo out the size of ARGV
echo "There are $# arguments"

And let’s run:

bash args.sh one two three four five

We get:

one
two
one two three four five
There are 5 arguments

## Basic Variables in Bash

Similar to other languages, you can assign variables with the equals notation like:

# do not use spaces around the equals sign
firstname='George'
lastname='Pipis'

# you can reference them with the $ notation
echo "Hello" $firstname $lastname

Hello George Pipis
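When other text follows a variable name directly, curly braces mark where the name ends. A minimal sketch (the filename variable and the _report.txt suffix are invented for the example):

```shell
firstname='George'

# Without braces, Bash would look for a variable named firstname_report
filename="${firstname}_report.txt"
echo "$filename"
```

Here the braces make Bash expand firstname and then append the literal suffix.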


In Bash, different quotation marks mean different things, both when creating variables and when printing.

• Single quotes: the shell interprets what is between them literally.
• Double quotes: the shell interprets the content literally except for $ and backticks (`). In other words, inside double quotes whatever follows a dollar sign is treated as a variable.
• Backticks: create a shell within a shell. The shell runs the command and captures its STDOUT back into a variable.

A good example is the command date. For example, you can type:

datenow="the date now is `date`"
echo $datenow

the date now is Sat Apr 4 20:16:43 DST 2020

An alternative way to call a shell within a shell, in our example for the date, is by typing:

datenow="the date now is $(date)"

echo $datenow
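The three behaviours can be compared side by side. This is only a sketch; the variable names are chosen for the illustration.

```shell
name='World'

single='Hello $name'             # single quotes: $name is kept literally
double="Hello $name"             # double quotes: $name is expanded
substituted="Year: $(date +%Y)"  # $(...) runs date and captures its STDOUT

echo "$single"
echo "$double"
echo "$substituted"
```

The first echo prints the text Hello $name unchanged, the second prints Hello World, and the third prints the current year after the label.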


## Numeric Variables in Bash

We can add two integers in the following ways:

expr 4 + 6
echo $((4+6))

# only the bc method allows operations with decimal numbers
echo "4+6" | bc

Example: Build a Bash script which takes the temperature in Fahrenheit as an argument and converts it to Celsius using the formula:

C = (F - 32) x (5/9)

Let’s call our script script.sh and get the degrees in Celsius of 100 Fahrenheit.

# Get first ARGV into variable
temp_f=$1

# Subtract 32
temp_f2=$(echo "scale=2;$temp_f - 32" | bc)

# Multiply by 5/9 and print
temp_c=$(echo "scale=2;$temp_f2 * 5 / 9" | bc)

echo $temp_c

Call the script:

bash script.sh 100

And the output that we get:

37.77

## Arrays in Bash

There are two types of arrays in Bash:

• An array
  • ‘Normal’ numerical-indexed structure.
  • Called a ‘list’ in Python or ‘vector’ in R.
• An associative array
  • Similar to a normal array, but with key-value pairs, not numerical indexes.
  • Similar to Python’s dictionary or R’s list.
  • Note: This is available in Bash 4 onwards.

You can create an array in two ways:

# declare without adding elements
declare -a my_array

# create and add elements at the same time
my_array=(1 2 3)

Remember: no spaces around the equals sign and no commas between elements!

You can return all array elements using array[@]. Note that Bash requires curly brackets around the array name when you want to access these properties. You can access an array element using square brackets. Notice that Bash uses zero-indexing for arrays. You can append to arrays using array+=(elements).

Example:

my_array=(1 2 3 10)
echo ${my_array[@]}

# the length of the array
echo ${#my_array[@]}

# access the fourth element of the array
echo ${my_array[3]}

# change the first element of the array to 5
my_array[0]=5
echo ${my_array[0]}

# slice an array: N is the starting index and M is how many elements to return
# my_array[@]:N:M
echo ${my_array[@]:1:2}

# append an element
my_array+=(20)
echo ${my_array[@]}

1 2 3 10
4
10
5
2 3
5 2 3 10 20

Examples with associative arrays:

# Create empty associative array
declare -A city_details
city_details=([city_name]="New York" [population]=10000000)
echo ${city_details[city_name]} # index using key to return a value

# return all the keys
echo ${!city_details[@]}
A WHILE loop repeats its body for as long as its condition holds. For example, to print the numbers 1 to 3:

x=1
while [ $x -le 3 ]
do
  echo $x
  ((x+=1))
done

1
2
3

## Build a CASE statement

The structure of the CASE statement is:

case 'STRING' in
  PATTERN1)
  COMMAND1;;
  PATTERN2)
  COMMAND2;;
  *)
  DEFAULT COMMAND;;
esac

Let’s write a Bash script script.sh which takes a string as input and reports whether it is a weekday, a weekend day or not a day.

# Create a CASE statement matching the first ARGV element
case $1 in
# Match on all weekdays
Monday|Tuesday|Wednesday|Thursday|Friday)
echo "It is a Weekday!";;
# Match on all weekend days
Saturday|Sunday)
echo "It is a Weekend!";;
# Create a default
*)
echo "Not a day!";;
esac


bash script.sh Monday

It is a Weekday!
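CASE patterns are not limited to fixed strings; they can also contain the wildcard *. A minimal sketch that classifies a file name by its extension (file_type is a hypothetical function, not part of the original post):

```shell
# Hypothetical helper: classify a file name by its extension
file_type () {
  case "$1" in
    *.py)        echo "Python script" ;;
    *.sh|*.bash) echo "Shell script" ;;
    *)           echo "Unknown" ;;
  esac
}
```

For instance, file_type train.py prints Python script, and any name that matches no pattern falls through to the default branch.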


Exercise

The folder my_r_scripts contains R scripts of machine learning models. Our task is to move the tree-based models to a folder called tree_models/ and to remove the KNN and logistic regression models.

Solution

We will use the CASE statement

# Use a FOR loop for each file in 'my_r_scripts/'
for file in my_r_scripts/*
do
# Create a CASE statement for each file's contents
case $(cat "$file") in
# Match on tree and non-tree models
*"Random Forest"*|*GBM*|*XGBoost*)
mv "$file" tree_models/ ;;
*KNN*|*Logistic*)
rm "$file" ;;
# Create a default
*)
echo "Unknown model in $file" ;;
esac
done

## Functions in Bash

The syntax of Bash functions is:

function_name () {
  # your code
  return # optional: sets the function's exit status
}

Or alternatively:

function function_name {
  # your code
  return # optional: sets the function's exit status
}

Example: Let’s write a function that iterates over Python files in a folder and prints out their names.

# Create function
function print_file_names () {
# Loop through files with glob expansion
for file in myfolder/*.py
do
# Print the file name
echo "I found this file: $file in the folder"
done
}

# Call the function
print_file_names
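Note that return in Bash only sets the function's exit status (an integer from 0 to 255). To hand data back to the caller, a function usually prints it to STDOUT and the caller captures it with $(...). A small sketch with two hypothetical helper functions, is_even and double_it:

```shell
# Hypothetical helper: exit status 0 (success) when the number is even
is_even () {
  if [ $(( $1 % 2 )) -eq 0 ]; then
    return 0
  else
    return 1
  fi
}

# Hypothetical helper: hands data back by printing it to STDOUT
double_it () {
  echo $(( $1 * 2 ))
}

# The exit status drives the if; the printed value is captured with $(...)
if is_even 4; then
  result=$(double_it 4)
fi
echo "$result"
```

Because 4 is even, the body of the if runs and the script prints 8.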



Exercise

Write a Bash function in a bash script which returns the current day.

Solution

# Create function
what_day_is_it () {

# Parse the results of date
current_day=$(date | cut -d " " -f1)

# Echo the result
echo $current_day
}

# Call the function
what_day_is_it


bash script.sh

Sun
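Cutting the first field works when date prints the weekday first, which can depend on the system's locale settings. As an alternative sketch, date can be asked for the abbreviated weekday directly with the +%a format string:

```shell
# Variant of the function above that requests the weekday explicitly
what_day_is_it () {
  # %a is the abbreviated weekday name, e.g. Sun
  date +%a
}

current_day=$(what_day_is_it)
echo "$current_day"
```

This avoids parsing the default output of date altogether.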


### Passing arguments into Bash functions

You can pass arguments into functions using the $1, $2 notation as we saw earlier. You can also use $@, $* and $#.
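Inside a function these variables refer to the function's own arguments, not to the script's. One possible use, sketched here with a hypothetical function show_pair, is checking $# before doing any work:

```shell
# Hypothetical helper: requires exactly two arguments
show_pair () {
  if [ $# -ne 2 ]; then
    echo "show_pair needs exactly two arguments" >&2
    return 1
  fi
  echo "first=$1 second=$2"
}
```

Calling show_pair 456 632 prints first=456 second=632, while calling it with one argument writes a message to STDERR and returns a non-zero status.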

Example

Create a function that takes two numbers as inputs and returns their ratio.

# Create a function
function return_ratio() {

# Calculate the ratio using bc
ratio=$(echo "scale=4; $1 / $2" | bc)

# Return the calculated ratio
echo $ratio
}

# Call the function with 456 and 632 and echo the result
return_test=$(return_ratio 456 632)
echo "456 out of 632 as a ratio is $return_test"



Example

Write a function that sums up an array:

# Create a function with a local base variable
function sum_array () {
local sum=0
# Loop through, adding to base variable
for number in "$@"
do
sum=$(echo "$sum + $number" | bc)
done
# Echo back the result
echo $sum
}
# Call function with array
test_array=(14 12 23.5 16 19.34)
total=$(sum_array "${test_array[@]}")
echo "The sum of the test array is $total"

bash script.sh

The sum of the test array is 84.84

