Predictive Hacks

Basic Unix Commands for Data Analysts

Data analysts and data scientists benefit from a basic knowledge of Unix commands. The goal of this post is to show, through examples, how shell commands can help with daily tasks. The first examples use the following file, eg1.csv:

ID,Name,Dept,Gender
1,George,DS,M
2,Billy,DS,M
3,Nick,IT,M
4,George,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,George,DS,M

Examples of Basic Unix Commands


Q: How to print the first or the last 3 rows of a file.

# The first
head -n 3 eg1.csv 

# The last
tail -n 3 eg1.csv

Q: How to skip the first line(s) or the last line(s).

Sometimes we want to skip the first line, which is usually the header. The command is:

# it skips the first line
tail -n +2 eg1.csv

# it skips the last 4 lines (negative counts are a GNU head extension)
head -n -4 eg1.csv

# tail -n +2 eg1.csv
1,George,DS,M
2,Billy,DS,M
3,Nick,IT,M
4,George,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,George,DS,M

Q: How to print the whole file.

# for the whole file
cat eg1.csv

# view the file page by page: press Space for the next page, or q to quit
less eg1.csv

Q: How to copy a file.

cp eg1.csv copy_eg1.csv

Q: How to rename a file.

mv copy_eg1.csv backup_eg1.csv

Q: How to remove a file.

rm backup_eg1.csv

Q: How to list information about the files in the working directory.

ls -lh

Q: How to check free disk space.

df -h

Q: How to check how much space one or more files or directories are using.

# total size of the current directory; pass file or directory names to measure those instead
du -sh

Q: How can I select columns from a file.

If you want to select columns, you can use the command cut. It has several options (use man cut to explore them), but the most common is something like:

cut -f 1-2,4 -d , eg1.csv

This means “select columns 1 through 2 and column 4, using comma as the separator”. cut uses -f (meaning “fields”) to specify columns and -d (meaning “delimiter”) to specify the separator.

This command returns:

ID,Name,Gender
1,George,M
2,Billy,M
3,Nick,M
4,George,M
5,Nikki,F
6,Claudia,F
7,Maria,F
8,Jimmy,M
9,Jane,F
10,George,M

Q: How can I exclude a column.

To exclude one or more columns, we do the opposite of selecting columns by adding the --complement option. For instance, let’s say that we want to exclude the second column:

cut --complement -f 2 -d , eg1.csv 
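
The --complement option is a GNU extension, so it may be missing from the cut shipped with other systems (macOS, for example). A minimal portable sketch is to list the columns to keep instead; an open-ended range such as 3- means “column 3 through the last one”. The snippet first recreates eg1.csv so that it runs on its own.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# keep column 1 and columns 3 onwards, i.e. everything except column 2
cut -d , -f 1,3- eg1.csv
```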

Q: How can I select lines containing specific values.

For example, let’s say that we want to select all lines containing the value “Sales”. Then the command is:

grep Sales eg1.csv
7,Maria,Sales,F
8,Jimmy,Sales,M
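
grep has a few flags that are handy with this kind of search. The sketch below, which first recreates eg1.csv so that it runs on its own, counts matches with -c, shows line numbers with -n, ignores case with -i, and inverts the match with -v.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# count the matching lines instead of printing them
grep -c Sales eg1.csv

# prefix each matching line with its line number in the file
grep -n Sales eg1.csv

# ignore case, so "sales" also matches "Sales"
grep -i sales eg1.csv

# invert the match: every line that does NOT contain "Sales"
grep -v Sales eg1.csv
```

Keep in mind that grep looks at the whole line, not at a single column, so a short pattern can match in a field you did not intend.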

Q: How can I store a command’s output in a file.

Let’s say that I want to get the second column (i.e. Name) from eg1.csv and store it in a new file called names.txt. The > tells the shell to redirect the command’s output to a file.

cut -f 2 -d , eg1.csv > names.txt
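
Note that > overwrites the target file if it already exists. To add to the end of a file instead, the shell provides >>. A small sketch follows; the extra name “Anna” is made up for illustration, and eg1.csv is recreated first so that the snippet runs on its own.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# write the Name column to names.txt (replaces any previous content)
cut -f 2 -d , eg1.csv > names.txt

# append one more line to the end of names.txt ("Anna" is a hypothetical value)
echo "Anna" >> names.txt

# show the last two lines: the last original name and the appended one
tail -n 2 names.txt
```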

Q: How to combine commands.

The pipe symbol | tells the shell to use the output of the command on the left as the input to the command on the right. In the following example we select the Name column and exclude the header before saving the result:

cut -f 2 -d , eg1.csv | tail -n +2 > names_without_header.txt

Or we can take a subset of lines of a file. For example:

head -n 5 eg1.csv | tail -n -3
2,Billy,DS,M
3,Nick,IT,M
4,George,IT,M

Q: How to count the number of lines in a file.

wc -l eg1.csv
11 eg1.csv
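
When only the number is needed, for example inside a script, one option is to feed the file through standard input so that wc does not print the file name. Combined with tail, the same idea counts the data rows without the header. The snippet recreates eg1.csv first so that it runs on its own.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# print only the count, without the file name
wc -l < eg1.csv

# count the data rows, i.e. skip the header first
tail -n +2 eg1.csv | wc -l
```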

Q: How can I specify many files at once.

Assume that in the tmp folder we have some csv files and we want to get the first column of all of them.

cut -d , -f 1 tmp/*.csv

Q: How can I sort lines of text.

Let’s say that I want to sort the names in the eg1.csv file. I have to select the second column and exclude the header, which is called “Name”.

cut -f 2 -d , eg1.csv | grep -v Name | sort
Billy
Claudia
George
George
George
Jane
Jimmy
Maria
Nick
Nikki
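
sort can also order the full rows by a given column. A sketch, assuming the same comma-separated eg1.csv (recreated here so that the snippet runs on its own): -t sets the field separator, -k picks the sort key, and the n and r modifiers request a numeric and a reversed comparison.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# sort the data rows by department (column 3), then by name (column 2)
tail -n +2 eg1.csv | sort -t , -k 3,3 -k 2,2

# sort the data rows by ID (column 1), numerically, largest first
tail -n +2 eg1.csv | sort -t , -k 1,1nr
```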

Q: How can I take the unique lines.

The uniq command removes adjacent duplicate lines. This implies that we must first sort the file and then run uniq. For example, let’s take the unique names from names_without_header.txt.

sort names_without_header.txt | uniq
Billy
Claudia
George
Jane
Jimmy
Maria
Nick
Nikki

Q: How to do a “value counts”.

We can combine the sort and uniq -c commands. The following command returns the number of employees by department.

cut -f 3 -d , eg1.csv | grep -v Dept | sort | uniq -c

and we get:

      3 DS
      2 HR
      2 IT
      1 Marketing
      2 Sales
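
To rank the counts, a common pattern is to pipe the result through sort once more. This is a sketch that uses tail -n +2 to drop the header and recreates eg1.csv first so that it runs on its own; -r sorts in descending order and -n compares the counts as numbers.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# employees per department, largest department first
cut -f 3 -d , eg1.csv | tail -n +2 | sort | uniq -c | sort -rn
```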

Q: How to find the location of a file (or files) in a directory and all of its subdirectories.

The first argument of find is the directory where the search starts (here ., the current directory). It is followed by a flag that describes the method you want to use to search. In this case we’ll only be searching for a file by its name, so we’ll use the -name flag. The -name flag itself then takes an argument, the name of the file that you’re looking for.

# search for the randomfile.txt
find . -name randomfile.txt

# now let’s try searching for all .jpg files
# (the pattern is quoted so that the shell does not expand it before find runs):
find . -name "*.jpg"
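
The sketch below builds a small throwaway directory (find_demo and the file names in it are made up for illustration) and shows two common refinements: -type restricts the results to regular files (f) or directories (d), and quoting the pattern keeps the shell from expanding the * before find sees it.

```shell
# Create a hypothetical directory tree to search in
mkdir -p find_demo/sub
touch find_demo/a.jpg find_demo/sub/b.jpg find_demo/sub/notes.txt

# all regular files ending in .jpg, at any depth below find_demo
find find_demo -type f -name "*.jpg"

# all directories below (and including) find_demo
find find_demo -type d
```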

Q: How to compress/decompress files.

# compress files to a zip file
zip zipped.zip file1 file2 file3

# to uncompress a zip file
unzip zipped.zip

# compress the current directory to a gzipped tar archive
tar -zcvf myfile.tgz .

# decompress a gzipped tar archive
tar -zxvf myfile.tgz

# To extract a gzipped tar archive in two steps, type the following
gunzip filename_tar.gz
tar xvf filename_tar

# compress a file using gzip
gzip filename

# decompress the filename
gzip -d filename.gz
# or
gunzip filename.gz

Q: Difference between grep, egrep, fgrep

You can have a look at unix.stackexchange.
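
In short, egrep behaves like grep -E (extended regular expressions) and fgrep like grep -F (fixed strings, no regular expressions); the -E and -F spellings are the ones recommended today. A small sketch of the difference, recreating eg1.csv first so that it runs on its own (the two-line a.b/axb input is made up for illustration):

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# -E: extended regular expressions, so | means "or"
grep -E 'Sales|HR' eg1.csv

# -F: the pattern is a literal string, so the dot matches only a real dot
printf 'a.b\naxb\n' | grep -F 'a.b'

# without -F the dot matches any character, so both lines are returned
printf 'a.b\naxb\n' | grep 'a.b'
```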

Q: How to download files from remote locations.

We can use the wget command. For example let’s download the “iris.csv”.

wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
--2019-08-05 13:57:02--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) 
Saving to: ‘iris.csv’

iris.csv                      100%[=================================================>]   3.63K  --.-KB/s    in 0.001s

2019-08-05 13:57:02 (3.50 MB/s) - ‘iris.csv’ saved [3716/3716]
# head -n 5 iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

A brief description of the “sed” command

Q: How to display a line multiple times.

# prints every line, and the third line twice
sed '3p' eg1.csv
ID,Name,Dept,Gender
1,George,DS,M
2,Billy,DS,M
2,Billy,DS,M
3,Nick,IT,M
4,George,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,George,DS,M

Q: How to display a specific line.

# it displays only the third line
sed -n '3p' eg1.csv
2,Billy,DS,M

Q: How to display the last line of a file.

sed -n '$p' eg1.csv
10,George,DS,M

Q: How to display a range of lines

# it prints the 2nd through the 4th line
sed -n '2,4p' eg1.csv
1,George,DS,M
2,Billy,DS,M
3,Nick,IT,M

Q: How NOT to display a specific line or a range of lines.

# all except the 2nd line
sed -n '2!p' eg1.csv

# all except the 2nd through 4th lines
sed -n '2,4!p' eg1.csv
ID,Name,Dept,Gender
4,George,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,George,DS,M

Q: How to display lines by searching a word.

# return any line containing the word "George"
sed -n '/George/p' eg1.csv

1,George,DS,M
4,George,IT,M
10,George,DS,M

Q: How to substitute data in file.

# replace "George" with "Georgios"
sed 's/George/Georgios/g' eg1.csv
ID,Name,Dept,Gender
1,Georgios,DS,M
2,Billy,DS,M
3,Nick,IT,M
4,Georgios,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,Georgios,DS,M
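
sed prints the modified text and leaves eg1.csv itself untouched, so to keep the result one option is to redirect it to a new file. The sketch below also wraps the pattern in commas so that only a whole field equal to George is replaced; the output file name eg1_renamed.csv is made up for illustration, and eg1.csv is recreated first so that the snippet runs on its own.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# replace the Name field "George" and save the result to a new file
# (eg1_renamed.csv is a hypothetical name; eg1.csv is not modified)
sed 's/,George,/,Georgios,/' eg1.csv > eg1_renamed.csv

# show the replaced rows in the new file
grep Georgios eg1_renamed.csv
```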

A brief description of the “awk” command

Q: How to print a specific column.

# prints the third column. The dollar sign defines the column and the separator 
# was defined with the -F ","
awk -F "," '{print $3}' eg1.csv

# alternatively
awk '{print $3}' FS="," eg1.csv

# print the 1st and 3rd columns, separated by a tab
awk -F "," '{print $1 "\t" $3}' eg1.csv

# to print the whole line, use $0
awk -F "," '{print $0}' eg1.csv

# awk -F "," '{print $1 "\t" $3}' eg1.csv
ID     Dept
1       DS
2       DS
3       IT
4       IT
5       HR
6       HR
7       Sales
8       Sales
9       Marketing
10      DS

Q: How to remove header row from result.

# we use NR, the number of the current record (row)
awk 'NR!=1' eg1.csv

# NR also works with the other comparison operators (greater, less, equal, not equal),
# so we get the same result with NR>1
awk 'NR>1' eg1.csv

Q: How to conditionally select data.

# let's say that we want all the rows where the department is DS
awk -F"," '$3=="DS"{print $0}' eg1.csv

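# let's say that we want all the rows where the id is higher than 5
# (the header is printed too: "ID" is not a number, so it is compared as a string)
awk -F"," '$1>5{print $0}' eg1.csv

# get all the rows where there is the substring "Ge"
# (the header matches too, because of "Gender")
awk -F"," '/Ge/{print $0}' eg1.csv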

# get all the rows where there is the substring "Ge" in second column 
awk -F"," '$2~/Ge/{print $0}' eg1.csv

# get all the rows where there is NOT the substring "Ge" in second column 
awk -F"," '$2!~/Ge/{print $0}' eg1.csv

# get all the rows where there is the exact match "George" in second column 
awk -F"," '$2=="George"{print $0}' eg1.csv

# awk -F"," '$3=="DS"{print $0}' eg1.csv
1,George,DS,M
2,Billy,DS,M
10,George,DS,M

# awk -F"," '$1>5{print $0}' eg1.csv
ID,Name,Dept,Gender
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
10,George,DS,M

# awk -F"," '/Ge/{print $0}' eg1.csv
ID,Name,Dept,Gender
1,George,DS,M
4,George,IT,M
10,George,DS,M

# awk -F"," '$2!~/Ge/{print $0}' eg1.csv
ID,Name,Dept,Gender
2,Billy,DS,M
3,Nick,IT,M
5,Nikki,HR,F
6,Claudia,HR,F
7,Maria,Sales,F
8,Jimmy,Sales,M
9,Jane,Marketing,F
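
As the outputs above show, the header row can slip through a condition. A possible way to avoid that is to combine the condition with NR>1. The same sketch also counts rows per department with an awk array; the array name count is arbitrary, and the result is piped through sort because awk does not guarantee the order of a for-in loop. The snippet recreates eg1.csv first so that it runs on its own.

```shell
# Recreate eg1.csv so that this snippet runs on its own
printf '%s\n' ID,Name,Dept,Gender 1,George,DS,M 2,Billy,DS,M 3,Nick,IT,M \
  4,George,IT,M 5,Nikki,HR,F 6,Claudia,HR,F 7,Maria,Sales,F 8,Jimmy,Sales,M \
  9,Jane,Marketing,F 10,George,DS,M > eg1.csv

# rows where the id is higher than 5, without the header
awk -F "," 'NR>1 && $1>5' eg1.csv

# number of rows per department, printed as "Dept count"
awk -F "," 'NR>1 {count[$3]++} END {for (d in count) print d, count[d]}' eg1.csv | sort
```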
