The goal of this article is to provide some “NumPy hacks” that are quite useful in the Data Science pipeline, especially during the Data Cleansing phase. As always, we will work with reproducible and practical examples, using the Pandas and NumPy libraries.
random
random.seed()
NumPy gives us the possibility to generate random numbers. However, when we work with reproducible examples, we want the “random” numbers to be identical whenever we run the code. For that reason, we can set a random seed with the random.seed() function, which is similar to the random_state parameter of the scikit-learn package.
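A quick sketch of what seeding buys us: the same seed always produces the same sequence, so re-running a notebook yields identical “random” data.

```python
import numpy as np

# draw three numbers, then reset the seed and draw again
np.random.seed(5)
first = np.random.rand(3)

np.random.seed(5)
second = np.random.rand(3)

# the two draws are identical because the seed was the same
print(np.array_equal(first, second))  # True
```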
random.choice() | random.poisson() | random.rand()
With NumPy we can generate random numbers from distributions such as Poisson, normal, and exponential, from the uniform distribution with random.rand(), and from a sample with random.choice(). Let’s generate a Pandas data frame using the random module.
Example: We will create a pandas data frame of 20 rows with the columns gender, age, score_a, score_b and score_c.
```python
import pandas as pd
import numpy as np

# set a random seed
np.random.seed(5)

# gender: 60% male, 40% female
# age: from a Poisson distribution with lambda=25
# scores: random integers from 0 to 100
df = pd.DataFrame({'gender': np.random.choice(a=['m', 'f'], size=20, p=[0.6, 0.4]),
                   'age': np.random.poisson(lam=25, size=20),
                   'score_a': np.random.randint(100, size=20),
                   'score_b': np.random.randint(100, size=20),
                   'score_c': np.random.randint(100, size=20)})
df
```
random.shuffle()
With random.shuffle() we can randomly shuffle NumPy arrays in place (for a 2-D array, it shuffles the rows).
```python
# set a random seed
np.random.seed(5)

arr = df.values
np.random.shuffle(arr)  # shuffles the rows in place
arr
```
logical_and() | logical_or()
I have found logical_and() and logical_or() to be very convenient when dealing with multiple conditions. Let’s provide some simple examples.
```python
x = np.arange(5)
np.logical_and(x > 1, x < 4)
```
And we get:
array([False, False, True, True, False])
```python
np.logical_or(x < 1, x > 3)
```
And we get:
array([ True, False, False, False, True])
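These boolean arrays are also handy as masks: passing one to the indexing operator keeps only the elements where the condition holds. A minimal sketch:

```python
import numpy as np

x = np.arange(5)

# keep only the elements where both conditions hold
mask = np.logical_and(x > 1, x < 4)
x[mask]  # array([2, 3])
```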
where()
The where() function is very helpful when we want to apply if-else logic when assigning new values. Let’s say that we want to assign the value “Pass” when the score is at least 50 and “Fail” otherwise. Let’s do it for the score_a column.
```python
df['score_a_pass'] = np.where(df.score_a >= 50, "Pass", "Fail")
df.head()
```
select()
If we want to add more conditions, even across multiple columns, then we should work with the select() function. Let’s say that I want to define a new column called demo as follows:
- if the gender is ‘m’ and the age is 20 or below then ‘boy’
- if the gender is ‘m’ and the age is above 20 then ‘mister’
- if the gender is ‘f’ and the age is 20 or below then ‘girl’
- if the gender is ‘f’ and the age is above 20 then ‘lady’
- else ‘null’
Let’s see how easily we can do it by using the select() function:
```python
choices = ['Mister', 'Lady', 'Boy', 'Girl']

conditions = [
    (df['gender'] == 'm') & (df['age'] > 20),
    (df['gender'] == 'f') & (df['age'] > 20),
    (df['gender'] == 'm') & (df['age'] <= 20),
    (df['gender'] == 'f') & (df['age'] <= 20)
]

df['demo'] = np.select(conditions, choices, default=np.nan)
df.head(10)
```
Note that we could have used logical_and() in the conditions instead of the & operator.
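A minimal sketch of that variant, on a small illustrative frame (not the one generated above), with the default spelled as the string 'null':

```python
import numpy as np
import pandas as pd

# tiny illustrative data
df = pd.DataFrame({'gender': ['m', 'f', 'm', 'f'],
                   'age': [25, 30, 15, 10]})

# logical_and() replaces the & operator in each condition
conditions = [
    np.logical_and(df['gender'] == 'm', df['age'] > 20),
    np.logical_and(df['gender'] == 'f', df['age'] > 20),
    np.logical_and(df['gender'] == 'm', df['age'] <= 20),
    np.logical_and(df['gender'] == 'f', df['age'] <= 20)
]
choices = ['Mister', 'Lady', 'Boy', 'Girl']

df['demo'] = np.select(conditions, choices, default='null')
df  # demo column: Mister, Lady, Boy, Girl
```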
digitize()
Many times we want to bucketize our data into bins. We have explained how to create bins with Pandas. Let’s see how we can do it with NumPy. Let’s say that I want to create 5 bins from the score_a column.
```python
bins = np.array([0, 20, 40, 60, 80, 100])
df['Bins'] = np.digitize(df.score_a, bins)
df.head(10)
```
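Keep in mind that digitize() returns 1-based bin indices (with the default right=False, a value x lands in bin i when bins[i-1] <= x < bins[i]). A small sketch, with illustrative labels, showing how to map those indices back to human-readable buckets:

```python
import numpy as np

bins = np.array([0, 20, 40, 60, 80, 100])
scores = np.array([5, 35, 60, 99])

# 1-based bin index for each score
idx = np.digitize(scores, bins)
print(idx)  # [1 2 4 5]

# illustrative labels for the five buckets
labels = np.array(['0-19', '20-39', '40-59', '60-79', '80-100'])
print(labels[idx - 1])
```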
split()
You can also split NumPy arrays into parts. Let’s say that you want to create train (60%), validation (20%) and test (20%) datasets.
```python
data_a, data_b, data_c = np.split(df.values,
                                  [int(0.6 * len(df.values)),
                                   int(0.8 * len(df.values))])
print(data_a)
print(data_b)
print(data_c)
```
clip()
Sometimes we want to restrict values to a range, replacing anything below the interval with its minimum and anything above it with its maximum. Let’s assume that we want the data to take values from 0 to 100, but our dataset has values below 0 and above 100. Let’s see the example below:
```python
x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])
np.clip(x, 0, 100)
```
As we can see below, the negative values became 0 and the values above 100 became 100:
array([ 30, 20, 50, 70, 50, 100, 10, 100, 0, 0, 100])
extract()
Let’s say that we want to extract the values that satisfy some conditions. Assume that, in the previous example, we wanted to get the values which are less than 0 or greater than 100:
```python
np.extract((x > 100) | (x < 0), x)
```
And we get:
array([130, -20, -10, 200])
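For a 1-D array like this one, plain boolean-mask indexing gives the same result, so extract() is essentially shorthand here:

```python
import numpy as np

x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])

# boolean-mask indexing is equivalent to np.extract on a 1-D array
x[(x > 100) | (x < 0)]  # array([130, -20, -10, 200])
```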
unique()
The unique() function returns the unique values of an array, but with return_counts=True we can also get a “value counts” of each element. For example:
```python
# How to count the unique values of an array
x = np.array([0, 0, 0, 1, 1, 1, 0, 0, 2, 2])
unique, counts = np.unique(x, return_counts=True)
dict(zip(unique, counts))
```
And we get:
{0: 5, 1: 3, 2: 2}
argmax() | argmin() | argsort() | argpartition()
These functions are very useful. The argmax() and argmin() functions return the index of the max and min element respectively. Let’s say that we want to know the index of the row with the maximum score_a:
```python
np.argmax(np.array(df.score_a))
```
and we get 17. Now we can get the whole 17th row of the data frame:
```python
df.iloc[np.argmax(np.array(df.score_a))]
```
The argsort() function returns the indexes that would sort the NumPy array. Let’s say that I want to sort the data frame by the score_a column:
```python
df.iloc[np.argsort(np.array(df.score_a))]
```
If we want to get the indexes of the N largest values, then we can use argpartition(). Let’s say that we want to get the top 5 elements of the following array:
```python
x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])
indexes = np.argpartition(x, -5)[-5:]
indexes
```
array([ 2, 3, 5, 7, 10], dtype=int64)
x[indexes]
array([ 50, 70, 100, 130, 200])
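Note that argpartition() only guarantees that the last five positions hold the five largest values; it does not order them. If we want the top elements sorted, we can argsort() just that small slice afterwards, a sketch:

```python
import numpy as np

x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])

# the five largest values, in no particular order
top5 = np.argpartition(x, -5)[-5:]

# sort only these five indexes by their values, descending
top5_sorted = top5[np.argsort(x[top5])[::-1]]
x[top5_sorted]  # array([200, 130, 100, 70, 50])
```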
Let’s provide a final example with the following scenario. Say that we want to create three columns, Top1, Top2 and Top3, for each row based on the scores in the columns score_a, score_b and score_c. In other words, for each row, at which exam was the highest score, the second highest, and the third highest. We can work with argsort():
```python
Tops = pd.DataFrame(
    df[['score_a', 'score_b', 'score_c']]
    .apply(lambda x: list(df[['score_a', 'score_b', 'score_c']]
                          .columns[np.array(x).argsort()[::-1][:3]]),
           axis=1)
    .to_list(),
    columns=['Top1', 'Top2', 'Top3'])
Tops
```
Sum-Up
NumPy is a very popular and powerful library. It is very fast and compatible with AI and ML libraries like Scikit-Learn, TensorFlow, etc. Thus, it is very important for every Data Scientist to be competent with NumPy. If you liked this article, then you may also like these tips about NumPy arrays.