When we are dealing with Data Frames, it is quite common, mainly for feature engineering tasks, to change the values of the existing features or to create new features based on some conditions of other columns. Here, we will provide some examples of how we can create a new column based on multiple conditions of existing columns. For these examples, we will work with the titanic
dataset.
Import the data and the libraries
import pandas as pd import numpy as np url = 'https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv' df = pd.read_csv(url, sep="\t") df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Define the Task
Let’s say that we want to create a new column (or to update an existing one) with the following conditions:
- If the
Age
isNaN
andPclass
=1 then the Age=40 - If the
Age
isNaN
andPclass
=2 then the Age=30 - If the
Age
isNaN
andPclass
=3 then the Age=25 - Else the
Age
will remain as is
Solution 1: Using apply and lambda functions
We will need to create a function with the conditions.
def myfunc(age, pclass): if pd.isnull(age) and pclass==1: age=40 elif pd.isnull(age) and pclass==2: age=30 elif pd.isnull(age) and pclass==3: age=25 else: age=age return age
Then, we use the apply
method using the lambda
function which takes as input our function with parameters the pandas columns. Do not forget to set the axis=1
, in order to apply the function row-wise.
df['NewColumn_1'] = df.apply(lambda x: myfunc(x['Age'], x['Pclass']), axis=1)
Solution 2: Using NumPy Select
We can use the NumPy Select
function, where you define the conditions and their corresponding values. For example:
conditions = [df['Pclass'].eq(1) & df['Age'].isnull(), df['Pclass'].eq(2) & df['Age'].isnull(), df['Pclass'].eq(3) & df['Age'].isnull()] choices = [40,30,25] df['NewColumn_2'] = np.select(conditions, choices, default= df['Age'] )
Now let’s see if the Column_1
is identical to Column_2
df['NewColumn_1'].equals(df['NewColumn_2']) True
Let’s have a look also at our new data frame focusing on the cases where the Age
was NaN
.
df[df['Age'].isnull()].head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NewColumn_1 | NewColumn_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 25.0 | 25.0 |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | 30.0 | 30.0 |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | 25.0 | 25.0 |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | 25.0 | 25.0 |
28 | 29 | 1 | 3 | O’Dwyer, Miss. Ellen “Nellie” | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 25.0 | 25.0 |
As we can see, we got the expected output! In case you want to work with R you can have a look at the example.
More Data Science Hacks?
You can follow us on Medium for more Data Science Hacks
1 thought on “Pandas: How to assign values based on multiple conditions of different columns”