Factors are a “special” data structure in R. Below are some examples of how to rename and relevel them. For the examples that follow, we will work with the following data:
df <- data.frame(
  ID = 1:10,
  Gender = factor(c("M", "M", "M", "", "F", "F", "M", "", "F", "F")),
  AgeGroup = factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]",
                      "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]"))
)
> df
   ID Gender AgeGroup
1   1      M    [60+]
2   2      M  [26-35]
3   3      M     [NA]
4   4         [36-45]
5   5      F  [46-60]
6   6      F  [26-35]
7   7      M     [NA]
8   8         [18-25]
9   9      F  [26-35]
10 10      F  [26-35]
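Before changing anything, it can help to look at how R stores these columns. The sketch below rebuilds the two factor columns as standalone vectors (the names `gender` and `age_group` are only for illustration) and inspects their levels and integer codes. By default R sorts the levels, so the empty string comes first for Gender.

```r
# Standalone copies of the two factor columns (illustrative names)
gender <- factor(c("M", "M", "M", "", "F", "F", "M", "", "F", "F"))
age_group <- factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]",
                      "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]"))

# Levels are the distinct values, sorted by default
levels(gender)
levels(age_group)

# Each observation is stored as an integer index into the levels
as.integer(gender)

# Number of observations per level
table(age_group)
```

Knowing the current order of the levels matters for the renaming and releveling steps, because some of them work by position.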
Rename Factors
Let’s say that we want to rename the empty-string level of Gender to “U”, for Unknown:
levels(df$Gender)[levels(df$Gender)==""] ="U"
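One thing to watch with this pattern: if the level you compare against does not exist (for example because of a typo), the assignment silently does nothing. A minimal sketch on a standalone vector, where `g` is an illustrative name:

```r
g <- factor(c("M", "", "F", ""))

# Rename the empty-string level; every matching observation changes with it
levels(g)[levels(g) == ""] <- "U"
as.character(g)

# A comparison that matches no level leaves the factor unchanged, without an error
before <- levels(g)
levels(g)[levels(g) == "Unknown"] <- "U"
identical(before, levels(g))

# A quick guard: check that a level exists before relying on it
"U" %in% levels(g)
```

Checking `levels()` or `table()` after a rename is a cheap way to confirm that the change actually happened.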
Let’s say that we want to merge the age groups. For instance, the new categories will be “[18-35]”, “[35+]” and “[NA]”:
levels(df$AgeGroup)[levels(df$AgeGroup)=="[18-25]"] = "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[26-35]"] = "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[36-45]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[46-60]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[60+]"] = "[35+]"
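Notice that we could have done this in a single assignment instead, but that is risky: the new names are matched to the existing levels by position, so if the levels are in a different order than we expect, the wrong groups get merged. Use this in place of the statements above, not after them: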
levels(df$AgeGroup)<-c("[18-35]","[18-35]","[35+]","[35+]","[35+]", "[NA]")
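A name-based alternative avoids depending on the level order. Base R lets you assign a named list to `levels()`, where each name is a new level and each element lists the old levels to merge into it. The sketch below uses a standalone vector (`age_group` is an illustrative name). Any old level left out of the list becomes NA, which is why “[NA]” is listed explicitly.

```r
age_group <- factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]",
                      "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]"))

# New level name = old levels to merge into it.
# The order of the list entries becomes the order of the new levels.
levels(age_group) <- list(
  "[18-35]" = c("[18-25]", "[26-35]"),
  "[35+]"   = c("[36-45]", "[46-60]", "[60+]"),
  "[NA]"    = "[NA]"
)

levels(age_group)
table(age_group)
```

Because the mapping is written out by name, it keeps working even if the original levels happen to be stored in a different order.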
After applying the changes above, we get the following data:
> df
   ID Gender AgeGroup
1   1      M    [35+]
2   2      M  [18-35]
3   3      M     [NA]
4   4      U    [35+]
5   5      F    [35+]
6   6      F  [18-35]
7   7      M     [NA]
8   8      U  [18-35]
9   9      F  [18-35]
10 10      F  [18-35]
Relevel Factors
Let’s say that we want the “[NA]” age group to appear first:
df$AgeGroup <- factor(df$AgeGroup, levels = c("[NA]", "[18-35]", "[35+]"))
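Re-creating the factor with an explicit levels vector changes only the order of the levels, not the values. The sketch below, on a standalone vector with the illustrative name `a`, also shows a pitfall: any value whose level is missing from the vector (or misspelled in it) is turned into NA.

```r
a <- factor(c("[35+]", "[18-35]", "[NA]", "[18-35]"))

# Reorder: the values stay the same, only the level order (and integer codes) change
a2 <- factor(a, levels = c("[NA]", "[18-35]", "[35+]"))
levels(a2)
identical(as.character(a), as.character(a2))

# Pitfall: "[35+]" is left out here, so that observation becomes NA
a3 <- factor(a, levels = c("[NA]", "[18-35]"))
sum(is.na(a3))
```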
Another way to change the order is to use relevel() to make a particular level first in the list. (This will not work for ordered factors.) Let’s say that we want the ‘F’ Gender first:
df$Gender<-relevel(df$Gender, "F")
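As a small illustration, the sketch below moves a level to the front with relevel() and then shows that the same call on an ordered factor raises an error, which is caught here with tryCatch(). The vectors `g` and `size` are illustrative and not part of the data above.

```r
g <- factor(c("M", "U", "F", "F"))
levels(g)   # default sorted order

# relevel() moves the chosen level to the front and keeps the rest in order
g <- relevel(g, ref = "M")
levels(g)

# On an ordered factor, relevel() stops with an error
size <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"), ordered = TRUE)
result <- tryCatch(relevel(size, ref = "high"), error = function(e) "error")
result

# For ordered factors, set the full order explicitly instead
size <- factor(size, levels = c("high", "mid", "low"), ordered = TRUE)
levels(size)
```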
After applying these changes, we can see how the factor levels have changed:
> str(df)
'data.frame': 10 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ Gender : Factor w/ 3 levels "F","U","M": 3 3 3 2 1 1 3 2 1 1
$ AgeGroup: Factor w/ 3 levels "[NA]","[18-35]",..: 3 2 1 3 3 2 1 2 2 2