Predictive Hacks

How to Report the Distribution of Attributes per Cluster


Suppose you have run a clustering algorithm and want to report the distribution of each categorical variable per cluster in a “tidy” table. Below is one way to do it in R.

Generate the Data

Let’s assume that we came up with 3 clusters, “C1”, “C2” and “C3”, and that we have 3 categorical attributes:

  • Gender: “M”, “F”
  • Type: “A”, “B”, “C”, “D”
  • Category: “High”, “Medium”, “Low”
library(tidyverse)

set.seed(5)

df1<-tibble(ID=seq_len(500))%>%
     mutate(Cluster = "C1",
            Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.6, 0.4)),
            Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.20, 0.3, 0.4, 0.1)),
            Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.6, 0.3)))

df2<-tibble(ID=seq_len(300))%>%
  mutate(Cluster = "C2",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.4, 0.6)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.40, 0.1, 0.2, 0.3)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.7, 0.2, 0.1)))

df3<-tibble(ID=seq_len(200))%>%
  mutate(Cluster = "C3",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.2, 0.8)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.5, 0.3, 0.1, 0.1)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.2, 0.7)))

# Stack the three clusters into one tibble
df<-bind_rows(df1, df2, df3)

df
 
# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>   
 1     1 C1      M      C     Medium  
 2     2 C1      F      C     Medium  
 3     3 C1      F      C     Medium  
 4     4 C1      M      B     Low     
 5     5 C1      M      B     Low     
 6     6 C1      F      C     Medium  
 7     7 C1      M      C     Medium  
 8     8 C1      F      B     High    
 9     9 C1      F      C     Medium  
10    10 C1      M      A     Medium  
# ... with 990 more rows
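
Before building the full report, it can help to check the cluster sizes and to look at a single attribute on its own. The snippet below is a minimal sketch on a small hand-made table; `toy` and its values are hypothetical stand-ins for `df`, chosen only so the proportions are easy to verify by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for df (not the simulated data above)
toy <- tibble(
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender  = c("M", "M", "M", "F", "F", "F")
)

# Number of rows per cluster
cluster_sizes <- count(toy, Cluster)
cluster_sizes

# Share of each Gender level within each cluster;
# margin = 2 normalises by column, so every cluster column sums to 1
gender_by_cluster <- prop.table(table(toy$Gender, toy$Cluster), margin = 2)
gender_by_cluster
```

The same two calls should work on `df` by replacing `toy` with `df`. The report in the next section repeats this idea for every attribute and stacks the results.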

Report the Distribution of Attributes


# Every column after ID and Cluster is treated as a categorical attribute
attributes <- names(df)[3:ncol(df)]


output<-NULL

for (a in attributes) {
  
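  # Count rows per attribute level and cluster (group_by_() is deprecated)
  tmp<-df%>%group_by(across(all_of(c(a, "Cluster"))))%>%
    summarise(n = n(), .groups = "drop")%>%
    # Convert counts to proportions within each cluster
    group_by(Cluster)%>%mutate(Prop=n/(sum(n)))%>%
    ungroup()%>%select(-n)%>%
    # One column per cluster (pivot_wider() supersedes spread())
    pivot_wider(names_from = Cluster, values_from = Prop)%>%
    mutate(Attribute = a)%>%select(Attribute, everything())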
  colnames(tmp)[1:2]<-c("attribute", "values")
  
  output<-rbind(output, tmp)
  
}

output
 
# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78 
2 Gender    M      0.602 0.407 0.22 
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75 
9 Category  Medium 0.574 0.213 0.185
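
The loop can also be avoided by first reshaping the attributes into long format and then computing all proportions in one pipeline. The following is a sketch of that idea on a small hypothetical table (`toy` and its values are made up for illustration); it is an alternative formulation, not the code that produced the output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as df
toy <- tibble(
  ID       = 1:6,
  Cluster  = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender   = c("M", "M", "M", "F", "F", "F"),
  Category = c("High", "Low", "Low", "Low", "High", "High")
)

report <- toy %>%
  # One row per (ID, attribute) pair
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  # Count each level of each attribute within each cluster
  count(Cluster, attribute, values) %>%
  # Turn counts into proportions within cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cluster; levels absent from a cluster get 0
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0)

report
```

Two differences from the loop are worth noting: the attributes come out in alphabetical order rather than column order, and `values_fill = 0` reports a level that never occurs in a cluster as 0 instead of `NA`.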

Another approach is to use the map functions from the purrr package. The drawback of this solution is that the attribute and its level are fused into a single row name (for example Gender.F) instead of being reported in separate columns.

df %>%
  split(.$Cluster) %>%
  map(select, -c(ID, Cluster)) %>%
  map_depth(2, . %>% table %>% prop.table) %>%
  map(unlist) %>%
  data.frame
 

And we get:

                   C1        C2    C3
Gender.F        0.398 0.5933333 0.780
Gender.M        0.602 0.4066667 0.220
Type.A          0.188 0.4133333 0.425
Type.B          0.318 0.1000000 0.365
Type.C          0.390 0.1933333 0.105
Type.D          0.104 0.2933333 0.105
Category.High   0.114 0.6833333 0.065
Category.Low    0.312 0.1033333 0.750
Category.Medium 0.574 0.2133333 0.185
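
If separate attribute and level columns are needed, the row names can be split afterwards. The sketch below assumes the row names follow the `Attribute.Level` pattern shown above; `wide` is a hand-built stand-in containing the first three rows and two columns of that output, and `tidy_report` is a name chosen here for illustration.

```r
library(tibble)
library(tidyr)

# Stand-in for the data frame returned by the purrr pipeline
# (first three rows of the output above, C1 and C2 only)
wide <- data.frame(
  C1 = c(0.398, 0.602, 0.188),
  C2 = c(0.5933333, 0.4066667, 0.4133333),
  row.names = c("Gender.F", "Gender.M", "Type.A")
)

# Move the row names into a regular column
keyed <- rownames_to_column(wide, "key")

# Split "Gender.F" into attribute = "Gender" and values = "F";
# extra = "merge" keeps any further dots inside the level name
tidy_report <- separate(keyed, key, into = c("attribute", "values"),
                        sep = "\\.", extra = "merge")

tidy_report
```

This splits at the first dot, so it would break if an attribute name itself contained a dot.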


