How to Report the Distribution of Attributes per Cluster

Suppose you have run a clustering algorithm and want to report the distribution of each categorical variable per cluster in a “tidy” table. Below is one way to do it in R.

Generate the Data

Let’s assume that we ended up with 3 clusters, “C1”, “C2” and “C3”, and that we have 3 categorical attributes:

Gender: “M”, “F”
Type: “A”, “B”, “C”, “D”
Category: “High”, “Medium”, “Low”

library(tidyverse)

set.seed(5)

df1<-tibble(ID=seq_len(500))%>%
  mutate(Cluster = "C1",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.6, 0.4)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.20, 0.3, 0.4, 0.1)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.6, 0.3)))

df2<-tibble(ID=seq_len(300))%>%
  mutate(Cluster = "C2",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.4, 0.6)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.40, 0.1, 0.2, 0.3)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.7, 0.2, 0.1)))

df3<-tibble(ID=seq_len(200))%>%
  mutate(Cluster = "C3",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.2, 0.8)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.5, 0.3, 0.1, 0.1)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.2, 0.7)))

df<-bind_rows(df1, df2, df3)

df

# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>   
 1     1 C1      M      C     Medium  
 2     2 C1      F      C     Medium  
 3     3 C1      F      C     Medium  
 4     4 C1      M      B     Low     
 5     5 C1      M      B     Low     
 6     6 C1      F      C     Medium  
 7     7 C1      M      C     Medium  
 8     8 C1      F      B     High    
 9     9 C1      F      C     Medium  
10    10 C1      M      A     Medium  
# ... with 990 more rows

Report the Distribution of Attributes


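# Every column after ID and Cluster is treated as a categorical attribute
attributes <- names(df)[3:ncol(df)]

output <- NULL

for (a in attributes) {

  tmp <- df %>%
    # count rows per attribute level and cluster
    # (group_by_() is deprecated; across(all_of()) takes column names as strings)
    group_by(across(all_of(c(a, "Cluster")))) %>%
    summarise(n = n(), .groups = "drop") %>%
    # convert counts to proportions within each cluster
    group_by(Cluster) %>%
    mutate(Prop = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    # one column per cluster (pivot_wider() supersedes spread())
    pivot_wider(names_from = Cluster, values_from = Prop) %>%
    mutate(Attribute = a) %>%
    select(Attribute, everything())
  colnames(tmp)[1:2] <- c("attribute", "values")

  output <- rbind(output, tmp)

}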

output

# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78 
2 Gender    M      0.602 0.407 0.22 
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75 
9 Category  Medium 0.574 0.213 0.185
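
The same kind of report can also be built without an explicit loop. The block below is a minimal sketch under a few assumptions: it uses a small made-up data frame (`toy`) instead of the simulated data, it assumes every attribute column is of character type so the columns can be stacked together, and it fills levels that never occur in a cluster with 0 rather than NA.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the clustered data frame (values are made up)
toy <- tibble::tibble(
  ID      = 1:6,
  Cluster = c("C1", "C1", "C1", "C2", "C2", "C2"),
  Gender  = c("M", "M", "F", "F", "F", "M"),
  Type    = c("A", "B", "A", "B", "B", "B")
)

report <- toy %>%
  # stack all attribute columns into (attribute, values) pairs
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  # count rows per cluster, attribute and level
  count(Cluster, attribute, values) %>%
  # turn counts into proportions within each cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # one column per cluster; levels absent from a cluster get 0
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0)

report
```

Replacing `toy` with `df` should give the same proportions as the loop, with the rows ordered by attribute name instead of by column position.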

Another approach is to use the map functions from the purrr package. The drawback of this solution is that the attribute and its level end up combined in the row names (for example, Gender.F) rather than in separate columns.

df %>%
  split(.$Cluster) %>%
  map(select, -c(ID, Cluster)) %>%
  map_depth(2, . %>% table %>% prop.table) %>%
  map(unlist) %>%
  data.frame

And we get:

                   C1        C2    C3
Gender.F        0.398 0.5933333 0.780
Gender.M        0.602 0.4066667 0.220
Type.A          0.188 0.4133333 0.425
Type.B          0.318 0.1000000 0.365
Type.C          0.390 0.1933333 0.105
Type.D          0.104 0.2933333 0.105
Category.High   0.114 0.6833333 0.065
Category.Low    0.312 0.1033333 0.750
Category.Medium 0.574 0.2133333 0.185
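
If separate columns are needed, row names of the form attribute.level can be split afterwards. The following is a rough sketch, not part of the original solution: `wide` is a hypothetical stand-in with made-up values that mimics the shape of the result above, and the split assumes that attribute names themselves contain no dots.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the split/map result:
# row names are "<attribute>.<level>", one column per cluster (made-up values)
wide <- data.frame(
  C1 = c(0.4, 0.6, 0.7, 0.3),
  C2 = c(0.5, 0.5, 0.2, 0.8),
  row.names = c("Gender.F", "Gender.M", "Type.A", "Type.B")
)

tidy <- wide %>%
  # move the row names into a regular column
  rownames_to_column("key") %>%
  # split on the first dot only, so levels that contain dots stay intact
  separate(key, into = c("attribute", "values"), sep = "\\.", extra = "merge") %>%
  as_tibble()

tidy
```

The result has the same attribute, values, C1, C2 layout as the loop-based report.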

2 thoughts on “How to Report the Distribution of Attributes per Cluster”

Łukasz Deryło

January 21, 2021 at 6:50 am

Maybe it is good example to introduce split-map technique?

> df %>%
    split(.$Cluster) %>%
    map(select, -c(ID, Cluster)) %>%
    map_depth(2, . %>% table %>% prop.table) %>%
    map(unlist) %>%
    data.frame

                   C1        C2    C3
Gender.F        0.398 0.5933333 0.780
Gender.M        0.602 0.4066667 0.220
Type.A          0.188 0.4133333 0.425
Type.B          0.318 0.1000000 0.365
Type.C          0.390 0.1933333 0.105
Type.D          0.104 0.2933333 0.105
Category.High   0.114 0.6833333 0.065
Category.Low    0.312 0.1033333 0.750
Category.Medium 0.574 0.2133333 0.185
- George Pipis
  
  January 21, 2021 at 12:18 pm
  
  Thank you Lukasz, Great example! I will add it!
