# How to Report the Distribution of Attributes per Cluster

Let’s say that you have applied a clustering algorithm and would like to report the distribution of the categorical variables within each cluster as a “tidy” table. Below is one way to do it in R.

## Generate the Data

Let’s assume that the clustering produced three clusters, “C1”, “C2” and “C3”, and that each observation has three categorical attributes:

• Gender: “M”, “F”
• Type: “A”, “B”, “C”, “D”
• Category: “High”, “Medium”, “Low”

```
library(tidyverse)

set.seed(5)

# Cluster C1: 500 observations
df1 <- tibble(ID = seq_len(500)) %>%
  mutate(Cluster = "C1",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.6, 0.4)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.2, 0.3, 0.4, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.6, 0.3)))

# Cluster C2: 300 observations
df2 <- tibble(ID = seq_len(300)) %>%
  mutate(Cluster = "C2",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.4, 0.6)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.4, 0.1, 0.2, 0.3)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.7, 0.2, 0.1)))

# Cluster C3: 200 observations
df3 <- tibble(ID = seq_len(200)) %>%
  mutate(Cluster = "C3",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.2, 0.8)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.5, 0.3, 0.1, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.2, 0.7)))

# Stack the three clusters into one data frame
df <- bind_rows(df1, df2, df3)

df
```
```
# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>
 1     1 C1      M      C     Medium
 2     2 C1      F      C     Medium
 3     3 C1      F      C     Medium
 4     4 C1      M      B     Low
 5     5 C1      M      B     Low
 6     6 C1      F      C     Medium
 7     7 C1      M      C     Medium
 8     8 C1      F      B     High
 9     9 C1      F      C     Medium
10    10 C1      M      A     Medium
# ... with 990 more rows
```
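Before building the full report, a cross-tabulation of a single attribute is a cheap sanity check. The sketch below is only an illustration: it uses a small hand-made data frame (`toy`, a hypothetical stand-in, not the simulated `df`) and base R's `table()` and `prop.table()` to compute the share of each gender within each cluster.

```r
# Hypothetical toy data, small enough to verify by hand
toy <- data.frame(
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender  = c("M", "M", "M", "F", "F", "F")
)

# margin = 1 normalises by row, so each cluster (row) sums to 1
gender_by_cluster <- prop.table(table(toy$Cluster, toy$Gender), margin = 1)

gender_by_cluster
```

Applied to the simulated data as `prop.table(table(df$Cluster, df$Gender), margin = 1)`, the proportions should land close to the `prob` values passed to `sample()`.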

## Report the Distribution of Attributes

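```
# Attribute columns: everything after ID and Cluster
attributes <- names(df)[3:ncol(df)]

output <- NULL

for (a in attributes) {

  tmp <- df %>%
    # count observations per attribute level and cluster
    group_by(values = .data[[a]], Cluster) %>%
    summarise(n = n(), .groups = "drop") %>%
    # turn the counts into proportions within each cluster
    group_by(Cluster) %>%
    mutate(Prop = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    # one column per cluster, plus a column naming the attribute
    pivot_wider(names_from = Cluster, values_from = Prop) %>%
    mutate(attribute = a) %>%
    select(attribute, everything())

  output <- rbind(output, tmp)

}

output
```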
```
# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78
2 Gender    M      0.602 0.407 0.22
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75
9 Category  Medium 0.574 0.213 0.185
```
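The same kind of report can also be produced without an explicit loop. The following is a minimal sketch, not the method used above: it reshapes all attribute columns to long format with `pivot_longer()`, counts, and then spreads the clusters back out with `pivot_wider()`. The `toy` tibble and the `report` name are hypothetical, and filling absent levels with 0 is an assumption made for this example.

```r
library(tidyverse)

# Hypothetical toy data with the same layout as df (ID, Cluster, attributes)
toy <- tibble(
  ID      = 1:6,
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender  = c("M", "M", "M", "F", "F", "F"),
  Type    = c("A", "B", "A", "A", "B", "B")
)

report <- toy %>%
  # one row per (observation, attribute) pair
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  count(Cluster, attribute, values) %>%
  # proportions within each cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # one column per cluster; levels absent from a cluster are filled with 0
  pivot_wider(names_from = Cluster, values_from = Prop,
              values_fill = list(Prop = 0))

report
```

One difference from the loop: `count()` sorts its output, so attributes appear in alphabetical order rather than in column order.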

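Another approach, suggested by a reader (Lukasz), is to split the data by cluster and use the `map` functions from the `purrr` package. The drawback of this solution is that the attribute and its level are not reported in separate columns: they stay fused in the row names (for example `Gender.F`).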

```
df %>%
  split(.$Cluster) %>%
  map(select, -c(ID, Cluster)) %>%
  map_depth(2, . %>% table %>% prop.table) %>%
  map(unlist) %>%
  data.frame
```

And we get:

```
                   C1        C2    C3
Gender.F        0.398 0.5933333 0.780
Gender.M        0.602 0.4066667 0.220
Type.A          0.188 0.4133333 0.425
Type.B          0.318 0.1000000 0.365
Type.C          0.390 0.1933333 0.105
Type.D          0.104 0.2933333 0.105
Category.High   0.114 0.6833333 0.065
Category.Low    0.312 0.1033333 0.750
Category.Medium 0.574 0.2133333 0.185
```
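If separate attribute and level columns are needed, the row names of such a result can be split afterwards. Here is a hedged sketch using `rownames_to_column()` and `separate()`. The `wide` data frame is a hypothetical stand-in for the output above, and the approach assumes that attribute names contain no dot.

```r
library(tidyverse)

# Hypothetical stand-in for the data frame returned by the purrr pipeline
wide <- data.frame(
  C1 = c(0.25, 0.75, 0.75, 0.25),
  C2 = c(1, 0, 0, 1),
  row.names = c("Gender.F", "Gender.M", "Type.A", "Type.B")
)

tidy_report <- wide %>%
  # move the row names into a regular column
  rownames_to_column("key") %>%
  # split at the first dot only, so levels that contain a dot stay intact
  separate(key, into = c("attribute", "values"), sep = "\\.", extra = "merge") %>%
  as_tibble()

tidy_report
```

The result has the same `attribute`, `values`, `C1`, `C2` layout as the loop-based report.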


