Predictive Hacks

# How to Report the Distribution of Attributes per Cluster

Let’s say that you have applied a clustering algorithm and would like to report the distribution of the categorical variables per cluster in a “tidy” report. Below is one way to do it in R.

## Generate the Data

Let’s assume that the algorithm produced 3 clusters, “C1”, “C2” and “C3”, and that each observation has 3 categorical attributes:

• Gender: “M”, “F”
• Type: “A”, “B”, “C”, “D”
• Category: “High”, “Medium”, “Low”

```
library(tidyverse)

set.seed(5)

# Cluster C1: 500 observations
df1 <- tibble(ID = seq_len(500)) %>%
  mutate(Cluster = "C1",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.6, 0.4)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.20, 0.3, 0.4, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.6, 0.3)))

# Cluster C2: 300 observations
df2 <- tibble(ID = seq_len(300)) %>%
  mutate(Cluster = "C2",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.4, 0.6)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.40, 0.1, 0.2, 0.3)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.7, 0.2, 0.1)))

# Cluster C3: 200 observations
df3 <- tibble(ID = seq_len(200)) %>%
  mutate(Cluster = "C3",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.2, 0.8)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.5, 0.3, 0.1, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.2, 0.7)))

# Stack the three clusters into one data frame
df <- bind_rows(df1, df2, df3)

df
```
```
# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>
 1     1 C1      M      C     Medium
 2     2 C1      F      C     Medium
 3     3 C1      F      C     Medium
 4     4 C1      M      B     Low
 5     5 C1      M      B     Low
 6     6 C1      F      C     Medium
 7     7 C1      M      C     Medium
 8     8 C1      F      B     High
 9     9 C1      F      C     Medium
10    10 C1      M      A     Medium
# ... with 990 more rows
```
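
The three data-generation blocks differ only in the cluster name, the size and the probabilities, so they could be folded into a small helper. The sketch below is one possible way to do that; `make_cluster()` and its argument names are hypothetical and not part of the original post, and the result is not claimed to reproduce the exact rows shown above.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: simulate one cluster of `size` rows with the
# given level probabilities for each attribute
make_cluster <- function(name, size, p_gender, p_type, p_category) {
  tibble(ID = seq_len(size)) %>%
    mutate(
      Cluster  = name,
      Gender   = sample(c("M", "F"), size, replace = TRUE, prob = p_gender),
      Type     = sample(c("A", "B", "C", "D"), size, replace = TRUE, prob = p_type),
      Category = sample(c("High", "Medium", "Low"), size, replace = TRUE, prob = p_category)
    )
}

set.seed(5)

sim <- bind_rows(
  make_cluster("C1", 500, c(0.6, 0.4), c(0.2, 0.3, 0.4, 0.1), c(0.1, 0.6, 0.3)),
  make_cluster("C2", 300, c(0.4, 0.6), c(0.4, 0.1, 0.2, 0.3), c(0.7, 0.2, 0.1)),
  make_cluster("C3", 200, c(0.2, 0.8), c(0.5, 0.3, 0.1, 0.1), c(0.1, 0.2, 0.7))
)

# Quick sanity check on the cluster sizes
count(sim, Cluster)
```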

## Report the Distribution of Attributes

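```
# Every column after ID and Cluster is treated as an attribute
attributes <- names(df)[3:ncol(df)]

output <- NULL

for (a in attributes) {

  # Proportion of each level of attribute `a` within each cluster,
  # with one column per cluster
  tmp <- df %>%
    group_by(across(all_of(c(a, "Cluster")))) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(Cluster) %>%
    mutate(Prop = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = Cluster, values_from = Prop) %>%
    mutate(attribute = a) %>%
    select(attribute, everything())
  colnames(tmp)[1:2] <- c("attribute", "values")

  output <- rbind(output, tmp)

}

output
```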
```
# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78
2 Gender    M      0.602 0.407 0.22
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75
9 Category  Medium 0.574 0.213 0.185
```
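
The same kind of report can probably be built without a loop by reshaping the attributes into long format first. The sketch below is one possible variant, not the method used above. It runs on a small hand-made data frame (`toy`, with made-up values) and assumes that all attribute columns share the same type (character here), which `pivot_longer()` needs in order to stack them.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the clustered data frame (made-up values)
toy <- tibble(
  ID      = 1:6,
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender  = c("M", "M", "M", "F", "F", "F"),
  Type    = c("A", "B", "B", "B", "A", "B")
)

report <- toy %>%
  # One row per (observation, attribute) pair
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  count(Cluster, attribute, values) %>%
  # Proportions are computed within each cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cluster; levels absent from a cluster get 0 instead of NA
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0)

report
```

Here `values_fill = 0` is a design choice: a level that never occurs in a cluster is reported as a proportion of 0 rather than as a missing value.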

Another approach is to use the `map` functions from the `purrr` package. The drawback of this solution is that the attribute and its level are not reported in separate columns; they end up combined in the row names (e.g. `Gender.F`).

```
df %>%
  split(.$Cluster) %>%                           # one data frame per cluster
  map(select, -c(ID, Cluster)) %>%               # keep only the attribute columns
  map_depth(2, . %>% table %>% prop.table) %>%   # proportions per attribute
  map(unlist) %>%
  data.frame
```

And we get:

```
                   C1        C2    C3
Gender.F        0.398 0.5933333 0.780
Gender.M        0.602 0.4066667 0.220
Type.A          0.188 0.4133333 0.425
Type.B          0.318 0.1000000 0.365
Type.C          0.390 0.1933333 0.105
Type.D          0.104 0.2933333 0.105
Category.High   0.114 0.6833333 0.065
Category.Low    0.312 0.1033333 0.750
Category.Medium 0.574 0.2133333 0.185
```
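
If separate columns are needed, the combined row names can be split after the fact. The sketch below shows one way to do it on a small stand-in data frame (`wide`, with made-up values) that mimics the shape of the output above. It assumes that attribute names contain no dot; with `extra = "merge"`, only the first dot is used as the separator, so dots inside level names are kept.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up stand-in for the data frame returned by the split/map pipeline
wide <- data.frame(
  C1 = c(0.4, 0.6, 0.3, 0.7),
  C2 = c(0.5, 0.5, 0.9, 0.1),
  row.names = c("Gender.F", "Gender.M", "Type.A", "Type.B")
)

tidy_report <- wide %>%
  # Move the row names into a regular column
  rownames_to_column("key") %>%
  # Split "Gender.F" into attribute = "Gender" and values = "F"
  separate(key, into = c("attribute", "values"), sep = "\\.", extra = "merge") %>%
  as_tibble()

tidy_report
```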
