Predictive Hacks

How To Report The Distribution Of Attributes Per Cluster


This is a useful add-on report for any clustering project: a single summarised pandas DataFrame that shows how each attribute is distributed across the clusters.

Generate Data

Let’s assume that we came up with 4 clusters, labelled “0”, “1”, “2” and “3”, and that we have 2 attributes:

Age: [<30], [30-65], [65+]
Gender: f, m

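import pandas as pd
import numpy as np

# Simulate 200 observations with a cluster label and two categorical attributes
df = pd.DataFrame(
    {
        'Clusters': np.random.choice(['0', '1', '2', '3'], 200, p=[0.3, 0.2, 0.2, 0.3]),
        'Gender': np.random.choice(['m', 'f'], 200, p=[0.6, 0.4]),
        'Age': np.random.choice(['[<30]', '[30-65]', '[65+]'], 200, p=[0.3, 0.6, 0.1]),
        'Response': np.random.binomial(1, size=200, p=0.2),  # not used in the report below
    }
)
# Turn the row index into an explicit 'id' column
df = df.reset_index().rename(columns={'index': 'id'})
# Show the columns used in the report
df[['id', 'Clusters', 'Gender', 'Age']].head()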
   id Clusters Gender      Age
0   0        0      f  [30-65]
1   1        3      m  [30-65]
2   2        3      f    [65+]
3   3        3      m    [<30]
4   4        0      m  [30-65]
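
The draws above are unseeded, so the table changes on every run. For a report that can be reproduced, one option is a seeded NumPy generator. The sketch below is only an illustration: the helper name `make_data` and the seed value 42 are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd


def make_data(seed=42, n=200):
    """Hypothetical helper: same columns as above, drawn from a seeded generator."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "Clusters": rng.choice(["0", "1", "2", "3"], n, p=[0.3, 0.2, 0.2, 0.3]),
            "Gender": rng.choice(["m", "f"], n, p=[0.6, 0.4]),
            "Age": rng.choice(["[<30]", "[30-65]", "[65+]"], n, p=[0.3, 0.6, 0.1]),
            "Response": rng.binomial(1, 0.2, size=n),
        }
    )
    # Turn the row index into an explicit 'id' column, as in the original code
    return df.reset_index().rename(columns={"index": "id"})


df_seeded = make_data()
# Two calls with the same seed should give identical data
print(df_seeded.equals(make_data()))
```

The numbers will not match the tables in this post, which came from an unseeded run.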

Report the Distribution of Attributes

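features = ['Gender', 'Age']
parts = []
for i in features:
    print(i)
    # Count the distinct ids per (cluster, attribute value) pair
    x = df.groupby(['Clusters', i])['id'].nunique().reset_index()
    # One row per attribute value, one column per cluster
    x = x.pivot_table(columns='Clusters', index=i, values='id')
    # Normalise each row so the shares across clusters sum to 1
    x = x.apply(lambda row: row / row.sum(), axis=1)
    x['feature'] = i
    x = x.reset_index().rename(columns={i: 'value'})[['feature', 'value', '0', '1', '2', '3']]
    parts.append(x)

# DataFrame.append was removed in pandas 2.0, so stack the pieces with pd.concat
dist = pd.concat(parts)
dist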
Clusters feature    value         0         1         2         3
0         Gender        f  0.246753  0.220779  0.142857  0.389610
1         Gender        m  0.292683  0.170732  0.227642  0.308943
0            Age  [30-65]  0.225225  0.198198  0.225225  0.351351
1            Age    [65+]  0.416667  0.250000  0.125000  0.208333
2            Age    [<30]  0.307692  0.153846  0.169231  0.369231
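
Each row of the table above sums to 1, so it reads as "of all records with this attribute value, what share falls in each cluster". If you instead want the make-up of each cluster (every cluster column summing to 1 within a feature), a minimal sketch with `pd.crosstab` and `normalize='columns'` could look like the following. The `toy` DataFrame is made-up data for illustration only, and `per_cluster` is a hypothetical name.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame(
    {
        "Clusters": ["0", "0", "0", "1", "1", "1", "1", "0"],
        "Gender": ["f", "m", "m", "f", "f", "f", "m", "m"],
        "Age": ["[<30]", "[30-65]", "[30-65]", "[65+]",
                "[<30]", "[30-65]", "[30-65]", "[<30]"],
    }
)

parts = []
for feature in ["Gender", "Age"]:
    # normalize='columns' divides each count by its cluster total,
    # so every cluster column sums to 1 within a feature
    share = pd.crosstab(toy[feature], toy["Clusters"], normalize="columns")
    share = share.rename_axis("value").reset_index()
    share.insert(0, "feature", feature)
    parts.append(share)

per_cluster = pd.concat(parts, ignore_index=True)
print(per_cluster)
```

Which normalisation to report depends on the question: row shares show where an attribute value ends up, while column shares describe the profile of each cluster.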
