A handy add-on report for a clustering project is the distribution of the attributes across the clusters, summarised in a single pandas DataFrame.
Generate Data
Let’s assume that we came up with 4 clusters, labelled “0”, “1”, “2” and “3”, and that we have 2 attributes:
Age: [<30], [30-65], [65+]
Gender: f, m
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        {
            'Clusters': np.random.choice(["0", "1", '2', '3'], 200, p=[0.3, 0.2, 0.2, 0.3]),
            'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
            'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
            "Response": np.random.binomial(1, size=200, p=0.2)
        }
    )
    # Turn the default integer index into an explicit 'id' column
    df = df.reset_index().rename(columns={'index': 'id'})
    df.head()
       id Clusters Gender      Age
    0   0        0      f  [30-65]
    1   1        3      m  [30-65]
    2   2        3      f    [65+]
    3   3        3      m    [<30]
    4   4        0      m  [30-65]
Report the Distribution of Attributes
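    features = ['Gender', 'Age']
    parts = []
    for i in features:
        print(i)
        # Count the unique ids per (cluster, attribute value) pair
        x = df.groupby(['Clusters', i])['id'].nunique().reset_index()
        # One row per attribute value, one column per cluster
        x = x.pivot_table(columns='Clusters', index=i, values='id')
        # Normalise each row so the cluster shares of an attribute value sum to 1
        x = x.apply(lambda row: row / row.sum(), axis=1)
        x['feature'] = i
        x = x.reset_index().rename(columns={i: 'value'})[['feature', 'value', '0', '1', '2', '3']]
        parts.append(x)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
    dist = pd.concat(parts)
    dist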
    Clusters feature    value         0         1         2         3
    0         Gender        f  0.246753  0.220779  0.142857  0.389610
    1         Gender        m  0.292683  0.170732  0.227642  0.308943
    0            Age  [30-65]  0.225225  0.198198  0.225225  0.351351
    1            Age    [65+]  0.416667  0.250000  0.125000  0.208333
    2            Age    [<30]  0.307692  0.153846  0.169231  0.369231
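
Note that each row of the report sums to 1: for a given attribute value, the figures are the shares of that value falling into each cluster. If you would rather read the report per cluster (what share of cluster 0 is female, and so on), the normalisation has to run down the columns. Below is a minimal sketch of both variants using pd.crosstab in place of the groupby and pivot steps. It counts rows rather than unique ids, which should be equivalent here because every row has its own id. The helper name cluster_report, its parameters and the fixed random seed are assumptions made for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild a sample like the one above; the fixed seed is an assumption
# added only so that the sketch is reproducible.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Clusters": rng.choice(["0", "1", "2", "3"], n, p=[0.3, 0.2, 0.2, 0.3]),
    "Gender": rng.choice(["m", "f"], n, p=[0.6, 0.4]),
    "Age": rng.choice(["[<30]", "[30-65]", "[65+]"], n, p=[0.3, 0.6, 0.1]),
})


def cluster_report(data, features, cluster_col="Clusters", normalize="index"):
    """Hypothetical helper: stack one normalised crosstab per feature.

    normalize="index"   -> each row (attribute value) sums to 1
    normalize="columns" -> each cluster column sums to 1 within a feature
    """
    parts = []
    for feature in features:
        tab = pd.crosstab(data[feature], data[cluster_col], normalize=normalize)
        tab.columns.name = None                      # drop the 'Clusters' label
        tab = tab.rename_axis("value").reset_index() # attribute values as a column
        tab.insert(0, "feature", feature)
        parts.append(tab)
    return pd.concat(parts, ignore_index=True)


by_value = cluster_report(df, ["Gender", "Age"], normalize="index")
by_cluster = cluster_report(df, ["Gender", "Age"], normalize="columns")
print(by_value)
print(by_cluster)
```

The first table has the same layout as the report above, and the second answers the per-cluster question. Which one is more useful depends on whether you want to profile the clusters or to see where each attribute value ends up.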