Predictive Hacks

# How to determine the number of Clusters for K-Means in R

We will work with the Breast Cancer Wisconsin dataset, where we will apply the K-Means algorithm to the individuals' features, ignoring the dependent variable `diagnosis`. Notice that all the features are numeric.

```
library(tidyverse)

# the column names of the dataset
names <- c('id_number', 'diagnosis', 'radius_mean',
           'texture_mean', 'perimeter_mean', 'area_mean',
           'smoothness_mean', 'compactness_mean',
           'concavity_mean', 'concave_points_mean',
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se',
           'area_se', 'smoothness_se', 'compactness_se',
           'concavity_se', 'concave_points_se',
           'symmetry_se', 'fractal_dimension_se',
           'radius_worst', 'texture_worst',
           'perimeter_worst', 'area_worst',
           'smoothness_worst', 'compactness_worst',
           'concavity_worst', 'concave_points_worst',
           'symmetry_worst', 'fractal_dimension_worst')

# get the data from the URL (UCI repository) and assign the column names
df <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
               col_names = names)
```

Before we apply any cluster analysis, we should scale our data. We will remove the `id_number` and the `diagnosis` columns:

```
scaled_data <- as.data.frame(scale(df %>% select(-id_number, -diagnosis)))
```

## Elbow Method

In a previous post, we explained how to apply the Elbow Method in Python. Here, we will use `map_dbl` to run `kmeans` on the `scaled_data` for `k` values ranging from 1 to 10 and extract the total within-cluster sum of squares from each model. We can then visualize the relationship with a line plot to create the elbow plot, where we look for a sharp decline from one k to the next followed by a more gradual decrease in slope. The last value of k before the slope of the plot levels off suggests a "good" value of k.

```
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = scaled_data, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = 1:10)
```

According to the Elbow Method, we can argue that the suggested number of clusters K is 2.

## Silhouette Analysis

Silhouette analysis allows you to calculate how similar each observation is to the cluster it is assigned to, relative to the other clusters. This metric ranges from -1 to 1 for each observation in your data and can be interpreted as follows:

• Values close to 1 suggest that the observation is well matched to the assigned cluster
• Values close to 0 suggest that the observation is borderline matched between two clusters
• Values close to -1 suggest that the observation may be assigned to the wrong cluster
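These per-observation values can be computed directly with `cluster::silhouette()`. A minimal sketch on simulated data (the two-group toy set below is an illustration, not the cancer dataset):

```r
library(cluster)

# simulate two well-separated groups so the silhouette values are easy to read
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2)

# silhouette() needs the cluster labels and a distance matrix
sil <- silhouette(km$cluster, dist(x))

# per-observation widths and the overall average
head(sil[, "sil_width"])
mean(sil[, "sil_width"])
```

With clearly separated groups like these, the average width is close to 1; on real data it is usually much lower.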

We can determine the number of clusters K using the average silhouette width. We pick the K which maximizes that score.

```
# pam() comes from the cluster package
library(cluster)

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = scaled_data, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = 2:10)
```

As we can see from the plot above, the "best" k is 2.

## Gap Statistic

The gap statistic compares the total intracluster variation for different values of k with its expected value under a null reference distribution of the data (i.e., a distribution with no obvious clustering). The reference dataset is generated using Monte Carlo simulations of the sampling process.

```
library(factoextra)
library(cluster)

# compute the gap statistic
set.seed(123)
gap_stat <- clusGap(scaled_data, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)

fviz_gap_stat(gap_stat)
```

Again, according to the Gap Statistic, the optimal number of clusters is k=2.
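Rather than reading the plot, the suggested k can also be extracted programmatically with `cluster::maxSE()`, which applies the "smallest k within one standard error of the first local maximum" rule to the gap values. A sketch on simulated data (the toy set is an assumption standing in for the scaled dataset):

```r
library(cluster)

# simulate two well-separated groups; B is kept low for speed
set.seed(123)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

gap <- clusGap(x, FUN = kmeans, nstart = 25, K.max = 6, B = 20)

# the Tab matrix holds the gap values and their simulation standard errors
k_best <- with(gap, maxSE(Tab[, "gap"], Tab[, "SE.sim"]))
k_best
```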

## All 3 methods in one package

Let’s see how we can produce the same analysis for the three methods above with a few lines of code!

```
library(factoextra)
library(NbClust)

# Elbow method
fviz_nbclust(scaled_data, kmeans, method = "wss") +
geom_vline(xintercept = 2, linetype = 2)+
labs(subtitle = "Elbow method")

# Silhouette method
fviz_nbclust(scaled_data, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")

# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(scaled_data, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
```

## Visualize the K-Means

Since we determined that the number of clusters should be 2, we can run the k-means algorithm with k=2. Let’s visualize our data in two dimensions.

```
fviz_cluster(kmeans(scaled_data, centers = 2), geom = "point", data = scaled_data)
```

## Clusters and Classes in the same plot

Based on the analysis above, the suggested number of clusters in K-means was 2. Bear in mind that our dataset also contains the dependent variable `diagnosis`, which takes the values `B` and `M`. Let’s represent in the same plot the `Clusters (k=2)` and the `Classes (B,M)`.

We will apply PCA by keeping the first two PCs.

```
# get the PCA of the scaled data
pca_res <- prcomp(scaled_data)

# add the Cluster to the original data frame
# (note: re-running kmeans here; cluster labels may differ between runs)
df$cluster <- as.factor(kmeans(scaled_data, centers = 2)$cluster)

# add the PC1 and PC2 to the original data frame
df$PC1 <- pca_res$x[, 1]
df$PC2 <- pca_res$x[, 2]

# scatter plot of PC1 vs PC2 colored by cluster,
# in separate panels for diagnosis M and B
ggplot(df, aes(x = PC1, y = PC2, col = cluster)) +
  geom_point() +
  facet_grid(. ~ diagnosis)
```

As we can see, the majority of patients with a "Benign" tumor were in the first cluster, while the patients with a "Malignant" tumor were mostly in the second cluster.
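This agreement can also be quantified with a contingency table of cluster versus class (on the real data frame that would be `table(df$cluster, df$diagnosis)`). A self-contained sketch on simulated stand-in data:

```r
# simulate two groups with known labels standing in for B/M
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
labels <- rep(c("B", "M"), each = 20)

cl <- kmeans(x, centers = 2)$cluster

# rows are clusters, columns are classes; a near-diagonal (or
# near-anti-diagonal) table means clusters line up with the classes
tab <- table(cluster = cl, class = labels)
tab
```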
