Determining the optimal number of clusters in the K-means method
K-means is a clustering method that groups statistically similar data points through an iterative procedure. The "K" in the algorithm's name indicates the number of clusters. Each point is assigned to a cluster according to how close it is to that cluster's center.
Basic Rule:
The distance between clusters is maximized; the distance within clusters is minimized.
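To make the iteration concrete, here is a minimal base-R sketch of a single assignment-and-update step (illustrative only; it ignores empty clusters and convergence checks, and in practice we use the built-in kmeans() function):
set.seed(1)
x <- matrix(rnorm(50), ncol = 2)    # 25 toy points with 2 features
k <- 3
centers <- x[sample(nrow(x), k), ]  # k random points as initial centers
# Assignment step: each point joins its nearest center
d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
cluster <- apply(d, 1, which.min)
# Update step: each center moves to the mean of its assigned points
centers <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))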
How to determine K value:
Let's imagine a little 😊 Suppose my sample size is 25 and I set k (the number of clusters) to 25. Then I can be sure the within-cluster distance would be zero (well, what a super idea), but is that really what we wanted? What we actually want is just the optimum number of clusters: enough to disregard small details, reduce complexity, and form meaningful groups. I think this is the real problem in professional life... As a result, it is important to determine the value of k correctly in order to make our work genuinely interpretable.
Now let's see how to determine the optimal value of k.
We have a data set of 25 samples with three important dimensions: the Data, Voice, and SMS usage of each customer. First, let's load the dataset in R and run a descriptive analysis.
library(tidyverse)    # dplyr, tibble, %>%
library(factoextra)   # get_dist(), fviz_dist(), fviz_cluster(), fviz_nbclust()
library(funModeling)  # profiling_num()
customer <- read.csv("customer.csv", header = TRUE, sep = ";", row.names = 1)
df <- customer
summary(df)
df <- scale(df)  # standardize the variables before clustering
as_tibble(df)
rownames(df)
profiling_num(customer)  # detailed descriptive statistics
# Visualize the pairwise distance matrix
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df , centers = 3, nstart = 25)
k4 <- kmeans(df , centers = 4, nstart = 25)
k5 <- kmeans(df , centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k=2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k=3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k=4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k=5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Elbow Method
This is the most widely preferred method and shows us the big picture. WCSS (Within-Cluster Sum of Squares) is calculated as the sum of the squared distances of each point from the center of its cluster. The Elbow Method looks for the point, called the elbow, where the rate of decrease in WCSS slows sharply; that elbow point is the optimum.
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
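For the curious, the same curve can be computed by hand; here is a short sketch (assuming purrr, which ships with the tidyverse loaded above) of the total WCSS for K = 1 to 10:
library(purrr)
# Total within-cluster sum of squares for each candidate K
wss <- map_dbl(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")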
Average Silhouette Index
Rousseeuw (1987) proposed the Silhouette index, which quantifies how well each point fits its own cluster.
a(i): the average distance of point i to all other points in its own cluster.
b(i): the minimum, over the other clusters, of the average distance of point i to all points in that cluster.
The formula for the Silhouette index is:
Sil(i) = (b(i) - a(i)) / max(a(i), b(i))
Sil(i) ranges from -1 to +1. The closer the value is to +1, the more properly the clusters are separated.
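As a sketch, the average silhouette width can be computed with the cluster package's silhouette() function for a single K, or compared across candidate K values with factoextra's fviz_nbclust():
library(cluster)  # silhouette(); also provides clusGap() used below
# Average silhouette width for the K = 2 solution fitted earlier
sil <- silhouette(k2$cluster, dist(df))
mean(sil[, "sil_width"])
# Compare the average silhouette width across candidate K values
fviz_nbclust(df, kmeans, method = "silhouette")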
Gap Statistic
The gap statistic compares log(Wk), the log of the within-cluster sum of squares of the actual data, with its expected value under a null reference (uniform) distribution, which is estimated by bootstrapping:
Gap(k) = E*[log(Wk)] - log(Wk)
In a nutshell, the further the observed clustering departs from the uniform reference, the more meaningful it is to us.
set.seed(123)
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 5, B = 25)
print(gap_stat, method = "firstmax")
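The result can also be plotted with factoextra's fviz_gap_stat(), which shows the gap value for each K:
fviz_gap_stat(gap_stat)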
Having determined K = 5, let's fit the final model and look at its summary:
set.seed(123)
final <- kmeans(df, 5, nstart = 25)
print(final)
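A final cluster plot (a fviz_cluster() call like the ones above) makes the five groups easy to see:
fviz_cluster(final, data = df)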
# Mean of each variable per cluster, on the original (unscaled) values
customer %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
Finally:
There are many similar methods; the important thing is to understand why each one works, choose the most meaningful one for your problem, and explain your choice with reasons.
Good luck with your studies!