Unsupervised machine learning is a branch of machine learning where models are trained on data without labelled outcomes. Unlike supervised learning, where the goal is to predict a known target, unsupervised learning focuses on discovering hidden patterns, structures, or relationships within the data.
Common tasks in unsupervised learning include:
- Clustering (grouping similar data points)
- Dimensionality reduction
Clustering is the process of grouping data points such that points within the same cluster are similar and points in different clusters are dissimilar.
Similarity is usually measured using distance metrics such as the following (a short worked example follows the list):
- Euclidean distance (most common)
- Manhattan distance
- Cosine similarity
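As a quick illustration, here is a small sketch computing the three measures above for two made-up vectors; the vectors a and b are purely illustrative.
# Comparing the three distance measures on two made-up vectors
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([2.0, 4.0, 1.0])
b = np.array([5.0, 1.0, 3.0])

print("Euclidean:", euclidean(a, b))            # straight-line distance
print("Manhattan:", cityblock(a, b))            # sum of absolute differences
print("Cosine similarity:", 1 - cosine(a, b))   # SciPy returns cosine *distance*, so subtract from 1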
K-Means Clustering.
K-Means is a partition-based clustering algorithm that divides data into K distinct clusters, where K is predefined. The goal is to minimize the within-cluster variance, also called inertia.
How K-Means Works
- Choose K (number of clusters) - Example: K = 3
- Initialize centroids randomly - These are K points representing cluster centers.
- Assign data points to nearest centroid - Each point is assigned to the cluster with the closest centroid (using distance, usually Euclidean).
- Update centroids - Compute the new centroid as the mean of all points in that cluster.
- Repeat - Iterate steps 3 and 4 until the centroids stop changing or the maximum number of iterations is reached (a minimal sketch of this loop follows below).
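To make the loop concrete, here is a minimal NumPy sketch of the assign/update steps. The array X, the value of K and the random data are placeholders for illustration only; the actual workflow below uses scikit-learn's KMeans.
# Minimal K-Means loop (illustrative sketch, not the scikit-learn implementation)
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((200, 2))      # made-up data: 200 points, 2 features
K = 3

centroids = X[rng.choice(len(X), K, replace=False)]   # step 2: random initial centroids
for _ in range(100):                                  # step 5: iterate
    # step 3: assign each point to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 4: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
    if np.allclose(new_centroids, centroids):         # stop when centroids stop changing
        break
    centroids = new_centroids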
K-Means Workflow.
Scaling data. K-Means relies on distance calculations, so features on larger scales would dominate the clustering; standardizing puts every feature on a comparable scale.
#scaling data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
Finding the best K.
# elbow method - finding the best number of clusters (K)
from sklearn.cluster import KMeans
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
Plotting the elbow curve. This helps identify the best K - where the curve starts to plateau.
#Plotting the elbow curve
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), inertias, 'o--', linewidth=2, markersize=8)
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow method - Finding the optimal K')
plt.xticks(range(1, 11))
plt.grid(alpha=0.3)
plt.show()
Elbow curve. The curve flattens after K = 4, so 4 is our best K.

Training the model
Fit the final model with 4 clusters, then store each customer's cluster label in a new column named 'cluster'.
km_final = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['cluster'] = km_final.fit_predict(X_scaled)
customers
Profiling clusters.
This step summarizes each cluster so you can understand what it actually represents after K-Means has created it.
# profile for each cluster
profile = customers.groupby('cluster').agg({
    'annual_spend': 'mean',
    'visit_frequency': 'mean',
    'avg_basket_size': 'mean',
    'cluster': 'count'
}).rename(columns={'cluster': 'count'}).round(0)
print("Cluster profiles")
print(profile)
Visualization of the clusters and the centroids.
# colours and descriptive names for the four segments
colours = ['#3498db', '#2ecc71', '#e74c3c', '#12f3a4']
labels = ['Mid-value', 'VIP customers', 'Low spenders', 'Occasional']

plt.figure(figsize=(9, 5))
for c in range(4):
    mask = customers['cluster'] == c
    plt.scatter(
        customers[mask]['annual_spend'],
        customers[mask]['visit_frequency'],
        c=colours[c],
        label=labels[c],
        alpha=0.7,
        s=50
    )

# transform the centroids back to the original units for plotting
centroids_orig = scaler.inverse_transform(km_final.cluster_centers_)
plt.scatter(
    centroids_orig[:, 0],
    centroids_orig[:, 1],
    s=200,
    marker='X',
    c='black',
    zorder=5,
    label='Centroids'
)
plt.xlabel('Annual Spend (KES)')
plt.ylabel('Visit Frequency (per month)')
plt.title('Customer Segments - K-Means (K=4)')
plt.legend()
plt.grid(alpha=0.2)
plt.show()
Advantages of K-Means
- Simple and fast
- Works well on large datasets
- Easy to interpret
Limitations of K-Means
- Must specify K in advance
- Sensitive to initial centroid placement and outliers
- Assumes clusters are spherical and equally sized
Hierarchical Clustering.
Hierarchical clustering builds a tree-like structure of clusters, called a dendrogram. Unlike K-Means, it does not require specifying the number of clusters upfront.
There are two types:
- Agglomerative (bottom-up) – most common
- Divisive (top-down)
Agglomerative clustering
Steps
- Start with all points separate: Treat each data point as its own cluster (A, B, C, ...). Initially, you have n clusters for n data points.
- Compute pairwise distances: Calculate the distance between every pair of clusters. Common choices include Euclidean, Manhattan or Cosine distance. Store these values in a distance matrix. To learn more about them, refer to Measures of Distance.
- Merge the nearest clusters: Identify the two clusters that are closest based on the chosen linkage method such as single, complete, average or Ward linkage. Combine them into a single new cluster.
- Update distances: Recalculate the distances between the newly formed cluster and all remaining clusters. Use the same linkage rule to ensure consistency.
- Repeat the process: Continue merging clusters and updating distances iteratively. Stop when you reach a predefined number of clusters (k) or a distance threshold.
- Visualize the results: Create a dendrogram to visualize how clusters merged at each step. Choose a suitable cut on the dendrogram to obtain the final cluster groups. (A small worked example of these steps follows below.)
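As a small illustration of the steps above, the sketch below builds the pairwise distance matrix for five made-up points and then lets SciPy carry out the merges; the linkage matrix Z records, for each step, which two clusters were merged, at what distance, and how many points the new cluster contains. The points here are invented purely for illustration.
# Illustrating the agglomerative steps on five made-up points
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

points = np.array([[1, 2], [1, 3], [5, 8], [6, 8], [9, 1]])

# Step 2: pairwise distances stored in a distance matrix
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(np.round(dist_matrix, 2))

# Steps 3-5: repeatedly merge the nearest clusters (Ward linkage here)
Z = linkage(points, method='ward')
print(Z)   # each row: cluster i, cluster j, merge distance, size of the new cluster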
Linkage methods
How we measure the distance between clusters (a short comparison of these methods follows the list).
- Single Linkage: Minimum distance between points
- Complete Linkage: Maximum distance
- Average Linkage: Average distance
- Ward’s Method: Minimizes variance (most common)
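Different linkage rules can merge the same data quite differently. The sketch below, using made-up random points purely for illustration, compares the height of the final merge under each method; Ward tends to produce compact, variance-minimizing clusters, which is why the workflow below uses it.
# Comparing linkage methods on the same made-up data
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.random((20, 2))   # 20 made-up 2-D points

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method)
    # Z[-1, 2] is the distance at which the final two clusters were merged
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.2f}")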
Dendrogram
A dendrogram is a tree diagram that shows:
- How clusters are merged
- At what distance they are merged
You can “cut” the dendrogram at a certain height to decide the number of clusters.
Hierarchical model workflow.
Picking a dataset.
# Picking a subset of 60 customers for readability
import numpy as np
subset_ids = np.random.choice(len(X_scaled), 60, replace=False)
X_sub = X_scaled[subset_ids]
Linkage (Ward) to minimize variance.
# Linkage method
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X_sub, method='ward')
Plotting the dendrogram.
# plot dendrogram
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 5))
dendrogram(Z, truncate_mode='level')
plt.axhline(y=6, c='red', linestyle='--', linewidth=1.5, label='cut here for 3 clusters')
plt.legend()
plt.show()
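If you want to turn the cut line into actual labels rather than reading them off the plot, SciPy's fcluster can cut a linkage matrix at a chosen height. A minimal sketch, assuming the Z and the height of 6 from the plot above (note this labels only the 60-customer subset used to build Z):
# Cut the dendrogram at height 6 to get flat cluster labels for the subset
from scipy.cluster.hierarchy import fcluster
sub_labels = fcluster(Z, t=6, criterion='distance')
print(sub_labels[:10])   # cluster label (1-based) for the first 10 subset points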
Fitting the model.
Fit the model with 4 clusters, then store the labels in a new column named 'hc_cluster'.
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
customers['hc_cluster'] = hc.fit_predict(X_scaled)
customers
Profiling.
This step helps understand what each cluster represents.
print('Hierarchical cluster profiles')
print(customers.groupby('hc_cluster')[['annual_spend', 'visit_frequency']].mean().round(0))
Visualization.
A side-by-side comparison of the two models: K-Means and hierarchical clustering.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, col, title in zip(
    axes,
    ['cluster', 'hc_cluster'],
    ['K-Means clusters', 'Hierarchical clusters']
):
    for c in range(4):
        mask = customers[col] == c
        ax.scatter(
            customers[mask]['annual_spend'],
            customers[mask]['visit_frequency'],
            alpha=0.7, s=40, label=f'cluster {c}'
        )
    ax.set_title(title)
    ax.set_xlabel('Annual Spend (KES)')
    ax.set_ylabel('Visit Frequency (per month)')
    ax.legend()
plt.tight_layout()
plt.show()
Advantages of Hierarchical Clustering
- No need to predefine number of clusters
- Produces interpretable tree structure
- Works well for small datasets
Limitations of Hierarchical Clustering
- Computationally expensive (slow for large datasets)
- Once clusters are merged, they cannot be undone
- Sensitive to noise and outliers






