Unsupervised machine learning is a branch of machine learning where models are trained on data without labelled outcomes. Unlike supervised learning, where the goal is to predict a known target, unsupervised learning focuses on discovering hidden patterns, structures, or relationships within the data.
Common tasks in unsupervised learning include:
- Clustering (grouping similar data points)
- Dimensionality reduction
Clustering is the process of grouping data points such that points within the same cluster are similar and points in different clusters are dissimilar.
Similarity is usually measured using distance metrics such as the following (a short worked example follows the list):
- Euclidean distance (most common)
- Manhattan distance
- Cosine similarity
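As a quick illustration, here is a small sketch computing the three measures above for two made-up vectors; the vectors a and b are purely illustrative.
# Comparing the three distance measures on two made-up vectors
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([2.0, 4.0, 1.0])
b = np.array([5.0, 1.0, 3.0])

print("Euclidean:", euclidean(a, b))            # straight-line distance
print("Manhattan:", cityblock(a, b))            # sum of absolute differences
print("Cosine similarity:", 1 - cosine(a, b))   # SciPy returns cosine *distance*, so subtract from 1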
K-Means Clustering.
K-Means is a partition-based clustering algorithm that divides data into K distinct clusters, where K is predefined. The goal is to minimize the within-cluster variance, also called inertia.
How K-Means Works
- Choose K (number of clusters) - Example: K = 3
- Initialize centroids randomly - These are K points representing cluster centers.
- Assign data points to nearest centroid - Each point is assigned to the cluster with the closest centroid (using distance, usually Euclidean).
- Update centroids - Compute the new centroid as the mean of all points in that cluster.
- Repeat - Iterate steps 3 and 4 until the centroids stop changing or the maximum number of iterations is reached (a minimal sketch of this loop follows below).
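To make the loop concrete, here is a minimal NumPy sketch of the assign/update steps. The array X, the value of K and the random data are placeholders for illustration only; the actual workflow below uses scikit-learn's KMeans.
# Minimal K-Means loop (illustrative sketch, not the scikit-learn implementation)
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((200, 2))      # made-up data: 200 points, 2 features
K = 3

centroids = X[rng.choice(len(X), K, replace=False)]   # step 2: random initial centroids
for _ in range(100):                                  # step 5: iterate
    # step 3: assign each point to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 4: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
    if np.allclose(new_centroids, centroids):         # stop when centroids stop changing
        break
    centroids = new_centroids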
K-Means Workflow.
Scaling data. K-Means relies on distance calculations, so features on larger scales would dominate the clustering; standardizing puts every feature on a comparable scale.
#scaling data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
Finding the best K.
# elbow method - finding the best number of clusters (K)
from sklearn.cluster import KMeans
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
Plotting the elbow curve. This helps identify the best K - where the curve starts to plateau.
#Plotting the elbow curve
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), inertias, 'o--', linewidth=2, markersize=8)
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow method - Finding the optimal K')
plt.xticks(range(1, 11))
plt.grid(alpha=0.3)
plt.show()
Elbow curve. The curve flattens after K = 4, so 4 is our best K.

Training the model
Fit the final model with 4 clusters, then store each customer's cluster label in a new column named 'cluster'.
km_final = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['cluster'] = km_final.fit_predict(X_scaled)
customers
Profiling clusters.
This step summarizes each cluster so you can understand what it actually represents after K-Means has created it.
# profile for each cluster
profile = customers.groupby('cluster').agg({
    'annual_spend': 'mean',
    'visit_frequency': 'mean',
    'avg_basket_size': 'mean',
    'cluster': 'count'
}).rename(columns={'cluster': 'count'}).round(0)
print("Cluster profiles")
print(profile)
Visualization of the clusters and the centroids.
# colours and descriptive names for the four segments
colours = ['#3498db', '#2ecc71', '#e74c3c', '#12f3a4']
labels = ['Mid-value', 'VIP customers', 'Low spenders', 'Occasional']

plt.figure(figsize=(9, 5))
for c in range(4):
    mask = customers['cluster'] == c
    plt.scatter(
        customers[mask]['annual_spend'],
        customers[mask]['visit_frequency'],
        c=colours[c],
        label=labels[c],
        alpha=0.7,
        s=50
    )

# transform the centroids back to the original units for plotting
centroids_orig = scaler.inverse_transform(km_final.cluster_centers_)
plt.scatter(
    centroids_orig[:, 0],
    centroids_orig[:, 1],
    s=200,
    marker='X',
    c='black',
    zorder=5,
    label='Centroids'
)
plt.xlabel('Annual Spend (KES)')
plt.ylabel('Visit Frequency (per month)')
plt.title('Customer Segments - K-Means (K=4)')
plt.legend()
plt.grid(alpha=0.2)
plt.show()
Advantages of K-Means
- Simple and fast
- Works well on large datasets
- Easy to interpret
Limitations of K-Means
- Must specify K in advance
- Sensitive to initial centroid placement and outliers
- Assumes clusters are spherical and equally sized
Hierarchical Clustering.
Hierarchical clustering builds a tree-like structure of clusters, called a dendrogram. Unlike K-Means, it does not require specifying the number of clusters upfront.
There are two types:
- Agglomerative (bottom-up) – most common
- Divisive (top-down)
Agglomerative clustering
Steps
- Start with all points separate: Treat each data point as its own cluster (A, B, C, ...). Initially, you have n clusters for n data points.
- Compute pairwise distances: Calculate the distance between every pair of clusters. Common choices include Euclidean, Manhattan or Cosine distance. Store these values in a distance matrix. To learn more about them, refer to Measures of Distance.
- Merge the nearest clusters: Identify the two clusters that are closest based on the chosen linkage method such as single, complete, average or Ward linkage. Combine them into a single new cluster.
- Update distances: Recalculate the distances between the newly formed cluster and all remaining clusters. Use the same linkage rule to ensure consistency.
- Repeat the process: Continue merging clusters and updating distances iteratively. Stop when you reach a predefined number of clusters (k) or a distance threshold.
- Visualize the results: Create a dendrogram to visualize how clusters merged at each step. Choose a suitable cut on the dendrogram to obtain the final cluster groups. (A small worked example of these steps follows below.)
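As a small illustration of the steps above, the sketch below builds the pairwise distance matrix for five made-up points and then lets SciPy carry out the merges; the linkage matrix Z records, for each step, which two clusters were merged, at what distance, and how many points the new cluster contains. The points here are invented purely for illustration.
# Illustrating the agglomerative steps on five made-up points
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

points = np.array([[1, 2], [1, 3], [5, 8], [6, 8], [9, 1]])

# Step 2: pairwise distances stored in a distance matrix
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(np.round(dist_matrix, 2))

# Steps 3-5: repeatedly merge the nearest clusters (Ward linkage here)
Z = linkage(points, method='ward')
print(Z)   # each row: cluster i, cluster j, merge distance, size of the new cluster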
Linkage methods
How we measure the distance between clusters (a short comparison of these methods follows the list).
- Single Linkage: Minimum distance between points
- Complete Linkage: Maximum distance
- Average Linkage: Average distance
- Ward’s Method: Minimizes variance (most common)
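Different linkage rules can merge the same data quite differently. The sketch below, using made-up random points purely for illustration, compares the height of the final merge under each method; Ward tends to produce compact, variance-minimizing clusters, which is why the workflow below uses it.
# Comparing linkage methods on the same made-up data
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.random((20, 2))   # 20 made-up 2-D points

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method)
    # Z[-1, 2] is the distance at which the final two clusters were merged
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.2f}")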
Dendrogram
A dendrogram is a tree diagram that shows:
- How clusters are merged
- At what distance they are merged
You can “cut” the dendrogram at a certain height to decide the number of clusters.
Hierarchical model workflow.
Picking a dataset.
# Picking a subset of 60 customers for readability
import numpy as np
subset_ids = np.random.choice(len(X_scaled), 60, replace=False)
X_sub = X_scaled[subset_ids]
Linkage (Ward) to minimize variance.
# Linkage method
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X_sub, method='ward')
Plotting the dendrogram.
# plot dendrogram
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 5))
dendrogram(Z, truncate_mode='level')
plt.axhline(y=6, c='red', linestyle='--', linewidth=1.5, label='cut here for 3 clusters')
plt.legend()
plt.show()
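If you want to turn the cut line into actual labels rather than reading them off the plot, SciPy's fcluster can cut a linkage matrix at a chosen height. A minimal sketch, assuming the Z and the height of 6 from the plot above (note this labels only the 60-customer subset used to build Z):
# Cut the dendrogram at height 6 to get flat cluster labels for the subset
from scipy.cluster.hierarchy import fcluster
sub_labels = fcluster(Z, t=6, criterion='distance')
print(sub_labels[:10])   # cluster label (1-based) for the first 10 subset points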
Fitting the model.
Fit the model with 4 clusters, then store the labels in a new column named 'hc_cluster'.
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
customers['hc_cluster'] = hc.fit_predict(X_scaled)
customers
Profiling.
This step helps understand what each cluster represents.
print('Hierarchical cluster profiles')
print(customers.groupby('hc_cluster')[['annual_spend', 'visit_frequency']].mean().round(0))
Visualization.
A side-by-side comparison of the two models: K-Means and hierarchical clustering.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, col, title in zip(
    axes,
    ['cluster', 'hc_cluster'],
    ['K-Means clusters', 'Hierarchical clusters']
):
    for c in range(4):
        mask = customers[col] == c
        ax.scatter(
            customers[mask]['annual_spend'],
            customers[mask]['visit_frequency'],
            alpha=0.7, s=40, label=f'cluster {c}'
        )
    ax.set_title(title)
    ax.set_xlabel('Annual Spend (KES)')
    ax.set_ylabel('Visit Frequency (per month)')
    ax.legend()
plt.tight_layout()
plt.show()
Advantages of Hierarchical Clustering
- No need to predefine number of clusters
- Produces interpretable tree structure
- Works well for small datasets
Limitations of Hierarchical Clustering
- Computationally expensive (slow for large datasets)
- Once clusters are merged, they cannot be undone
- Sensitive to noise and outliers






