Lesson 14: Clustering: Discovering Hidden Groups in Data

Introduction

So far in this course we have studied supervised learning.

Supervised learning means:

We know the answer (the target) for every training example.

Examples:

Inputs               Target
Age, BMI             Length of Stay
Inventory, Price     Sales
Customer Features    Churn

The model learns from known outcomes.


But what if no outcome exists?

Suppose we have customer data:

Customer   Recency   Frequency   Monetary
A          5         120         50000
B          300       10          3000
C          20        200         100000

Question:

Are there natural customer groups?

We do not know the answer beforehand.

This is called:

Unsupervised Learning

One of the most important unsupervised methods is:

Clustering

What is Clustering?

Clustering attempts to find groups of similar observations.

Example:

Customer A
Customer B
Customer C
Customer D

The algorithm may discover:

Cluster 1
High Value Customers
Cluster 2
Occasional Customers
Cluster 3
Dormant Customers

without being told these groups exist.


Real-World Applications

Healthcare

Identify:

  • High-risk patients
  • Patient subtypes
  • Disease phenotypes

Supply Chain

Identify:

  • Fast-moving SKUs
  • Slow-moving SKUs
  • High-value customers
  • Similar retailers

Marketing

Identify:

  • Customer segments
  • Purchasing behaviors
  • Loyalty groups

The Most Popular Clustering Algorithm

The most common clustering method is:

K-Means Clustering

The “K” represents:

Number of clusters

Example:

K = 3

means:

Find 3 groups

The Basic Idea

Suppose we have customers.

The algorithm:

Step 1

Chooses K initial cluster centers (for example, at random).

Cluster A
Cluster B
Cluster C

Step 2

Assigns each customer to the nearest cluster.

Step 3

Recalculates each cluster center as the mean of the customers assigned to it.

Step 4

Repeats Steps 2 and 3 until the assignments stop changing.
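
The four steps above can be sketched in a few lines of NumPy. This is a minimal teaching sketch, not the scikit-learn implementation used later in the lesson: the function name kmeans_sketch, the random choice of starting centers, and the two-dimensional toy points are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-Means for illustration: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy groups in two dimensions (made-up points).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Which group is called 0 and which is called 1 depends on the starting centers, so only the grouping itself is meaningful.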


Example Dataset

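Suppose we have RFM metrics: Recency (how recently a customer purchased), Frequency (how often they purchase), and Monetary (how much they spend).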

import pandas as pd

# Toy RFM data: one row per customer
customers = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
})

Why Scaling Matters

Variables are often measured on different scales.

Example:

Recency:    0 - 365
Frequency:  0 - 1000
Monetary:   0 - 1000000

Without scaling:

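Monetary dominates the clustering, because K-Means measures similarity with Euclidean distance and the variable with the largest numeric range contributes the most to that distance.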
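
A quick calculation makes the problem concrete. The sketch below takes two customers from the example dataset and measures how much each variable contributes to the squared Euclidean distance between them; the variable names a, b, and share are only illustrative.

```python
import numpy as np

# Two customers as [Recency, Frequency, Monetary], taken from the example data.
a = np.array([5, 120, 50000], dtype=float)
b = np.array([300, 10, 2000], dtype=float)

squared_diffs = (a - b) ** 2               # per-variable squared differences
distance = np.sqrt(squared_diffs.sum())    # unscaled Euclidean distance
share = squared_diffs / squared_diffs.sum()

print(round(distance, 1))
for name, s in zip(["Recency", "Frequency", "Monetary"], share):
    print(f"{name}: {s:.4%} of the squared distance")
```

Almost the entire distance comes from Monetary, even though the two customers also differ sharply in Recency and Frequency.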

Standardizing Variables

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)

Now every variable has mean 0 and standard deviation 1, so all three contribute on a comparable scale.
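
Standardizing replaces each value x with z = (x - mean) / standard deviation, computed column by column. As a small check under that definition, the sketch below reproduces the StandardScaler result by hand on the example data; note that StandardScaler uses the population standard deviation, hence ddof=0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
})

# Manual z-score: subtract the column mean, divide by the column std (ddof=0).
manual = (customers - customers.mean()) / customers.std(ddof=0)

X_scaled = StandardScaler().fit_transform(customers)

print(np.allclose(manual.to_numpy(), X_scaled))   # same numbers either way
print(X_scaled.mean(axis=0).round(6))             # column means
print(X_scaled.std(axis=0).round(6))              # column standard deviations
```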


Fitting K-Means

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)
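
Once fitted, the model can also describe its cluster centers and assign customers it has not seen before. The sketch below is one way to do both, assuming the same toy data; the new_customer row is a made-up example, and n_init=10 is set explicitly only so the result does not depend on the library version's default.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Centers live in scaled units; convert them back to the original units.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=customers.columns,
)
print(centers)

# A hypothetical new customer must be scaled with the SAME fitted scaler.
new_customer = pd.DataFrame({"Recency": [8], "Frequency": [110], "Monetary": [47000]})
new_label = kmeans.predict(scaler.transform(new_customer))[0]
print(new_label)
```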

Viewing Cluster Assignments

customers["Cluster"] = kmeans.labels_
print(customers)

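Example output (cluster numbers are arbitrary labels, so yours may be numbered or grouped differently):

Recency   Frequency   Monetary   Cluster
5         120         50000      0
10        100         45000      0
20        90          40000      0
200       20          5000       1
250       15          3000       1
300       10          2000       2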

Interpreting Clusters

Suppose:

Cluster 0

Low Recency
High Frequency
High Monetary

Interpretation:

Best Customers

Cluster 1

Moderate Activity

Interpretation:

Average Customers

Cluster 2

Old Purchases
Low Spending

Interpretation:

Dormant Customers
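
Once each cluster has an interpretation, the names can be attached to the data. The sketch below assumes the cluster numbers from the example output and the three interpretations above; the segment_names dictionary is hand-written, and because K-Means numbers clusters arbitrarily it would need to be rebuilt whenever the model is refitted.

```python
import pandas as pd

customers = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
    # Cluster labels copied from the example output, not recomputed here.
    "Cluster":   [0, 0, 0, 1, 1, 2],
})

# Hand-written mapping from cluster number to business interpretation.
segment_names = {
    0: "Best Customers",
    1: "Average Customers",
    2: "Dormant Customers",
}
customers["Segment"] = customers["Cluster"].map(segment_names)
print(customers[["Cluster", "Segment"]])
```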

Visualizing Clusters

import matplotlib.pyplot as plt

# Color each point by its cluster label
plt.scatter(customers["Recency"], customers["Monetary"], c=customers["Cluster"])
plt.xlabel("Recency")
plt.ylabel("Monetary")
plt.show()

Different colors represent different clusters.


Determining the Number of Clusters

One of the biggest challenges:

What value of K should we choose?

The Elbow Method

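Fit multiple values of K and record the inertia of each model (the sum of squared distances from every point to its nearest cluster center).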

inertia = []
# K cannot exceed the number of rows, so cap it for small datasets
max_k = min(10, len(X_scaled))
for k in range(1, max_k + 1):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_scaled)
    inertia.append(model.inertia_)

Plot:

plt.plot(range(1, max_k + 1), inertia)
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()

Understanding the Elbow Plot

The goal is to find:

A sharp bend

in the curve.

Example:

K = 3

may provide the best balance between:

  • Simplicity (few clusters)
  • Fit (low inertia)
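
The bend can also be read from numbers rather than by eye. The sketch below, run on synthetic data with three deliberately well-separated groups (generated with make_blobs purely for illustration), prints how much inertia falls each time K increases by one; the drop should shrink sharply once K passes the true number of groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(1, 7))
inertia = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
])

# Relative drop in inertia when moving from K to K + 1.
relative_drop = (inertia[:-1] - inertia[1:]) / inertia[:-1]
for k, drop in zip(ks[:-1], relative_drop):
    print(f"K={k} -> K={k + 1}: inertia falls by {drop:.0%}")
```

Real data rarely shows a bend this clean, so the elbow is a guide to be weighed alongside how interpretable the resulting clusters are.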

Cluster Profiles

Averaging each variable within each cluster is one of the most useful steps.

profile = customers.groupby("Cluster").mean(numeric_only=True)
print(profile)

Example:

Cluster   Recency   Frequency   Monetary
0         12        103         45000
1         225       18          4000
2         300       10          2000

This table helps explain each cluster.
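
A profile is easier to trust when it also shows how many customers sit in each cluster. The sketch below adds a size column using pandas named aggregation; the cluster labels are copied from the example output rather than recomputed, and the column name Customers is an illustrative choice.

```python
import pandas as pd

customers = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
    # Cluster labels copied from the example output, not recomputed here.
    "Cluster":   [0, 0, 0, 1, 1, 2],
})

# One row per cluster: its size plus the mean of each RFM variable.
profile = customers.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
)
print(profile.round(1))
```

A cluster containing a single customer, like Cluster 2 here, is worth a second look before it is treated as a segment.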


Healthcare Example

Suppose we have:

  • Age
  • BMI
  • Blood Pressure
  • Cholesterol

# patients is assumed to be a DataFrame that contains these columns
X = patients[["Age", "BMI", "BloodPressure", "Cholesterol"]]

Apply clustering:

X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, random_state=42)
patients["Cluster"] = kmeans.fit_predict(X_scaled)

Question:

Do patient groups exist?

Supply Chain Example

Suppose we have:

  • Inventory Days
  • Turn
  • Sales
  • Margin

# inventory is assumed to be a DataFrame that contains these columns
X = inventory[["InventoryDays", "Turn", "Sales", "Margin"]]

Cluster retailers:

X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=4, random_state=42)
inventory["Cluster"] = kmeans.fit_predict(X_scaled)

Possible output:

Cluster 0
Top Performers
Cluster 1
Growing Stores
Cluster 2
Declining Stores
Cluster 3
Low Activity Stores
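
Both examples repeat the same two moves: scale, then cluster. One way to keep them together is a scikit-learn pipeline, sketched below. Because the patients and inventory tables are not defined in this lesson, the sketch uses the RFM toy data as a stand-in, with n_clusters=2 chosen only because that data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the RFM toy customers from earlier in the lesson.
X = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
})

# Scaling and clustering bundled into a single object.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
print(labels)
```

Each dataset gets its own pipeline, so a scaler fitted on one table is never applied to another by accident.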

Hierarchical Clustering

Another clustering approach.

Instead of specifying:

K = 3

the algorithm builds a hierarchy of nested clusters, and the number of groups is chosen afterwards by cutting the tree.

Import:

from scipy.cluster.hierarchy import dendrogram, linkage

Fit:

linked = linkage(X_scaled, method="ward")

Plot:

dendrogram(linked)
plt.show()
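
The dendrogram shows the hierarchy, but it does not by itself assign a cluster label to each row. One way to get labels is SciPy's fcluster function, which cuts the tree into a requested number of groups. The sketch below uses six made-up, already-scaled points as a stand-in for X_scaled.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in for X_scaled: two tight groups of made-up points.
X_scaled = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

linked = linkage(X_scaled, method="ward")

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(linked, t=2, criterion="maxclust")
print(labels)   # fcluster numbers clusters starting at 1, not 0
```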

Advantages of K-Means

Fast

Works well on large datasets.


Easy to Understand

Simple concept.


Excellent for Segmentation

Well suited to segmenting:

  • Customers
  • Patients
  • Retailers
  • Products

Limitations

Must Choose K

Not always obvious.


Sensitive to Scaling

Always standardize first.


Assumes Spherical Clusters

Works best when clusters are roughly round and similar in size; it may miss elongated or irregular shapes.
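
The shape limitation is easy to see on data that is clearly grouped but not round. The sketch below uses scikit-learn's synthetic two-moons dataset, where the true groups are two interleaved crescents, and compares the K-Means labels with the true groups using the adjusted Rand index (1.0 means perfect agreement). The dataset and the metric are illustrative choices, not part of the lesson's workflow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two crescent-shaped groups that interlock.
X, true_groups = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Agreement between K-Means labels and the true crescents.
ari = adjusted_rand_score(true_groups, labels)
print(round(ari, 2))
```

K-Means splits the plane with a straight boundary, so it cuts across both crescents instead of following them.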


Typical Analyst Workflow

Step 1

Select variables.

X = df[["Recency", "Frequency", "Monetary"]]

Step 2

Scale.

StandardScaler()

Step 3

Determine K.

Elbow Method

Step 4

Fit K-Means.

KMeans()

Step 5

Profile clusters.

groupby("Cluster")

Step 6

Interpret business meaning.
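
Steps 1, 2, 4 and 5 can be wrapped in one helper, leaving the choice of K (Step 3) and the interpretation (Step 6) to the analyst. The sketch below is one possible arrangement; the function name segment and its parameters are hypothetical, and K is passed in rather than chosen automatically.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def segment(df, columns, k, random_state=42):
    """Scale the chosen columns, fit K-Means, return labeled data and a profile."""
    X = df[columns]                                        # Step 1: select variables
    X_scaled = StandardScaler().fit_transform(X)           # Step 2: scale
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labeled = df.copy()
    labeled["Cluster"] = model.fit_predict(X_scaled)       # Step 4: fit K-Means
    profile = labeled.groupby("Cluster")[columns].mean()   # Step 5: profile clusters
    return labeled, profile

df = pd.DataFrame({
    "Recency":   [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary":  [50000, 45000, 40000, 5000, 3000, 2000],
})

# Step 3 (choosing K) is done by the analyst; K=2 is assumed here.
labeled, profile = segment(df, ["Recency", "Frequency", "Monetary"], k=2)
print(labeled)
print(profile)
```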


Real-World Example: RFM Segmentation

This is one of the most common business uses of clustering.

Variables:

  • Recency
  • Frequency
  • Monetary

The algorithm often discovers:

VIP Customers
Regular Customers
At-Risk Customers
Dormant Customers

without predefined labels.


Practical Healthcare Exercise

Cluster patients using:

  • Age
  • BMI
  • Blood Pressure
  • Cholesterol

Questions:

  • How many patient groups exist?
  • Which cluster has the highest risk profile?

Practical Supply Chain Exercise

Cluster customers using:

  • Recency
  • Frequency
  • Monetary

Questions:

  • Which customers are VIPs?
  • Which customers need reactivation?
  • Which customers have growth potential?

Lesson Summary

In this lesson we learned:

  • Unsupervised Learning
  • K-Means Clustering
  • Feature Scaling
  • Cluster Assignment
  • Cluster Profiling
  • Elbow Method
  • Hierarchical Clustering
  • Customer Segmentation
  • Healthcare Applications
  • Supply Chain Applications

Clustering is one of the most powerful exploratory tools because it reveals hidden structure in data without requiring labeled outcomes.

In the next lesson we will study Time Series Forecasting, where we learn how to predict future values such as sales, demand, inventory levels, and hospital admissions.
