Lesson 14: Clustering: Discovering Hidden Groups in Data

pj316


Introduction

So far in this course we have studied supervised learning.

Supervised learning means:

We know the answer.

Examples:

Inputs	Target
Age, BMI	Length of Stay
Inventory, Price	Sales
Customer Features	Churn

The model learns from known outcomes.

But what if no outcome exists?

Suppose we have customer data:

Customer	Recency	Frequency	Monetary
A	5	120	50000
B	300	10	3000
C	20	200	100000

Question:

Are there natural customer groups?

We do not know the answer beforehand.

This is called:

Unsupervised Learning

One of the most important unsupervised methods is:

Clustering

What is Clustering?

Clustering attempts to find groups of similar observations.

Example:

			
Customer A
Customer B
Customer C
Customer D

The algorithm may discover:

			
Cluster 1
High Value Customers

			
Cluster 2
Occasional Customers

			
Cluster 3
Dormant Customers

without being told these groups exist.

Real-World Applications

Healthcare

Identify:

High-risk patients
Patient subtypes
Disease phenotypes

Supply Chain

Identify:

Fast-moving SKUs
Slow-moving SKUs
High-value customers
Similar retailers

Marketing

Identify:

Customer segments
Purchasing behaviors
Loyalty groups

The Most Popular Clustering Algorithm

The most common clustering method is:

K-Means Clustering

The “K” represents:

Number of clusters

Example:

K = 3

means:

Find 3 groups

The Basic Idea

Suppose we have customers.

The algorithm:

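Step 1

Chooses K initial cluster centers (centroids), here one each for:

Cluster A
Cluster B
Cluster C

Step 2

Assigns each customer to the nearest cluster center.

Step 3

Recalculates each center as the mean of the customers assigned to it.

Step 4

Repeats Steps 2 and 3 until the assignments stop changing.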
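The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements K-Means: the function name simple_kmeans, the random choice of starting centers, and the toy points are all assumptions made for the example.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones K-Means loop, for illustration only."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two obvious groups of 2-D points (made-up data).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = simple_kmeans(points, k=2)
print(labels)
```

The first three points should share one label and the last three the other. Which group is called 0 and which is called 1 depends on the starting centers.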

Example Dataset

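Suppose we have RFM (Recency, Frequency, Monetary) metrics for six customers. Recency is read here as days since the last purchase, so a low value means a recent purchase.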

			
import pandas as pd
customers = pd.DataFrame({
    "Recency":[
        5,
        10,
        20,
        200,
        250,
        300
    ],
    "Frequency":[
        120,
        100,
        90,
        20,
        15,
        10
    ],
    "Monetary":[
        50000,
        45000,
        40000,
        5000,
        3000,
        2000
    ]
})

		

Why Scaling Matters

Variables are often measured on different scales.

Example:

			
Recency:
0 - 365

			
Frequency:
0 - 1000

			
Monetary:
0 - 1000000

Without scaling:

			
Monetary dominates
the clustering
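A quick check shows why. K-Means compares customers by Euclidean distance, so the feature with the largest raw numbers contributes most of that distance. The two customers below are made up for the illustration.

```python
import numpy as np

# Two customers as (Recency, Frequency, Monetary); values are made up.
a = np.array([5.0, 120.0, 50000.0])
b = np.array([300.0, 10.0, 48000.0])

# Squared difference per feature, and each feature's share of the total.
squared_diff = (a - b) ** 2
share = squared_diff / squared_diff.sum()

for name, s in zip(["Recency", "Frequency", "Monetary"], share):
    print(f"{name}: {s:.1%} of the squared distance")
```

These two customers differ sharply in Recency and Frequency, yet nearly all of the distance between them comes from Monetary.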

Standardizing Variables

			
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(
    customers
)

		

Now every variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculation.
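Standardizing is the z-score formula applied column by column: subtract the mean, then divide by the standard deviation. The sketch below checks that against StandardScaler on the Recency values alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

recency = np.array([[5.0], [10.0], [20.0], [200.0], [250.0], [300.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (recency - recency.mean(axis=0)) / recency.std(axis=0)

scaled = StandardScaler().fit_transform(recency)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```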

Fitting K-Means

			
from sklearn.cluster import KMeans
kmeans = KMeans(
    n_clusters=3,
    random_state=42
)
kmeans.fit(X_scaled)

		

Viewing Cluster Assignments

			
customers["Cluster"] = (
    kmeans.labels_
)
print(customers)

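Example output (cluster numbers are arbitrary labels, so the numbering and the exact split may differ on your machine):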

Recency	Frequency	Monetary	Cluster
5	120	50000	0
10	100	45000	0
20	90	40000	0
200	20	5000	1
250	15	3000	1
300	10	2000	2

Interpreting Clusters

Suppose:

Cluster 0

			
Low Recency
High Frequency
High Monetary

Interpretation:

Best Customers

Cluster 1

Moderate Activity

Interpretation:

Average Customers

Cluster 2

			
Old Purchases
Low Spending

Interpretation:

Dormant Customers
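One way to support these interpretations is to look at the cluster centers in the original units. The fitted centers are stored in standardized units, so the sketch below maps them back with the scaler's inverse_transform. It rebuilds the example dataset so that it runs on its own; the features list is a name introduced for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["Recency", "Frequency", "Monetary"]
customers = pd.DataFrame({
    "Recency": [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary": [50000, 45000, 40000, 5000, 3000, 2000],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# cluster_centers_ are in standardized units; inverse_transform maps
# them back to days, purchase counts, and currency.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features,
)
print(centers.round(1))
```

Each row is the "typical" customer of one cluster, which is usually easier to name than a list of individual members.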

Visualizing Clusters

			
import matplotlib.pyplot as plt
plt.scatter(
    customers["Recency"],
    customers["Monetary"],
    c=customers["Cluster"]
)
plt.xlabel("Recency")
plt.ylabel("Monetary")
plt.show()

		

Different colors represent different clusters.

Determining the Number of Clusters

One of the biggest challenges:

			
What value of K
should we choose?

The Elbow Method

Fit multiple values of K.

			
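# K cannot exceed the number of observations (6 in this dataset),
# so the range is tied to the data instead of being fixed at 1-10.
k_values = range(1, len(X_scaled) + 1)
inertia = []
for k in k_values:
    model = KMeans(
        n_clusters=k,
        random_state=42
    )
    model.fit(X_scaled)
    inertia.append(
        model.inertia_
    )

Plot:

plt.plot(
    k_values,
    inertia
)
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()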

		

Understanding the Elbow Plot

The goal is to find:

A sharp bend

in the curve.

Example:

K = 3

may provide the best balance between:

Simplicity (fewer clusters to explain)
Fit (lower inertia, meaning tighter clusters)
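Inertia, the quantity on the vertical axis of the elbow plot, is the sum of squared distances from each point to its assigned cluster center. The sketch below recomputes it by hand on made-up data and compares it with the inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data: three loose groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(20, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Inertia: sum of squared distances from each point to its assigned center.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```

Adding clusters can only lower this number, which is why we look for the bend in the curve and not for the minimum.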

Cluster Profiles

Profiling each cluster by its average values is one of the most useful steps.

			
customers.groupby(
    "Cluster"
).mean(
    numeric_only=True
)

		

Example:

Cluster	Recency	Frequency	Monetary
0	12	103	45000
1	225	18	4000
2	300	10	2000

This table helps explain each cluster.
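It often helps to add the cluster size to the profile, since a segment with one customer reads very differently from a segment with thousands. A small sketch with named aggregations follows; the cluster labels are copied from the example output above and are illustrative only.

```python
import pandas as pd

# Cluster labels copied from the example output (illustrative).
customers = pd.DataFrame({
    "Recency": [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary": [50000, 45000, 40000, 5000, 3000, 2000],
    "Cluster": [0, 0, 0, 1, 1, 2],
})

# One row per cluster: how many customers, plus the mean of each metric.
profile = customers.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
)
print(profile.round(1))
```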

Healthcare Example

Suppose we have:

Age
BMI
Blood Pressure
Cholesterol

			
# patients is assumed to be a DataFrame that already contains these columns.
X = patients[
    [
        "Age",
        "BMI",
        "BloodPressure",
        "Cholesterol"
    ]
]

		

Apply clustering:

			
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(
    n_clusters=3,
    random_state=42
)
patients["Cluster"] = (
    kmeans.fit_predict(
        X_scaled
    )
)

		

Question:

Do patient groups exist?

Supply Chain Example

Suppose we have:

Inventory Days
Turn
Sales
Margin

			
# inventory is assumed to be a DataFrame with one row per retailer.
X = inventory[
    [
        "InventoryDays",
        "Turn",
        "Sales",
        "Margin"
    ]
]

		

Cluster retailers:

			
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(
    n_clusters=4,
    random_state=42
)
inventory["Cluster"] = (
    kmeans.fit_predict(
        X_scaled
    )
)

		

Possible output:

			
Cluster 0
Top Performers

			
Cluster 1
Growing Stores

			
Cluster 2
Declining Stores

			
Cluster 3
Low Activity Stores

Hierarchical Clustering

Hierarchical clustering is another clustering approach.

Instead of specifying:

K = 3

the algorithm builds a hierarchy.

Import:

			
from scipy.cluster.hierarchy import (
    dendrogram,
    linkage
)

Fit:

			
linked = linkage(
    X_scaled,
    method="ward"
)

Plot:

			
dendrogram(linked)
plt.show()
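The dendrogram shows the full hierarchy. To get one cluster label per row, the tree can be cut with SciPy's fcluster. The sketch below uses made-up data and asks for at most two clusters; note that fcluster labels start at 1, not 0.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up 2-D data with two obvious groups, already on comparable scales.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])

linked = linkage(X, method="ward")

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(linked, t=2, criterion="maxclust")
print(labels)
```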

Advantages of K-Means

Fast

Works well on large datasets.

Easy to Understand

Simple concept.

Excellent for Segmentation

Perfect for:

Customers
Patients
Retailers
Products

Limitations

Must Choose K

Not always obvious.

Sensitive to Scaling

Standardize first whenever the variables are measured on different scales.

Assumes Spherical Clusters

May miss complex patterns.
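The last limitation is easy to see on data that is not made of round blobs. The sketch below uses scikit-learn's make_moons generator, which produces two interleaved crescents, and scores the K-Means result against the known groups with the adjusted Rand index (1.0 would be a perfect match). The dataset and the metric are chosen for this illustration and are not part of the lesson's examples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped groups; true_groups is known only because the
# data is synthetic.
X, true_groups = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Agreement between the K-Means labels and the true crescents.
ari = adjusted_rand_score(true_groups, labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

K-Means tends to cut this data with a straight boundary, so the score should come out well below 1.0.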

Typical Analyst Workflow

Step 1

Select variables.

			
X = df[
    ["Recency",
     "Frequency",
     "Monetary"]
]

		

Step 2

Scale.

StandardScaler()

Step 3

Determine K.

Elbow Method

Step 4

Fit K-Means.

KMeans()

Step 5

Profile clusters.

groupby("Cluster")

Step 6

Interpret business meaning.
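Steps 1, 2, 4, and 5 can be combined into one small helper. This is a sketch under assumptions: the function name segment is hypothetical, K is passed in (Step 3 is still done separately), and the data is the six-customer example from earlier in the lesson.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def segment(df, features, k, seed=42):
    """Scale, cluster, and profile in one pass (hypothetical helper)."""
    pipeline = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, n_init=10, random_state=seed),
    )
    out = df.copy()  # leave the caller's DataFrame untouched
    out["Cluster"] = pipeline.fit_predict(out[features])
    profile = out.groupby("Cluster")[features].mean()
    return out, profile

df = pd.DataFrame({
    "Recency": [5, 10, 20, 200, 250, 300],
    "Frequency": [120, 100, 90, 20, 15, 10],
    "Monetary": [50000, 45000, 40000, 5000, 3000, 2000],
})
labeled, profile = segment(df, ["Recency", "Frequency", "Monetary"], k=2)
print(profile)
```

Selecting the feature columns explicitly also keeps a previously added Cluster column from being scaled and clustered by accident.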

Real-World Example: RFM Segmentation

This is one of the most common business uses of clustering.

Variables:

Recency
Frequency
Monetary

The algorithm often discovers:

VIP Customers

Regular Customers

At-Risk Customers

Dormant Customers

without predefined labels.
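In practice the three RFM variables usually have to be derived from a transaction log first. The sketch below shows one common way to do that with pandas. The table, its column names (customer_id, order_date, amount), and the snapshot date are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical transaction log; column names are assumptions.
transactions = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-06-15",
        "2024-02-10", "2024-02-20", "2024-03-10",
    ]),
    "amount": [200.0, 300.0, 50.0, 400.0, 150.0, 250.0],
})

snapshot = pd.Timestamp("2024-03-31")  # date the analysis is run

rfm = transactions.groupby("customer_id").agg(
    # Recency: days between the snapshot and the most recent order.
    Recency=("order_date", lambda d: (snapshot - d.max()).days),
    # Frequency: number of orders.
    Frequency=("order_date", "count"),
    # Monetary: total amount spent.
    Monetary=("amount", "sum"),
)
print(rfm)
```

The resulting table has one row per customer and can be scaled and clustered like the example dataset.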

Practical Healthcare Exercise

Cluster patients using:

Age
BMI
Blood Pressure
Cholesterol

Questions:

How many patient groups exist?
Which cluster has the highest risk profile?

Practical Supply Chain Exercise

Cluster customers using:

Recency
Frequency
Monetary

Questions:

Which customers are VIPs?
Which customers need reactivation?
Which customers have growth potential?

Lesson Summary

In this lesson we learned:

Unsupervised Learning
K-Means Clustering
Feature Scaling
Cluster Assignment
Cluster Profiling
Elbow Method
Hierarchical Clustering
Customer Segmentation
Healthcare Applications
Supply Chain Applications

Clustering is one of the most powerful exploratory tools because it reveals hidden structure in data without requiring labeled outcomes.

In the next lesson we will study Time Series Forecasting, where we learn how to predict future values such as sales, demand, inventory levels, and hospital admissions.
