Introduction

So far, most of the models we have studied have been statistical models:

Linear Regression
Logistic Regression
Poisson Regression
Negative Binomial Regression
Mixed Models

These models are extremely useful because they are interpretable.

However, real-world relationships are often complex.

For example:

Healthcare

A patient’s length of stay may depend on:

Age
Blood Pressure
Heart Rate
Diabetes
Smoking Status
Previous Admissions

The relationship is rarely a simple straight line.

Supply Chain

Annual SKU sales may depend on:

Inventory
Price
Customer Turn
Shape
Color Grade
Clarity Grade
Market Conditions

Again, these relationships are often nonlinear.

Random Forests were designed to automatically learn these complex patterns.

They are among the most popular machine learning algorithms used by data scientists today.

What is a Random Forest?

A Random Forest is an ensemble model.

Instead of building one decision tree, it builds hundreds of trees and combines their predictions.

Think of it like asking 500 experts for an opinion instead of asking only one.

			
Single Tree
    ↓
One Opinion

Random Forest
    ↓
Hundreds of Opinions
    ↓
Combined Prediction

		

This generally leads to:

Better accuracy
Better generalization
Less overfitting
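
The "combined prediction" idea can be checked directly. The sketch below uses a small made-up dataset (the data and all variable names are illustrative assumptions) and shows that a scikit-learn regression forest predicts the average of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: 200 rows, 2 numeric features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in forest.estimators_
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

forest_pred = forest.predict(X_demo[:5])   # prediction of the whole forest
mean_of_trees = tree_preds.mean(axis=0)    # average over the 100 trees

print(forest_pred)
print(mean_of_trees)
```

The two printed arrays should agree. For classification, scikit-learn combines the trees by averaging their predicted class probabilities.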

Understanding Decision Trees

A Random Forest is built from Decision Trees.

Suppose we want to predict:

Sold vs Not Sold

A decision tree might learn:

DaysOut < 180 ?

If Yes:

Likely Sold

If No:

Check Price

If:

Price < 2000

Then:

Likely Sold

Otherwise:

Likely Not Sold

The tree keeps splitting data into increasingly homogeneous groups.
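
As a rough illustration, the sketch below fits a single decision tree to a handful of invented inventory records that follow the rules described above, then prints the splits it learned. The data values are assumptions made up for this example, and the tree may choose different split points or a different split order than the hand-written rules.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical inventory records (values invented for illustration)
items = pd.DataFrame({
    "DaysOut": [30, 90, 150, 200, 250, 300, 220, 400],
    "Price":   [2500, 1800, 3000, 1500, 1900, 2500, 3200, 1200],
    "Sold":    [1, 1, 1, 1, 1, 0, 0, 1],   # 1 = Sold, 0 = Not Sold
})

X_items = items[["DaysOut", "Price"]]
y_items = items["Sold"]

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_items, y_items)

# Print the learned if/else rules as text
print(export_text(tree, feature_names=list(X_items.columns)))

tree_predictions = tree.predict(X_items)
```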

Why One Tree is Not Enough

Suppose a tree memorizes the training data.

Training Accuracy:

100%

Looks amazing.

But on new data:

60%

This is called overfitting.

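Random Forests reduce overfitting by training each tree on a random bootstrap sample of the rows, typically considering only a random subset of features at each split, and then averaging the trees' predictions.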
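
The gap between training and test accuracy can be seen on synthetic data. The sketch below is only an illustration: the dataset comes from scikit-learn's make_classification with some label noise, and the exact numbers depend on the random seed. A fully grown single tree fits the training data perfectly, and the forest usually scores higher on the held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y adds label noise, so memorizing the training set hurts
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

tree_train_acc = single_tree.score(X_tr, y_tr)
tree_test_acc = single_tree.score(X_te, y_te)
forest_test_acc = forest.score(X_te, y_te)

print(f"Single tree: train={tree_train_acc:.2f}, test={tree_test_acc:.2f}")
print(f"Forest     : test={forest_test_acc:.2f}")
```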

Regression vs Classification

Random Forests can solve two types of problems.

Random Forest Regression

Used when the outcome is numeric.

Examples:

Revenue
Sales
Length of Stay
Inventory Value

Random Forest Classification

Used when the outcome is categorical.

Examples:

Sold vs Not Sold
Readmitted vs Not Readmitted
Churn vs No Churn

Creating a Dataset

			
import pandas as pd
data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150, 175, 200],
    "Price": [1000, 1200, 1300, 1500, 1700, 1900, 2100],
    "Sales": [80, 100, 140, 170, 200, 220, 260]
})
print(data)

		

Defining Features and Target

Features are predictors.

X = data[["Inventory", "Price"]]

Target is the outcome.

y = data["Sales"]

Train-Test Split

Machine learning models should always be evaluated on unseen data.

			
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)

		

Building a Random Forest Regression Model

			
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)

		

Making Predictions

			
predictions = rf.predict(X_test)
print(predictions)

Example output:

[181.4 252.8]

These are predicted sales values.

Measuring Model Performance

A common metric for regression models is R² (the coefficient of determination).

			
from sklearn.metrics import r2_score
r2 = r2_score(
    y_test,
    predictions
)
print(r2)

		

Example:

0.87

Interpretation:

The model explains about 87% of the variation in the outcome on the test data.
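
To see where that number comes from, R² can be computed by hand as 1 minus the ratio of unexplained variation to total variation. The sketch below uses made-up actual and predicted values (assumptions for illustration only) and compares the manual result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted sales for five test rows
actual = np.array([120.0, 150.0, 180.0, 210.0, 240.0])
predicted = np.array([125.0, 140.0, 185.0, 200.0, 250.0])

ss_res = np.sum((actual - predicted) ** 2)        # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(actual, predicted))
```

Both lines should print the same value.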

Healthcare Example

Suppose we want to predict:

Length Of Stay

Dataset:

			
X = patients[
    [
        "Age",
        "BMI",
        "BloodPressure",
        "HeartRate"
    ]
]
y = patients["LengthOfStay"]

		

Fit the model:

			
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf.fit(X, y)

		

Question:

Which patients are likely to stay longer?

Random Forest Classification

Suppose we want to predict:

Sold

where:

			
1 = Sold
0 = Not Sold

Dataset:

			
X = inventory[
    [
        "DaysOut",
        "Price",
        "CaratWeight"
    ]
]
y = inventory["Sold"]

		

Building a Classification Model

			
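from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Hold out 20% of the inventory rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
rf = RandomForestClassifier(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)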

		

Predicting Classes

			
predictions = rf.predict(X_test)
print(predictions)

Example output:

[1 0 1 1]

Meaning:

			
Sold
Not Sold
Sold
Sold

Predicting Probabilities

Probabilities are often more useful than class labels.

			
probabilities = rf.predict_proba(X_test)
print(probabilities)

Example output:

			
[
 [0.10, 0.90],
 [0.80, 0.20]
]

Interpretation (the second column is the probability of class 1, Sold):

Observation 1:

90% probability of sale

Observation 2:

20% probability of sale
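
The column order of predict_proba follows the model's classes_ attribute, so it is worth checking before reading off a probability. The sketch below uses synthetic data as a stand-in for the inventory table (an assumption for illustration) and also shows applying a custom cutoff, here an arbitrary 0.60, instead of the default class prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the inventory data (labels are 0 and 1)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)

proba = clf.predict_proba(X_syn[:4])
print(clf.classes_)   # column order used by predict_proba

# Locate the column for class 1 ("Sold") instead of assuming its position
sold_col = list(clf.classes_).index(1)
p_sold = proba[:, sold_col]

# Apply a custom cutoff (0.60 is an arbitrary example value)
flag_60 = (p_sold >= 0.60).astype(int)
print(p_sold, flag_60)
```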

Classification Accuracy

			
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(
    y_test,
    predictions
)
print(accuracy)

		

Example:

0.89

Interpretation:

89% correct classifications
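
Accuracy is simply the share of predictions that match the actual labels. The sketch below, using invented labels (assumptions for illustration), computes it by hand and also builds a confusion matrix, which shows which kind of mistake the model makes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels (1 = Sold, 0 = Not Sold)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

# Share of predictions that match the actual label
acc_manual = np.mean(y_true == y_pred)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

print(acc_manual, accuracy_score(y_true, y_pred))
print(cm)
```

The diagonal of the matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total.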

Feature Importance

Feature importance is one of the most useful outputs of a Random Forest.

Question:

Which variables matter most?

Extracting Feature Importance

			
importance = pd.DataFrame({
    "Variable": X.columns,
    "Importance": rf.feature_importances_
})
importance = importance.sort_values(
    "Importance",
    ascending=False
)
print(importance)

		

Example output (for a sales model with Inventory, Price, and CustomerTurn as features):

Variable	Importance
Inventory	0.52
Price	0.28
CustomerTurn	0.20
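
In scikit-learn the importances are normalized to sum to 1, which is why they can be read as shares. The sketch below checks this on synthetic sales data; the data-generating formula and the column values are assumptions invented for the example, so the importances it prints will not match the table above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: sales depend strongly on Inventory, weakly on Price,
# and not at all on CustomerTurn (an assumption for this demo)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Inventory": rng.uniform(50, 200, 300),
    "Price": rng.uniform(1000, 2100, 300),
    "CustomerTurn": rng.uniform(0, 5, 300),
})
sales = 1.2 * demo["Inventory"] + 0.02 * demo["Price"] + rng.normal(0, 5, 300)

model = RandomForestRegressor(n_estimators=200, random_state=1).fit(demo, sales)

imp = pd.Series(model.feature_importances_, index=demo.columns)
imp = imp.sort_values(ascending=False)

print(imp)
print("Total:", imp.sum())
```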

Visualizing Feature Importance

			
import matplotlib.pyplot as plt
importance.plot(
    x="Variable",
    y="Importance",
    kind="bar"
)
plt.title(
    "Feature Importance"
)
plt.show()

		

This chart quickly shows which variables drive predictions.

Healthcare Interpretation

Suppose feature importance shows:

Variable	Importance
Age	0.40
Blood Pressure	0.30
BMI	0.20
Heart Rate	0.10

Interpretation:

			
Age contributes most to the prediction.

Supply Chain Interpretation

Suppose:

Variable	Importance
Inventory	0.50
Customer Turn	0.25
Price	0.15
Shape	0.10

Interpretation:

			
Inventory is the strongest predictor of annual sales in this model.

Advantages of Random Forests

Handles Nonlinear Relationships

Linear Regression assumes:

Straight-line relationships

Random Forests do not.

Handles Interactions Automatically

Suppose:

			
Price matters only for certain inventory levels

Random Forests can learn that automatically.
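
A small simulation can make this concrete. In the sketch below, the rule that price affects sales only when inventory exceeds 100 is an assumption built into the synthetic data, as are all the numbers. A plain linear regression without an interaction term is compared against a Random Forest on held-out rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
inventory = rng.uniform(0, 200, n)
price = rng.uniform(1000, 2000, n)

# Assumed rule: price affects sales only when inventory exceeds 100
sales = np.where(inventory > 100, 2.0 * price, 0.0) + rng.normal(0, 50, n)

X_int = np.column_stack([inventory, price])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_int, sales, test_size=0.25, random_state=42
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# .score() returns R² on the held-out rows
linear_r2 = linear.score(X_te, y_te)
forest_r2 = forest.score(X_te, y_te)

print(f"Linear regression R2: {linear_r2:.2f}")
print(f"Random forest R2    : {forest_r2:.2f}")
```

The forest picks up the conditional effect without being told about it, while the linear model would need a manually added interaction term.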

Robust to Outliers

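Because trees split on thresholds rather than fitting a line, extreme predictor values have less influence than they do in linear regression.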

Requires Little Data Preparation

Usually no need for:

Scaling
Normalization
Transformations
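
The reason is that trees only care about the ordering of values within a feature, not their units. The sketch below is an illustration on synthetic data (all values are assumptions): it fits the same forest twice, once with price in dollars and once in cents, and compares the predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
inventory = rng.integers(50, 201, size=300)          # units in stock
price_dollars = rng.integers(1000, 2101, size=300)   # price in dollars
sales = 1.1 * inventory + 0.05 * price_dollars + rng.normal(0, 5, size=300)

# Same information, different scale for the price column
X_dollars = np.column_stack([inventory, price_dollars])
X_cents = np.column_stack([inventory, price_dollars * 100])

# Same random_state so both forests draw the same bootstrap samples
rf_dollars = RandomForestRegressor(n_estimators=100, random_state=7).fit(X_dollars, sales)
rf_cents = RandomForestRegressor(n_estimators=100, random_state=7).fit(X_cents, sales)

pred_dollars = rf_dollars.predict(X_dollars[:5])
pred_cents = rf_cents.predict(X_cents[:5])

print(pred_dollars)
print(pred_cents)
```

The two sets of predictions should match, since rescaling a feature does not change which rows fall on each side of a split.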

Works with Many Variables

Can easily handle datasets with many predictor variables.

Limitations

Harder to Interpret

Regression:

Coefficient = 2.5

Easy to explain.

Random Forest:

			
Hundreds of trees working together

Harder to explain.

Larger Models

Require more memory and computation.

Not Designed for Causal Inference

Random Forests answer:

What will happen?

Not:

Why did it happen?

Complete Random Forest Workflow

			
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
X = data[["Inventory", "Price"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)
rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print("R2:", r2_score(y_test, predictions))

		

Practical Healthcare Exercise

Predict:

Length Of Stay

using:

Age
BMI
Blood Pressure
Heart Rate
Smoking Status

Questions:

Which variables matter most?
Which patients are high risk?
How accurate are predictions?
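
One possible starting point for this exercise is sketched below. The patient records are invented for illustration, and the column names are assumptions. Because scikit-learn's forests expect numeric inputs, the text column Smoking Status is one-hot encoded with pandas before fitting.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical patient records (values invented for illustration)
patients_demo = pd.DataFrame({
    "Age": [34, 58, 71, 45, 63, 80, 29, 52],
    "BMI": [22.1, 30.5, 27.8, 24.3, 33.0, 26.4, 21.0, 29.9],
    "BloodPressure": [118, 140, 150, 125, 145, 160, 115, 135],
    "HeartRate": [70, 82, 88, 75, 90, 95, 68, 80],
    "SmokingStatus": ["Never", "Current", "Former", "Never",
                      "Current", "Former", "Never", "Current"],
    "LengthOfStay": [2, 5, 7, 3, 6, 9, 1, 4],
})

# One-hot encode the categorical column into 0/1 indicator columns
X_pat = pd.get_dummies(
    patients_demo.drop(columns="LengthOfStay"),
    columns=["SmokingStatus"],
    dtype=int,
)
y_pat = patients_demo["LengthOfStay"]

rf_pat = RandomForestRegressor(n_estimators=200, random_state=42)
rf_pat.fit(X_pat, y_pat)

print(X_pat.columns.tolist())
```

With real data, a train-test split as shown earlier would come before fitting so that accuracy is measured on unseen patients.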

Practical Supply Chain Exercise

Predict:

Annual SKU Sales

using:

Inventory
Price
Customer Turn
Shape
Color Grade
Clarity Grade

Questions:

Which variables drive sales?
Which SKUs are future top performers?
Which variables matter most?

Lesson Summary

In this lesson we learned:

Decision Trees
Random Forests
Regression Forests
Classification Forests
Train-Test Splits
Feature Importance
Prediction
Accuracy
R²
Healthcare Applications
Supply Chain Applications

Random Forests are one of the most powerful machine learning algorithms because they automatically learn complex nonlinear relationships while remaining relatively easy to use.

In the next lesson we will study Gradient Boosting and XGBoost, which often outperform Random Forests and are among the most successful predictive models in modern data science competitions and business analytics.

Lesson 12: Random Forests: Powerful Machine Learning for Prediction
