Lesson 12: Random Forests: Powerful Machine Learning for Prediction

Introduction

So far, most of the models we have studied have been statistical models:

  • Linear Regression
  • Logistic Regression
  • Poisson Regression
  • Negative Binomial Regression
  • Mixed Models

These models are extremely useful because they are interpretable.

However, real-world relationships are often complex.

For example:

Healthcare

A patient’s length of stay may depend on:

  • Age
  • Blood Pressure
  • Heart Rate
  • Diabetes
  • Smoking Status
  • Previous Admissions

The relationship is rarely a simple straight line.

Supply Chain

Annual SKU sales may depend on:

  • Inventory
  • Price
  • Customer Turn
  • Shape
  • Color Grade
  • Clarity Grade
  • Market Conditions

Again, these relationships are often nonlinear.

Random Forests were designed to automatically learn these complex patterns.

They are among the most popular machine learning algorithms used by data scientists today.


What is a Random Forest?

A Random Forest is an ensemble model.

Instead of building one decision tree, it builds hundreds of trees and combines their predictions.

Think of it like asking 500 experts for an opinion instead of asking only one.

Single Tree → One Opinion
Random Forest → Hundreds of Opinions → Combined Prediction

This generally leads to:

  • Better accuracy
  • Better generalization
  • Less overfitting

Understanding Decision Trees

A Random Forest is built from Decision Trees.

Suppose we want to predict:

Sold vs Not Sold

A decision tree might learn:

DaysOut < 180?
  Yes: Likely Sold
  No: check Price
    Price < 2000: Likely Sold
    Otherwise: Likely Not Sold

The tree keeps splitting data into increasingly homogeneous groups.
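
The example tree above can be written as a pair of nested conditions. This is a hand-coded sketch of the two splits shown in the text, not a tree learned from data; the function name predict_sold is hypothetical.

```python
def predict_sold(days_out, price):
    """Hand-coded version of the example tree (thresholds taken from the text)."""
    # First split: how long has the item been out?
    if days_out < 180:
        return "Likely Sold"
    # Second split, reached only when DaysOut >= 180: check the price.
    if price < 2000:
        return "Likely Sold"
    return "Likely Not Sold"


print(predict_sold(90, 3500))   # first split decides; price is never checked
print(predict_sold(240, 1500))  # second split decides
print(predict_sold(240, 2500))  # fails both conditions
```

A real decision tree chooses these variables and thresholds itself, picking at each step the split that makes the resulting groups most homogeneous.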


Why One Tree is Not Enough

Suppose a tree memorizes the training data.

Training Accuracy:

100%

Looks amazing.

But on new data:

60%

This is called overfitting.

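Random Forests reduce overfitting by training each tree on a different random sample of the data (and a random subset of features at each split) and then averaging the trees, so that the errors of individual trees partly cancel out.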
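
The gap between training and test performance can be seen on a small synthetic example. The data below (a sine curve plus noise) is an assumption made purely for illustration, and the exact scores depend on the data and the random seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a smooth curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# A fully grown single tree can memorize the noisy training rows.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# A forest averages many trees, each fit to a different bootstrap sample.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_train_r2 = tree.score(X_train, y_train)
tree_test_r2 = tree.score(X_test, y_test)
forest_test_r2 = forest.score(X_test, y_test)

print("Tree   train R2:", round(tree_train_r2, 3))
print("Tree   test  R2:", round(tree_test_r2, 3))
print("Forest test  R2:", round(forest_test_r2, 3))
```

On data like this the single tree typically scores close to 1.0 on the training rows and noticeably lower on the held-out rows, while the forest usually narrows that gap.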


Regression vs Classification

Random Forests can solve two types of problems.

Random Forest Regression

Used when the outcome is numeric.

Examples:

  • Revenue
  • Sales
  • Length of Stay
  • Inventory Value

Random Forest Classification

Used when the outcome is categorical.

Examples:

  • Sold vs Not Sold
  • Readmitted vs Not Readmitted
  • Churn vs No Churn

Creating a Dataset

import pandas as pd

data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150, 175, 200],
    "Price": [1000, 1200, 1300, 1500, 1700, 1900, 2100],
    "Sales": [80, 100, 140, 170, 200, 220, 260]
})
print(data)

Defining Features and Target

Features are predictors.

X = data[["Inventory", "Price"]]

Target is the outcome.

y = data["Sales"]

Train-Test Split

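Machine learning models should always be evaluated on data they did not see during training, because accuracy on the training data overstates how well the model will predict new cases.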

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)
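
It helps to check how many rows end up in each part of the split. The sketch below rebuilds the seven-row dataset from this lesson so that it runs on its own, then prints the shapes of the two parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same seven-row dataset as in the lesson, repeated so this block is self-contained.
data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150, 175, 200],
    "Price": [1000, 1200, 1300, 1500, 1700, 1900, 2100],
    "Sales": [80, 100, 140, 170, 200, 220, 260],
})
X = data[["Inventory", "Price"]]
y = data["Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# scikit-learn rounds the test share up: 20% of 7 rows becomes 2 test rows.
print("Training rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])
```

With only two test rows, any performance metric computed on this toy dataset will be very unstable; real projects need far more data in both parts.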

Building a Random Forest Regression Model

from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of trees in the forest
rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)

Making Predictions

predictions = rf.predict(X_test)
print(predictions)

Example output:

[181.4 252.8]

These are predicted sales values.


Measuring Model Performance

A common metric for regression models is R².

from sklearn.metrics import r2_score

r2 = r2_score(
    y_test,
    predictions
)
print(r2)

Example:

0.87

Interpretation:

The model explains about 87% of the variation in the test-set outcomes.
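
R² compares the model's squared errors with the squared errors of always predicting the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks the result against r2_score. The four values are made up for illustration and are not output from the model above; mean absolute error is included as a second metric expressed in the units of the outcome.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative values only (not output from the model above).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model misses
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

mae = mean_absolute_error(y_true, y_pred)

print("Manual R2:", r2_manual)
print("sklearn R2:", r2_score(y_true, y_pred))
print("MAE:", mae)
```

Here SS_res is 400 and SS_tot is 12500, so R² is about 0.968, and every prediction is off by 10 units, so the MAE is 10.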

Healthcare Example

Suppose we want to predict:

Length Of Stay

Dataset:

# patients is assumed to be a DataFrame that already contains these columns
X = patients[["Age", "BMI", "BloodPressure", "HeartRate"]]
y = patients["LengthOfStay"]

Fit the model:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
# Fitted on all rows for illustration; hold out a test set before judging accuracy
rf.fit(X, y)

Question:

Which patients are likely to stay longer?
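
One way to approach this question is to predict length of stay for held-out patients and rank them. The sketch below uses synthetic patient data invented for illustration (including the assumed link between age and length of stay); the column names follow the lesson, and PredictedStay is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic patient data, invented purely for illustration.
rng = np.random.default_rng(42)
n = 400
patients = pd.DataFrame({
    "Age": rng.integers(20, 90, size=n),
    "BMI": rng.normal(27, 4, size=n).round(1),
    "BloodPressure": rng.normal(125, 15, size=n).round(0),
    "HeartRate": rng.normal(75, 10, size=n).round(0),
})
# Assumed relationship: stays lengthen with age, plus random noise.
patients["LengthOfStay"] = (
    2 + 0.05 * patients["Age"] + rng.normal(0, 1, size=n)
).clip(lower=1)

features = ["Age", "BMI", "BloodPressure", "HeartRate"]
X_train, X_test, y_train, y_test = train_test_split(
    patients[features], patients["LengthOfStay"], test_size=0.25, random_state=42
)

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Rank held-out patients by predicted length of stay, longest first.
ranked = X_test.assign(PredictedStay=rf.predict(X_test)).sort_values(
    "PredictedStay", ascending=False
)
print(ranked.head())
```

The patients at the top of the ranking are the ones the model expects to stay longest.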

Random Forest Classification

Suppose we want to predict:

Sold

where:

1 = Sold
0 = Not Sold

Dataset:

# inventory is assumed to be a DataFrame that already contains these columns
X = inventory[["DaysOut", "Price", "CaratWeight"]]
y = inventory["Sold"]

Building a Classification Model

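from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split again, because X and y now refer to the inventory data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)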

Predicting Classes

predictions = rf.predict(X_test)
print(predictions)

Example output:

[1 0 1 1]

Meaning:

Sold
Not Sold
Sold
Sold

Predicting Probabilities

Probabilities are often more useful than class labels.

probabilities = rf.predict_proba(X_test)
print(probabilities)

Example output (first two rows; each row is [P(Not Sold), P(Sold)], following the order of rf.classes_):

[
[0.10, 0.90],
[0.80, 0.20]
]

Interpretation:

Observation 1:

90% probability of sale

Observation 2:

20% probability of sale
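
The classification steps can be run end to end on synthetic data. Everything about the dataset below is an assumption for illustration: the values are random, and the "Sold" label follows the example tree rule from earlier with 10% of labels flipped. The variable p_sold is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic inventory data, invented for illustration only.
rng = np.random.default_rng(7)
n = 500
inventory = pd.DataFrame({
    "DaysOut": rng.integers(1, 365, size=n),
    "Price": rng.uniform(500, 4000, size=n).round(2),
    "CaratWeight": rng.uniform(0.3, 2.5, size=n).round(2),
})
# Assumed rule plus 10% label noise: recent or cheaper items tend to sell.
sold = (inventory["DaysOut"] < 180) | (inventory["Price"] < 2000)
flip = rng.random(n) < 0.10
inventory["Sold"] = np.where(flip, ~sold, sold).astype(int)

X = inventory[["DaysOut", "Price", "CaratWeight"]]
y = inventory["Sold"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)           # one column per class
sold_col = list(rf.classes_).index(1)      # find the column for class 1 (Sold)
p_sold = proba[:, sold_col]

print(rf.classes_)
print(proba[:3].round(2))
print(p_sold[:3].round(2))
```

Looking up the column through rf.classes_ avoids guessing which column holds the probability of a sale.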

Classification Accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(
    y_test,
    predictions
)
print(accuracy)

Example:

0.89

Interpretation:

89% of the test observations were classified correctly.
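
Accuracy is simply the share of predictions that match the true labels. The sketch below uses ten made-up labels (not output from the model above) and also prints a confusion matrix, which shows what kind of mistakes were made.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Ten made-up labels for illustration: 1 = Sold, 0 = Not Sold.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

accuracy = accuracy_score(y_true, y_pred)
manual_accuracy = np.mean(y_true == y_pred)   # same quantity computed by hand

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("Accuracy:", accuracy)
print(cm)
```

Eight of the ten labels match, so accuracy is 0.8. The confusion matrix shows one Not Sold item predicted as Sold and one Sold item predicted as Not Sold.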

Feature Importance

Feature importance is one of the most useful outputs from a Random Forest.

Question:

Which variables matter most?

Extracting Feature Importance

importance = pd.DataFrame({
    "Variable": X.columns,
    "Importance": rf.feature_importances_
})
importance = importance.sort_values(
    "Importance",
    ascending=False
)
print(importance)

Example output:

Variable        Importance
Inventory       0.52
Price           0.28
CustomerTurn    0.20

Visualizing Feature Importance

import matplotlib.pyplot as plt

importance.plot(
    x="Variable",
    y="Importance",
    kind="bar"
)
plt.title("Feature Importance")
plt.show()

This chart quickly shows which variables the model relies on most for its predictions.


Healthcare Interpretation

Suppose feature importance shows:

Variable          Importance
Age               0.40
Blood Pressure    0.30
BMI               0.20
Heart Rate        0.10

Interpretation:

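Age contributes most to the model's predictions. Importance shows how much the model relies on a variable, not the direction of its effect.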

Supply Chain Interpretation

Suppose:

Variable         Importance
Inventory        0.50
Customer Turn    0.25
Price            0.15
Shape            0.10

Interpretation:

Inventory is the strongest predictor of annual sales in this model. This does not by itself show that raising inventory would increase sales.

Advantages of Random Forests

Handles Nonlinear Relationships

Linear Regression assumes:

Straight-line relationships

Random Forests do not.
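
The difference shows up clearly on a U-shaped relationship. The data below is synthetic (y is roughly x squared plus noise), chosen as an assumption because a straight line cannot follow a curve of that shape.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic U-shaped relationship (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)

# score() returns R2 on the held-out rows for both model types.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)

print("Linear regression test R2:", round(linear_r2, 3))
print("Random forest test R2:", round(forest_r2, 3))
```

The straight line explains almost none of the variation here, while the forest follows the curve without being told its shape. A linear model could also fit this data if a squared term were added by hand; the forest simply does not need that step.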


Handles Interactions Automatically

Suppose:

Price matters only
for certain inventory levels

Random Forests can learn that automatically.


Robust to Outliers

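Trees split on the ordering of feature values rather than their magnitude, so an extreme predictor value has less influence than it would in a linear regression. Extreme values of the outcome can still affect a regression forest.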


Requires Little Data Preparation

Usually no need for:

  • Scaling
  • Normalization
  • Transformations

Categorical variables such as Shape or Smoking Status still need to be encoded as numbers before scikit-learn can use them.

Works with Many Variables

Can easily handle:

10 variables
100 variables
1000 variables

Limitations

Harder to Interpret

Regression:

Coefficient = 2.5

Easy to explain.

Random Forest:

Hundreds of trees
working together

Harder to explain.


Larger Models

Require more memory and computation.


Not Designed for Causal Inference

Random Forests answer:

What will happen?

Not:

Why did it happen?

Complete Random Forest Workflow

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Assumes data is the DataFrame created earlier in this lesson
X = data[["Inventory", "Price"]]
y = data["Sales"]

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)

# Fit the forest on the training rows only
rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf.fit(X_train, y_train)

# Evaluate on the rows the model has not seen
predictions = rf.predict(X_test)
print("R2:", r2_score(y_test, predictions))

Practical Healthcare Exercise

Predict:

Length Of Stay

using:

  • Age
  • BMI
  • Blood Pressure
  • Heart Rate
  • Smoking Status

Questions:

  • Which variables matter most?
  • Which patients are high risk?
  • How accurate are predictions?

Practical Supply Chain Exercise

Predict:

Annual SKU Sales

using:

  • Inventory
  • Price
  • Customer Turn
  • Shape
  • Color Grade
  • Clarity Grade

Questions:

  • Which variables drive sales?
  • Which SKUs are future top performers?
  • How accurate are the predictions on unseen SKUs?
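
A possible starting point for this exercise is sketched below. The SKU data is synthetic and invented for illustration: the three Shape categories, the value ranges, and the assumed link between inventory, customer turn, and sales are all assumptions, and Color Grade and Clarity Grade are left out to keep the sketch short. The text column is one-hot encoded because scikit-learn forests need numeric inputs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic SKU data, invented for illustration; replace with real records.
rng = np.random.default_rng(3)
n = 600
skus = pd.DataFrame({
    "Inventory": rng.integers(10, 300, size=n),
    "Price": rng.uniform(500, 5000, size=n).round(2),
    "CustomerTurn": rng.uniform(0.5, 6.0, size=n).round(2),
    "Shape": rng.choice(["Round", "Oval", "Princess"], size=n),
})
# Assumed relationship: sales rise with inventory and turn, plus noise.
skus["AnnualSales"] = (
    0.8 * skus["Inventory"] + 15 * skus["CustomerTurn"] + rng.normal(0, 10, size=n)
)

# One-hot encode the text column so every input is numeric.
X = pd.get_dummies(skus.drop(columns="AnnualSales"), columns=["Shape"], dtype=float)
y = skus["AnnualSales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
test_r2 = r2_score(y_test, rf.predict(X_test))

print(importance)
print("Test R2:", round(test_r2, 3))
```

Because the synthetic sales were generated mostly from Inventory, that variable should come out on top here; with real data the ranking is something to discover rather than assume.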

Lesson Summary

In this lesson we learned:

  • Decision Trees
  • Random Forests
  • Regression Forests
  • Classification Forests
  • Train-Test Splits
  • Feature Importance
  • Prediction
  • Accuracy
  • Healthcare Applications
  • Supply Chain Applications

Random Forests are among the most widely used machine learning algorithms because they learn complex nonlinear relationships automatically while remaining relatively easy to use.

In the next lesson we will study Gradient Boosting and XGBoost, which often outperform Random Forests and are among the most successful predictive models in modern data science competitions and business analytics.
