Introduction
So far, most of the models we have studied have been statistical models:
- Linear Regression
- Logistic Regression
- Poisson Regression
- Negative Binomial Regression
- Mixed Models
These models are extremely useful because they are interpretable.
However, real-world relationships are often complex.
For example:
Healthcare
A patient’s length of stay may depend on:
- Age
- Blood Pressure
- Heart Rate
- Diabetes
- Smoking Status
- Previous Admissions
The relationship is rarely a simple straight line.
Supply Chain
Annual SKU sales may depend on:
- Inventory
- Price
- Customer Turn
- Shape
- Color Grade
- Clarity Grade
- Market Conditions
Again, these relationships are often nonlinear.
Random Forests were designed to automatically learn these complex patterns.
They are among the most popular machine learning algorithms used by data scientists today.
What is a Random Forest?
A Random Forest is an ensemble model.
Instead of building one decision tree, it builds hundreds of trees and combines their predictions.
Think of it like asking 500 experts for an opinion instead of asking only one.
Single Tree
↓
One Opinion

Random Forest
↓
Hundreds of Opinions
↓
Combined Prediction
This generally leads to:
- Better accuracy
- Better generalization
- Less overfitting
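To make the "combined prediction" idea concrete, here is a minimal sketch, assuming scikit-learn's `RandomForestRegressor` and a small synthetic dataset (the data and the names `X_demo`, `y_demo`, and `forest` are invented for illustration). For regression, the forest's prediction should match the average of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's individual prediction for the first five rows
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging the individual "opinions" reproduces the forest's prediction
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_pred))
```

For classification, scikit-learn combines trees by averaging their predicted class probabilities rather than the raw predictions.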
Understanding Decision Trees
A Random Forest is built from Decision Trees.
Suppose we want to predict:
Sold vs Not Sold
A decision tree might learn:
DaysOut < 180 ?
If Yes:
Likely Sold
If No:
Check Price
If:
Price < 2000
Then:
Likely Sold
Otherwise:
Likely Not Sold
The tree keeps splitting data into increasingly homogeneous groups.
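The rules above can be learned from data rather than written by hand. The sketch below assumes scikit-learn's `DecisionTreeClassifier` and a tiny hypothetical inventory table (all values are invented). The tree it learns need not use the same order or thresholds as the hand-written rules.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical inventory records, invented for illustration
items = pd.DataFrame({
    "DaysOut": [30, 60, 90, 120, 150, 200, 220, 250, 300, 350, 380, 400],
    "Price":   [2500, 1800, 3000, 1500, 2200, 2600, 1200, 3100, 1900, 2400, 1500, 2800],
    "Sold":    [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})

X_items = items[["DaysOut", "Price"]]
y_items = items["Sold"]

# Limit the depth so the tree stays small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_items, y_items)

# Print the learned splits as indented text rules
rules = export_text(tree, feature_names=["DaysOut", "Price"])
print(rules)
```

Each printed line is one split or leaf, so the output reads much like the if/then description above.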
Why One Tree is Not Enough
Suppose a tree memorizes the training data.
Training Accuracy:
100%
Looks amazing.
But on new data:
60%
This is called overfitting.
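Random Forests reduce overfitting by training each tree on a different random sample of the data (and a random subset of features at each split), then averaging the trees' predictions so that the errors of individual trees tend to cancel out.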
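The gap between training and test accuracy can be seen on synthetic data. This is a rough sketch, assuming scikit-learn's `make_classification` with deliberately noisy labels. The exact numbers depend on the data and the random seed, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y randomly flips a share of the labels
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# A single fully grown tree can memorize the training set
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

tree_train_acc = single_tree.score(X_tr, y_tr)
tree_test_acc = single_tree.score(X_te, y_te)
forest_test_acc = forest.score(X_te, y_te)

print(f"Single tree, training accuracy: {tree_train_acc:.2f}")
print(f"Single tree, test accuracy:     {tree_test_acc:.2f}")
print(f"Random forest, test accuracy:   {forest_test_acc:.2f}")
```

Comparing the first two printed numbers shows how much the single tree's training score overstates its performance on new data.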
Regression vs Classification
Random Forests can solve two types of problems.
Random Forest Regression
Used when the outcome is numeric.
Examples:
- Revenue
- Sales
- Length of Stay
- Inventory Value
Random Forest Classification
Used when the outcome is categorical.
Examples:
- Sold vs Not Sold
- Readmitted vs Not Readmitted
- Churn vs No Churn
Creating a Dataset
```python
import pandas as pd

data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150, 175, 200],
    "Price": [1000, 1200, 1300, 1500, 1700, 1900, 2100],
    "Sales": [80, 100, 140, 170, 200, 220, 260]
})

print(data)
```
Defining Features and Target
Features are predictors.
X = data[["Inventory", "Price"]]
Target is the outcome.
y = data["Sales"]
Train-Test Split
Machine learning models should always be evaluated on unseen data.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)
```
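It can help to check how many rows end up in each part of the split. The sketch below rebuilds the seven-row dataset from this lesson so that it runs on its own. With seven rows, a 20% test share is rounded up to two rows, which leaves five for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same seven-row dataset used in this lesson
data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150, 175, 200],
    "Price": [1000, 1200, 1300, 1500, 1700, 1900, 2100],
    "Sales": [80, 100, 140, 170, 200, 220, 260],
})
X = data[["Inventory", "Price"]]
y = data["Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Rows and columns in each part of the split
print(X_train.shape, X_test.shape)
```

A test set of two rows is far too small for a reliable performance estimate. It is used here only to keep the example readable.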
Building a Random Forest Regression Model
```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)

rf.fit(X_train, y_train)
```
Making Predictions
```python
predictions = rf.predict(X_test)

print(predictions)
```
Example output:
[181.4 252.8]
These are predicted sales values.
Measuring Model Performance
A common metric for regression models is R² (the coefficient of determination).
```python
from sklearn.metrics import r2_score

r2 = r2_score(
    y_test,
    predictions
)

print(r2)
```
Example:
0.87
Interpretation:
The model explains about 87% of the variation in the test-set outcome.
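To see where the number comes from, R² can be computed by hand as one minus the ratio of unexplained variation to total variation. The sketch below uses hypothetical actual and predicted values (invented for illustration, not output from the model above) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, invented for illustration
y_actual = np.array([100.0, 150.0, 200.0, 250.0])
y_predicted = np.array([110.0, 140.0, 210.0, 240.0])

# Unexplained variation: squared prediction errors
ss_res = np.sum((y_actual - y_predicted) ** 2)

# Total variation: squared distance from the mean of the actual values
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                        # 0.968
print(r2_score(y_actual, y_predicted))  # same value from scikit-learn
```

Because the baseline is "always predict the mean", R² can be negative on test data when a model does worse than that baseline.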
Healthcare Example
Suppose we want to predict:
Length Of Stay
Dataset:
```python
X = patients[
    [
        "Age",
        "BMI",
        "BloodPressure",
        "HeartRate"
    ]
]

y = patients["LengthOfStay"]
```
Fit the model:
```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)

rf.fit(X, y)
```
Question:
Which patients are likely to stay longer?
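One way to approach this question is to predict length of stay for new patients and sort by the prediction. The sketch below is only an illustration: the `patients` table is randomly generated, the formula behind `LengthOfStay` is made up, and `new_patients` is a hypothetical set of three admissions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic patient data, generated only so the example runs
rng = np.random.default_rng(1)
n = 300
patients = pd.DataFrame({
    "Age": rng.integers(20, 90, size=n),
    "BMI": rng.normal(27, 4, size=n),
    "BloodPressure": rng.normal(125, 15, size=n),
    "HeartRate": rng.normal(75, 10, size=n),
})
# Made-up relationship: older patients with higher blood pressure stay longer
patients["LengthOfStay"] = np.clip(
    1 + 0.06 * (patients["Age"] - 20)
    + 0.02 * (patients["BloodPressure"] - 125)
    + rng.normal(0, 0.5, size=n),
    0.5, None,
)

features = ["Age", "BMI", "BloodPressure", "HeartRate"]
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(patients[features], patients["LengthOfStay"])

# Hypothetical new admissions to score
new_patients = pd.DataFrame({
    "Age": [35, 62, 81],
    "BMI": [24.0, 31.5, 27.0],
    "BloodPressure": [118.0, 140.0, 150.0],
    "HeartRate": [70.0, 82.0, 76.0],
})
new_patients["PredictedStay"] = rf.predict(new_patients[features])

# Longest predicted stays first
ranked = new_patients.sort_values("PredictedStay", ascending=False)
print(ranked)
```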
Random Forest Classification
Suppose we want to predict:
Sold
where:
1 = Sold
0 = Not Sold
Dataset:
```python
X = inventory[
    [
        "DaysOut",
        "Price",
        "CaratWeight"
    ]
]

y = inventory["Sold"]
```
Building a Classification Model
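```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the inventory data so the classifier is evaluated on unseen rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=500,
    random_state=42
)

rf.fit(X_train, y_train)
```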
Predicting Classes
```python
predictions = rf.predict(X_test)

print(predictions)
```
Example output:
[1 0 1 1]
Meaning:
Sold
Not Sold
Sold
Sold
Predicting Probabilities
Probabilities are often more useful than class labels.
```python
probabilities = rf.predict_proba(X_test)

print(probabilities)
```
Example output (first two observations; each row is [probability of Not Sold, probability of Sold]):
[[0.10, 0.90], [0.80, 0.20]]
Interpretation:
Observation 1:
90% probability of sale
Observation 2:
20% probability of sale
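The column order of `predict_proba` follows the classifier's `classes_` attribute, so it is worth checking before reading off a probability. The sketch below uses synthetic data from `make_classification` as a stand-in for the inventory data. The 0.70 cutoff is an arbitrary assumption, chosen only to show how a custom threshold differs from the default class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the inventory data (labels are 0 and 1)
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_syn, y_syn)

proba = clf.predict_proba(X_syn[:5])

# Columns are ordered like clf.classes_, here [0, 1]
print(clf.classes_)

# Probability of class 1 ("Sold" in the lesson's coding)
p_sold = proba[:, 1]

# Flag only observations with at least a 70% predicted probability
likely_sold = (p_sold >= 0.70).astype(int)
print(p_sold)
print(likely_sold)
```

Working with probabilities this way lets the business choose how cautious the "likely sold" label should be.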
Classification Accuracy
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(
    y_test,
    predictions
)

print(accuracy)
```
Example:
0.89
Interpretation:
89% correct classifications
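Accuracy is simply the share of predictions that match the actual labels. The sketch below uses ten hypothetical labels (invented for illustration) to compute it by hand. It also prints a confusion matrix, which shows what kind of mistakes were made.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels (1 = Sold, 0 = Not Sold)
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred_demo = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

# Share of predictions that match the actual label
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, accuracy)  # 0.8 0.8

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

When one class is much rarer than the other, the confusion matrix is often more informative than accuracy alone.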
Feature Importance
Feature importance is one of the most useful outputs of a Random Forest.
Question:
Which variables matter most?
Extracting Feature Importance
```python
importance = pd.DataFrame({
    "Variable": X.columns,
    "Importance": rf.feature_importances_
})

importance = importance.sort_values(
    "Importance",
    ascending=False
)

print(importance)
```
Example output (illustrative, for a sales model trained on Inventory, Price, and CustomerTurn):
| Variable | Importance |
|---|---|
| Inventory | 0.52 |
| Price | 0.28 |
| CustomerTurn | 0.20 |
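Importances like those in the table are relative shares: in scikit-learn, `feature_importances_` is normalized to sum to 1. The sketch below checks this on synthetic data. The column values and the formula behind `sales` are assumptions made only for illustration, with `CustomerTurn` deliberately left unrelated to the outcome.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Inventory": rng.uniform(50, 200, 300),
    "Price": rng.uniform(1000, 2100, 300),
    "CustomerTurn": rng.uniform(0, 5, 300),
})
# Made-up outcome: driven mostly by Inventory, slightly by Price
sales = 1.2 * demo["Inventory"] + 0.03 * demo["Price"] + rng.normal(0, 5, 300)

rf_demo = RandomForestRegressor(n_estimators=200, random_state=42)
rf_demo.fit(demo, sales)

importance = pd.Series(
    rf_demo.feature_importances_, index=demo.columns
).sort_values(ascending=False)

print(importance)
print(importance.sum())  # importances are shares that add up to 1
```

An importance of 0.52 therefore means roughly half of the model's total split improvement came from that variable. It does not mean the variable explains 52% of sales.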
Visualizing Feature Importance
```python
import matplotlib.pyplot as plt

importance.plot(
    x="Variable",
    y="Importance",
    kind="bar"
)

plt.title("Feature Importance")

plt.show()
```
This chart quickly shows which variables drive predictions.
Healthcare Interpretation
Suppose feature importance shows:
| Variable | Importance |
|---|---|
| Age | 0.40 |
| Blood Pressure | 0.30 |
| BMI | 0.20 |
| Heart Rate | 0.10 |
Interpretation:
Age contributes most to the prediction.
Supply Chain Interpretation
Suppose:
| Variable | Importance |
|---|---|
| Inventory | 0.50 |
| Customer Turn | 0.25 |
| Price | 0.15 |
| Shape | 0.10 |
Interpretation:
Inventory is the strongest driver of annual sales.
Advantages of Random Forests
Handles Nonlinear Relationships
Linear Regression assumes:
Straight-line relationships
Random Forests do not.
Handles Interactions Automatically
Suppose:
Price matters only for certain inventory levels
Random Forests can learn that automatically.
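A small experiment can illustrate this. In the sketch below the outcome is the product of two synthetic features, so neither feature has a straight-line effect on its own. The data are invented, and the comparison with `LinearRegression` (with no interaction term added) is only meant to show the contrast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data where the outcome depends only on an interaction
rng = np.random.default_rng(0)
X_int = rng.uniform(-1, 1, size=(2000, 2))
y_int = X_int[:, 0] * X_int[:, 1] + rng.normal(0, 0.05, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_int, y_int, test_size=0.25, random_state=0
)

# Plain linear regression, with no interaction term supplied
linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

linear_r2 = r2_score(y_te, linear.predict(X_te))
forest_r2 = r2_score(y_te, forest.predict(X_te))

print(f"Linear regression R2: {linear_r2:.2f}")
print(f"Random forest R2:     {forest_r2:.2f}")
```

A linear model could capture the same pattern if the analyst added the product term by hand. The forest does not need to be told the interaction exists.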
Robust to Outliers
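Because trees split on thresholds rather than fitting a single line through all points, extreme feature values usually have less influence than in linear regression.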
Requires Little Data Preparation
Usually no need for:
- Scaling
- Normalization
- Transformations
Works with Many Variables
Can easily handle:
10 variables
100 variables
1000 variables
Limitations
Harder to Interpret
Regression:
Coefficient = 2.5
Easy to explain.
Random Forest:
Hundreds of trees working together
Harder to explain.
Larger Models
Require more memory and computation.
Not Designed for Causal Inference
Random Forests answer:
What will happen?
Not:
Why did it happen?
Complete Random Forest Workflow
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X = data[["Inventory", "Price"]]
y = data["Sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42
)

rf = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)

rf.fit(X_train, y_train)

predictions = rf.predict(X_test)

print("R2:", r2_score(y_test, predictions))
```
Practical Healthcare Exercise
Predict:
Length Of Stay
using:
- Age
- BMI
- Blood Pressure
- Heart Rate
- Smoking Status
Questions:
- Which variables matter most?
- Which patients are high risk?
- How accurate are predictions?
Practical Supply Chain Exercise
Predict:
Annual SKU Sales
using:
- Inventory
- Price
- Customer Turn
- Shape
- Color Grade
- Clarity Grade
Questions:
- Which variables drive sales?
- Which SKUs are future top performers?
- Which variables matter most?
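As a possible starting point for this exercise, note that Shape, Color Grade, and Clarity Grade are text categories, and scikit-learn's Random Forest expects numeric inputs. The sketch below assumes one-hot encoding with `pd.get_dummies`. The `skus` table, its category values, and the formula behind `AnnualSales` are all invented so that the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic SKU data, invented for illustration
rng = np.random.default_rng(7)
n = 400
skus = pd.DataFrame({
    "Inventory": rng.integers(10, 200, n),
    "Price": rng.uniform(500, 5000, n),
    "CustomerTurn": rng.uniform(0.5, 6, n),
    "Shape": rng.choice(["Round", "Oval", "Princess"], n),
    "ColorGrade": rng.choice(["D", "G", "J"], n),
    "ClarityGrade": rng.choice(["VS1", "SI1", "I1"], n),
})
# Made-up outcome so the model has something to learn
shape_bonus = skus["Shape"].map({"Round": 30, "Oval": 10, "Princess": 0})
skus["AnnualSales"] = (
    0.8 * skus["Inventory"] + 20 * skus["CustomerTurn"]
    + shape_bonus + rng.normal(0, 10, n)
)

# One-hot encode the text columns into 0/1 indicator columns
X_sku = pd.get_dummies(
    skus.drop(columns="AnnualSales"),
    columns=["Shape", "ColorGrade", "ClarityGrade"],
    dtype=float,
)
y_sku = skus["AnnualSales"]

X_train, X_test, y_train, y_test = train_test_split(
    X_sku, y_sku, test_size=0.20, random_state=42
)
rf = RandomForestRegressor(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Rank held-out SKUs by predicted annual sales
top_skus = pd.Series(
    rf.predict(X_test), index=X_test.index
).sort_values(ascending=False).head(5)
print(top_skus)
```

With one-hot encoding, each category level gets its own importance value, so the importances for one original variable are spread across several columns.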
Lesson Summary
In this lesson we learned:
- Decision Trees
- Random Forests
- Regression Forests
- Classification Forests
- Train-Test Splits
- Feature Importance
- Prediction
- Accuracy
- R²
- Healthcare Applications
- Supply Chain Applications
Random Forests are one of the most powerful machine learning algorithms because they automatically learn complex nonlinear relationships while remaining relatively easy to use.
In the next lesson we will study Gradient Boosting and XGBoost, which often outperform Random Forests and are among the most successful predictive models in modern data science competitions and business analytics.
