Introduction
Throughout this course, we have built increasingly sophisticated predictive models.
We learned:
- Linear Regression
- Logistic Regression
- Random Forests
- XGBoost
- Survival Analysis
- Bayesian Modeling
All of these models answer questions like:
What is likely to happen?
or
How strongly are two variables associated?
But there is a deeper question.
Did X actually cause Y?
This is the central question of causal inference.
Why Correlation Is Not Causation
Suppose we observe:
| Ice Cream Sales | Drowning Incidents |
|---|---|
| High | High |
| Low | Low |
A strong correlation exists.
Should we conclude:
Ice cream causes drowning?
Of course not.
A hidden variable exists:
Temperature
Hot weather causes:
- More ice cream sales
- More swimming
- More drowning incidents
This hidden variable is called a:
Confounder
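A small simulation can make this concrete. The sketch below uses made-up numbers and only the Python standard library: temperature drives both series, and ice cream is given no effect on drowning at all. The helper names `pearson` and `residuals` are illustrative, not from any library.

```python
import random
import statistics

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after a simple least-squares fit on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Hypothetical daily data: temperature is the only driver of both series.
temperature = [random.uniform(10, 35) for _ in range(2000)]
ice_cream = [20 * t + random.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1.5) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
adj_corr = pearson(residuals(ice_cream, temperature),
                   residuals(drownings, temperature))

print(f"raw correlation:            {raw_corr:.2f}")
print(f"after removing temperature: {adj_corr:.2f}")
```

The raw correlation is strong, but once the part explained by temperature is removed from both series, almost nothing remains.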
The Fundamental Problem
For each person:
We observe:
What happened
We do NOT observe:
What would have happened otherwise
Example:
Patient receives a treatment.
Outcome:
Recovered
Question:
Would they have recovered without treatment?
We never observe both realities simultaneously.
This is called:
The Fundamental Problem of Causal Inference
Potential Outcomes Framework
Suppose:
Treatment = 1
means:
Received Treatment
and
Treatment = 0
means:
No Treatment
Each individual has two potential outcomes:
Y(1)
Outcome if treated.
Y(0)
Outcome if untreated.
The causal effect is:
Y(1)-Y(0)
The problem:
We only observe one.
Average Treatment Effect
The most common causal quantity.
ATE=E[Y(1)-Y(0)]
Interpretation:
Average effect of treatment
Example:
Treatment reduces hospital stay by an average of:
2 days
Then:
ATE = -2 days
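To see how the pieces fit together, here is a toy sketch with four hypothetical patients whose potential outcomes are both written down. That is only possible in a made-up example: in real data one of the two columns is always missing.

```python
from statistics import fmean

# Hypothetical patients: hospital stay in days under each scenario.
# y1 = outcome if treated, y0 = outcome if untreated.
patients = [
    {"id": "P1", "y1": 4, "y0": 6, "treated": True},
    {"id": "P2", "y1": 5, "y0": 8, "treated": False},
    {"id": "P3", "y1": 3, "y0": 4, "treated": True},
    {"id": "P4", "y1": 6, "y0": 8, "treated": False},
]

# Individual causal effects Y(1) - Y(0), knowable only in a simulation.
effects = [p["y1"] - p["y0"] for p in patients]
ate = fmean(effects)

# What an analyst actually sees: one outcome per patient.
observed = [p["y1"] if p["treated"] else p["y0"] for p in patients]

print("individual effects:", effects)   # [-2, -3, -1, -2]
print("ATE:", ate)                      # -2.0
print("observed outcomes:", observed)   # [4, 8, 3, 8]
```

The individual effects differ from patient to patient, but their average is -2 days, matching the example above.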
Randomized Controlled Trials
The gold standard.
Example:
| Group | Treatment |
|---|---|
| A | Yes |
| B | No |
Random assignment ensures:
Groups are comparable on average
Therefore:
Differences in outcomes can be interpreted causally
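The sketch below simulates a trial to illustrate why this works. All numbers are assumptions: a true effect of -2 days, and a hypothetical `severity` variable that influences the outcome but never the coin-flip assignment.

```python
import random
from statistics import fmean

random.seed(1)

TRUE_EFFECT = -2.0  # assumed effect, in days of hospital stay

def simulate_trial(n):
    """Randomly assign n hypothetical patients; return the difference in mean stay."""
    treated, control = [], []
    for _ in range(n):
        severity = random.gauss(0, 1)      # affects the outcome, never the assignment
        baseline_stay = 7 + 2 * severity + random.gauss(0, 1)
        if random.random() < 0.5:          # coin-flip assignment
            treated.append(baseline_stay + TRUE_EFFECT)
        else:
            control.append(baseline_stay)
    return fmean(treated) - fmean(control)

estimate = simulate_trial(20000)
print(f"difference in means: {estimate:.2f} (true effect {TRUE_EFFECT})")
```

Because severity is balanced across the two groups on average, the simple difference in means lands close to the true effect without any adjustment.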
Healthcare Example
Question:
Does a new medication reduce mortality?
Randomly assign:
Treatment
or
Placebo
Compare outcomes.
Supply Chain Example
Question:
Does a new replenishment strategy increase sales?
Randomly assign stores:
New Strategy
vs
Old Strategy
Compare performance.
Observational Data Problems
Most analysts do NOT have randomized experiments.
Example:
Patients choose treatment
or
Stores choose inventory levels
This introduces:
Selection Bias
Confounding Variables
Suppose:
Higher Inventory
is associated with:
Higher Sales
Can we conclude:
Inventory causes sales?
Maybe not.
Possible confounder:
Store Quality
Better stores:
- Hold more inventory
- Generate more sales
Without controlling for store quality:
Causal conclusions may be biased
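A simulated version of the store example shows the size of the problem. This is a sketch under stated assumptions: store quality is a single good/not-good flag, the true inventory effect is 5, and all other numbers are invented.

```python
import random
from statistics import fmean

random.seed(2)

TRUE_INVENTORY_EFFECT = 5.0  # assumed sales lift from holding high inventory

stores = []
for _ in range(20000):
    good_store = random.random() < 0.5
    # Better stores are more likely to hold high inventory (the confounding path).
    high_inventory = random.random() < (0.8 if good_store else 0.2)
    sales = (50 + 20 * good_store
             + TRUE_INVENTORY_EFFECT * high_inventory
             + random.gauss(0, 5))
    stores.append((good_store, high_inventory, sales))

def mean_diff(rows):
    """Mean sales of high-inventory stores minus low-inventory stores."""
    high = [s for _, inv, s in rows if inv]
    low = [s for _, inv, s in rows if not inv]
    return fmean(high) - fmean(low)

naive = mean_diff(stores)

# Compare like with like: estimate within each quality stratum,
# then average the strata weighted by their share of stores.
adjusted = 0.0
for quality in (True, False):
    stratum = [row for row in stores if row[0] == quality]
    adjusted += mean_diff(stratum) * len(stratum) / len(stores)

print(f"naive difference:    {naive:.1f}")
print(f"adjusted difference: {adjusted:.1f} (true effect {TRUE_INVENTORY_EFFECT})")
```

The naive comparison credits inventory with much of the sales that store quality generated. Comparing stores of the same quality brings the estimate back near the assumed effect.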
Directed Acyclic Graphs (DAGs)
DAGs help visualize causal relationships.
Example:
Store Quality → Inventory → Sales
or, when store quality also affects sales directly:
Store Quality → Inventory
Store Quality → Sales
Inventory → Sales
These diagrams help identify:
- Confounders
- Mediators
- Colliders
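One way to reason about these three roles is to store the DAG as a dictionary of arrows and classify each variable relative to a treatment and an outcome. The sketch below uses simplified rules, and the `ShelfAvailability` and `AuditFlag` nodes are hypothetical additions made only to show a mediator and a collider.

```python
# Hypothetical DAG: each key points to the variables it directly causes.
dag = {
    "StoreQuality": ["Inventory", "Sales"],
    "Inventory": ["ShelfAvailability", "AuditFlag"],
    "ShelfAvailability": ["Sales"],
    "Sales": ["AuditFlag"],
    "AuditFlag": [],
}

def descendants(graph, node, blocked=None):
    """Variables reachable from node by following arrows, never passing through blocked."""
    seen, stack = set(), list(graph[node])
    while stack:
        current = stack.pop()
        if current == blocked or current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current])
    return seen

def classify(graph, treatment, outcome):
    """Label every other variable by its role relative to treatment and outcome."""
    after_treatment = descendants(graph, treatment)
    after_outcome = descendants(graph, outcome)
    roles = {}
    for node in graph:
        if node in (treatment, outcome):
            continue
        causes_treatment = treatment in descendants(graph, node)
        # Reaches the outcome by a route that does not go through the treatment.
        causes_outcome_directly = outcome in descendants(graph, node, blocked=treatment)
        if causes_treatment and causes_outcome_directly:
            roles[node] = "confounder"   # common cause of both
        elif node in after_treatment and outcome in descendants(graph, node):
            roles[node] = "mediator"     # on a path from treatment to outcome
        elif node in after_treatment and node in after_outcome:
            roles[node] = "collider"     # common effect of both
        else:
            roles[node] = "other"
    return roles

roles = classify(dag, "Inventory", "Sales")
print(roles)
```

Under these rules `StoreQuality` is a confounder to adjust for, while adjusting for the mediator or the collider would distort the estimate of the total inventory effect.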
Regression Adjustment
One of the simplest causal methods.
Example:
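import statsmodels.formula.api as smf

# df is assumed to be a pandas DataFrame with Sales, Inventory and StoreQuality columns.
# Including StoreQuality holds it fixed when estimating the Inventory coefficient.
model = smf.ols(
    "Sales ~ Inventory + StoreQuality",
    data=df
).fit()
print(model.summary())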
By controlling for:
StoreQuality
we attempt to isolate:
Inventory Effect
Propensity Scores
A powerful observational method.
Question:
How likely was someone to receive treatment?
Estimate:
P(Treatment=1|X)
using logistic regression.
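Example:
from sklearn.linear_model import LogisticRegression

# X holds the covariates and treatment the 0/1 treatment indicator (both assumed to exist).
ps_model = LogisticRegression()
ps_model.fit(X, treatment)

# Column 1 of predict_proba is P(Treatment=1|X).
propensity_scores = ps_model.predict_proba(X)[:, 1]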
These probabilities help create balanced groups.
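One common way to use the scores is inverse probability weighting. The sketch below is a simplified, standard-library illustration: it assumes a single binary covariate (`older`), so the propensity score can be estimated as the treated share within each group instead of fitting a logistic regression. All numbers are invented.

```python
import random
from statistics import fmean

random.seed(3)

TRUE_EFFECT = 2.0  # assumed treatment effect on the outcome

# Hypothetical patients with one binary covariate.
rows = []
for _ in range(20000):
    older = random.random() < 0.5
    treated = random.random() < (0.7 if older else 0.3)  # older patients treated more often
    outcome = 10 - 4 * older + TRUE_EFFECT * treated + random.gauss(0, 1)
    rows.append((older, treated, outcome))

# Propensity score P(Treatment=1 | X): the treated share within each group,
# standing in for a fitted logistic regression.
propensity = {
    group: fmean([1.0 if t else 0.0 for x, t, _ in rows if x == group])
    for group in (True, False)
}

# Inverse probability weighting: weight each unit by 1 / P(its own treatment status).
n = len(rows)
treated_mean = sum(y / propensity[x] for x, t, y in rows if t) / n
control_mean = sum(y / (1 - propensity[x]) for x, t, y in rows if not t) / n
ipw_ate = treated_mean - control_mean

naive = (fmean([y for _, t, y in rows if t])
         - fmean([y for _, t, y in rows if not t]))
print(f"naive: {naive:.2f}, IPW: {ipw_ate:.2f}, true: {TRUE_EFFECT}")
```

The naive comparison understates the effect because treated patients skew older. Reweighting rebalances the two groups on that covariate.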
Matching
Match similar observations.
Example:
Patient A:
Age = 60
BMI = 30
Treatment:
Yes
Find similar patient:
Age = 61
BMI = 29
Treatment:
No
Compare outcomes.
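A minimal nearest-neighbor version of this idea might look like the sketch below. The patients and outcomes are invented, and the distance is computed on raw age and BMI for simplicity. Real analyses usually standardize covariates or match on the propensity score.

```python
# Hypothetical patients: (age, BMI, treated, outcome).
patients = [
    (60, 30, True, 5.0),
    (45, 24, True, 3.0),
    (61, 29, False, 7.0),
    (44, 25, False, 4.0),
    (70, 35, False, 9.0),
]

def distance(a, b):
    """Euclidean distance on raw age and BMI."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

treated = [p for p in patients if p[2]]
controls = [p for p in patients if not p[2]]

# For each treated patient, find the closest untreated patient
# (matching with replacement) and record the outcome difference.
differences = []
for t in treated:
    match = min(controls, key=lambda c: distance(t, c))
    differences.append(t[3] - match[3])
    print(f"treated age={t[0]} BMI={t[1]} -> control age={match[0]} BMI={match[1]}")

# Average effect among the treated, estimated from the matched pairs.
att = sum(differences) / len(differences)
print("matched estimate:", att)  # -1.5
```

The 70-year-old control is never used because no treated patient resembles them, which is the point of matching: only comparable patients are compared.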
Difference-in-Differences
Widely used in business analytics.
Suppose:
| Store | Before | After |
|---|---|---|
| Treatment | 100 | 140 |
| Control | 100 | 110 |
Treatment change:
+40
Control change:
+10
Causal estimate:
40-10=30
Interpretation:
Estimated causal effect = +30
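The same arithmetic as a small function, using the four means from the table above. The dictionary layout and the name `diff_in_diff` are illustrative choices.

```python
# Mean sales keyed by (group, period).
sales = {
    ("treatment", "before"): 100,
    ("treatment", "after"): 140,
    ("control", "before"): 100,
    ("control", "after"): 110,
}

def diff_in_diff(means):
    """Change in the treatment group minus change in the control group."""
    treatment_change = means[("treatment", "after")] - means[("treatment", "before")]
    control_change = means[("control", "after")] - means[("control", "before")]
    return treatment_change - control_change

did_estimate = diff_in_diff(sales)
print(did_estimate)  # 30
```

The control group's +10 stands in for what would have happened to the treatment stores without the change, so only the extra +30 is attributed to the treatment.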
Difference-in-Differences in Python
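import statsmodels.formula.api as smf

# df is assumed to have one row per store and period.
# Treatment and Post are assumed to be 0/1 indicator columns.
model = smf.ols(
    "Sales ~ Treatment + Post + Treatment:Post",
    data=df
).fit()
print(model.summary())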
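The coefficient on the Treatment:Post interaction term is the difference-in-differences estimate of the treatment effect.
It has a causal interpretation only if both groups would have followed parallel trends without the treatment.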
Instrumental Variables
Used when important confounders are unmeasured and cannot be adjusted for.
Requirements:
Instrument must:
- Affect treatment
- Not directly affect outcome
- Not share unmeasured causes with the outcome
Example:
Healthcare:
Distance to Hospital
may influence:
Treatment Choice
but not directly:
Patient Outcome
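With a binary instrument, the simplest estimator is the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment. The simulation below is a sketch with invented numbers, where a hypothetical unmeasured `frailty` variable drives both treatment and outcome, and distance is reduced to a near/far flag.

```python
import random
from statistics import fmean

random.seed(4)

TRUE_EFFECT = 2.0  # assumed effect of treatment on the outcome

rows = []
for _ in range(100000):
    near_hospital = random.random() < 0.5   # instrument Z
    frailty = random.gauss(0, 1)            # unmeasured confounder
    # Living nearby and being frail both raise the chance of treatment.
    treated = (1.0 * near_hospital + frailty + random.gauss(0, 1)) > 0.5
    # The instrument is absent here: it affects the outcome only through treatment.
    outcome = TRUE_EFFECT * treated - 3 * frailty + random.gauss(0, 1)
    rows.append((near_hospital, treated, outcome))

def mean_where(index, z_value):
    """Mean of one column among rows whose instrument equals z_value."""
    return fmean([float(row[index]) for row in rows if row[0] == z_value])

naive = (fmean([y for _, t, y in rows if t])
         - fmean([y for _, t, y in rows if not t]))

# Wald estimator: (effect of Z on outcome) / (effect of Z on treatment).
wald = ((mean_where(2, True) - mean_where(2, False))
        / (mean_where(1, True) - mean_where(1, False)))

print(f"naive: {naive:.2f}, IV (Wald): {wald:.2f}, true: {TRUE_EFFECT}")
```

The naive comparison is pulled well below the true effect because frail patients are treated more often and fare worse. The instrument-based estimate recovers the assumed effect up to sampling noise, at the cost of a noisier estimate.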
Causal Forests
Modern machine learning approach.
Extension of:
Random Forests
Designed to estimate:
Heterogeneous Treatment Effects
Question:
Which customers benefit most?
instead of:
What is the average effect?
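The sketch below is not a causal forest. It shows the simplest form of the same idea: estimating the effect separately within known subgroups of a randomized experiment. The segments and effect sizes are invented. A causal forest aims to discover such subgroups from many covariates instead of having them specified in advance.

```python
import random
from statistics import fmean

random.seed(5)

# Assumed true effects of a promotion on spend, by hypothetical customer segment.
TRUE_EFFECTS = {"new": 8.0, "loyal": 1.0}

rows = []
for _ in range(20000):
    segment = random.choice(["new", "loyal"])
    treated = random.random() < 0.5   # randomized, so no confounding within segments
    spend = 40 + TRUE_EFFECTS[segment] * treated + random.gauss(0, 5)
    rows.append((segment, treated, spend))

def effect(subset):
    """Difference in mean spend between treated and untreated customers."""
    return (fmean([y for _, t, y in subset if t])
            - fmean([y for _, t, y in subset if not t]))

average_effect = effect(rows)
segment_effects = {
    seg: effect([r for r in rows if r[0] == seg]) for seg in TRUE_EFFECTS
}

print(f"average effect: {average_effect:.1f}")
for seg, est in segment_effects.items():
    print(f"{seg:>5}: {est:.1f}")
```

The average effect hides the fact that almost all of the benefit comes from one segment, which is the information needed to decide whom to target.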
Healthcare Example
Question:
Does a medication reduce mortality?
Potential confounders:
- Age
- BMI
- Smoking
- Diabetes
Methods:
- Randomized Trial
- Propensity Scores
- Matching
- Instrumental Variables
Supply Chain Example
Question:
Does increasing deployment increase sales?
Potential confounders:
- Store Quality
- Customer Base
- Local Market Conditions
Methods:
- Regression Adjustment
- Matching
- Difference-in-Differences
Why Causal Inference Matters
Machine Learning answers:
Who will buy?
Causal Inference answers:
How can we increase sales?
Machine Learning answers:
Who will be readmitted?
Causal Inference answers:
How can we prevent readmission?
Prediction and causation are different.
Typical Analyst Workflow
Step 1
Define:
Treatment
and
Outcome
Step 2
Identify confounders.
Step 3
Draw a DAG.
Step 4
Choose a method:
- Randomization
- Regression
- Matching
- Propensity Scores
- Difference-in-Differences
- Instrumental Variables
Step 5
Estimate treatment effect.
Step 6
Interpret causal impact.
Practical Healthcare Exercise
Question:
Does a new medication reduce readmission?
Treatment:
Medication
Outcome:
Readmission
Confounders:
- Age
- BMI
- Diabetes
- Prior Admissions
Estimate:
Average Treatment Effect
Practical Supply Chain Exercise
Question:
Does increasing deployment increase sales?
Treatment:
Deployment Increase
Outcome:
Sales
Confounders:
- Customer Turn
- Market Size
- Historical Sales
Estimate:
Causal Impact
Lesson Summary
In this lesson we learned:
- Correlation vs Causation
- Potential Outcomes
- Average Treatment Effect
- Randomized Controlled Trials
- Confounding
- DAGs
- Regression Adjustment
- Propensity Scores
- Matching
- Difference-in-Differences
- Instrumental Variables
- Causal Forests
Causal Inference is the culmination of data analysis because it moves beyond prediction and asks the most important business and scientific question:
What intervention will actually change outcomes?
Course Summary: The Essential Data Analyst → Data Scientist Toolkit
You have now covered:
- Pandas Foundations
- Data Cleaning
- GroupBy and Aggregation
- Data Integration and Joins
- Data Visualization
- Statistical Testing
- Linear Regression
- Logistic Regression
- Poisson Regression
- Negative Binomial Regression
- Mixed Models
- Random Forests
- XGBoost
- Clustering
- Time Series Forecasting
- Survival Analysis
- Bayesian Modeling
- Causal Inference
This covers a large portion of the statistical and machine learning toolkit used by practicing data analysts and data scientists in healthcare, supply chain, retail, finance, and technology.