Lesson 18: Causal Inference: Determining Whether X Actually Causes Y

Introduction

Throughout this course, we have built increasingly sophisticated predictive models.

We learned:

  • Linear Regression
  • Logistic Regression
  • Random Forests
  • XGBoost
  • Survival Analysis
  • Bayesian Modeling

All of these models answer questions like:

What is likely to happen?

or

How strongly are two variables associated?

But there is a deeper question.

Did X actually cause Y?

This is the central question of causal inference.


Why Correlation Is Not Causation

Suppose we observe:

Ice Cream Sales    Drowning Incidents
High               High
Low                Low

A strong correlation exists.

Should we conclude:

Ice cream causes drowning?

Of course not.

A hidden variable exists:

Temperature

Hot weather causes:

  • More ice cream sales
  • More swimming
  • More drowning incidents

This hidden variable is called a:

Confounder
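
A small simulation can make the confounder concrete. The sketch below uses made-up numbers (the coefficients, noise levels, and the helper `residualize` are illustrative assumptions, not real sales or drowning data): temperature drives both series, and neither series affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Synthetic daily data: temperature drives both series; neither affects the other.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.2 * temperature + rng.normal(0, 1, n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residualize(y, x):
    """Remove the part of y that a straight-line fit on x explains."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation that remains once temperature has been accounted for.
adjusted_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(drownings, temperature),
)[0, 1]

print(f"raw correlation:             {raw_corr:.2f}")
print(f"after adjusting temperature: {adjusted_corr:.2f}")
```

In this setup the raw correlation should be clearly positive, while the adjusted correlation should sit near zero, because temperature was the only link between the two series.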

The Fundamental Problem

For each person:

We observe:

What happened

We do NOT observe:

What would have happened
otherwise

Example:

Patient receives a treatment.

Outcome:

Recovered

Question:

Would they have recovered
without treatment?

We never observe both outcomes for the same person.

This is called:

The Fundamental Problem
of Causal Inference

Potential Outcomes Framework

Suppose:

Treatment = 1

means:

Received Treatment

and

Treatment = 0

means:

No Treatment

Each individual has two potential outcomes:

Y(1)

Outcome if treated.

Y(0)

Outcome if untreated.

The individual causal effect is:

Y(1) - Y(0)

The problem:

We only observe one of the two for any individual.

Average Treatment Effect

The most commonly estimated causal quantity.

ATE = E[Y(1) - Y(0)]

Interpretation:

Average effect of treatment
across the whole population

Example:

Treatment reduces hospital stay by:

2 days

Then:

ATE = -2
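
Potential outcomes are easiest to see in a simulation, where both Y(0) and Y(1) can be generated for every patient. The sketch below is synthetic: the means, the spread, and the 2-day reduction are assumptions chosen to mirror the hospital-stay example above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic potential outcomes: hospital stay in days for every patient.
y0 = rng.normal(10, 2, n)          # Y(0): stay if untreated
y1 = y0 - 2 + rng.normal(0, 1, n)  # Y(1): stay if treated, 2 days shorter on average

individual_effect = y1 - y0        # Y(1) - Y(0): never observable in real data
true_ate = individual_effect.mean()

# In real data each patient reveals only one of the two potential outcomes.
treatment = rng.integers(0, 2, n)
observed_y = np.where(treatment == 1, y1, y0)

print(f"true ATE: {true_ate:.2f}")
```

Only `treatment` and `observed_y` would exist in a real dataset. Every method in this lesson is a way of recovering something like `true_ate` from those two columns plus covariates.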

Randomized Controlled Trials

The gold standard.

Example:

Group    Treatment
A        Yes
B        No

Random assignment ensures:

Groups are comparable on average,
in both measured and unmeasured characteristics

Therefore:

Differences in outcomes
can be interpreted causally
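
To see why randomization matters, the following sketch compares a coin-flip assignment with a self-selected one on the same synthetic patients. The variable `severity`, the effect of -2 days, and the helper `difference_in_means` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

severity = rng.normal(0, 1, n)                # sicker patients stay longer
y0 = 10 + 2 * severity + rng.normal(0, 1, n)  # Y(0)
y1 = y0 - 2                                   # Y(1): constant effect of -2 days

def difference_in_means(treatment):
    """Mean observed outcome of the treated minus that of the untreated."""
    observed = np.where(treatment == 1, y1, y0)
    return observed[treatment == 1].mean() - observed[treatment == 0].mean()

# Randomized: a coin flip decides treatment, independent of severity.
randomized = rng.integers(0, 2, n)

# Self-selected: sicker patients are more likely to end up treated.
self_selected = (severity + rng.normal(0, 1, n) > 0).astype(int)

rct_estimate = difference_in_means(randomized)
naive_estimate = difference_in_means(self_selected)

print(f"randomized estimate:    {rct_estimate:.2f}")
print(f"self-selected estimate: {naive_estimate:.2f}")
```

Under these assumptions the randomized estimate should land close to -2, while the self-selected comparison is pulled upward because the treated group started out sicker.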

Healthcare Example

Question:

Does a new medication
reduce mortality?

Randomly assign:

Treatment

or

Placebo

Compare outcomes.


Supply Chain Example

Question:

Does a new replenishment
strategy increase sales?

Randomly assign stores:

New Strategy

vs

Old Strategy

Compare performance.


Observational Data Problems

Most analysts do NOT have randomized experiments.

Example:

Patients choose treatment

or

Stores choose inventory levels

This introduces:

Selection Bias

Confounding Variables

Suppose:

Higher Inventory

is associated with:

Higher Sales

Can we conclude:

Inventory causes sales?

Maybe not.

Possible confounder:

Store Quality

Better stores:

  • Hold more inventory
  • Generate more sales

Without controlling for store quality:

The estimated effect of inventory
on sales is biased

Directed Acyclic Graphs (DAGs)

DAGs help visualize causal relationships.

Example:

Store Quality → Inventory → Sales

or

Store Quality     Inventory
        ↘         ↙
          Sales

These diagrams help identify:

  • Confounders
  • Mediators
  • Colliders
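
A DAG can also be written down as a plain data structure and queried. The sketch below is one possible encoding, assuming the store-quality story from the previous section; the helper names `descendants`, `common_causes`, and `mediators` are hypothetical, and the code checks only for confounders and mediators, not colliders.

```python
# Each key is a node; its list holds the nodes it points to (its children).
dag = {
    "StoreQuality": ["Inventory", "Sales"],
    "Inventory": ["Sales"],
    "Sales": [],
}

def descendants(graph, node, blocked=None):
    """Nodes reachable from `node` by following arrows, never passing through `blocked`."""
    found, stack = set(), list(graph[node])
    while stack:
        current = stack.pop()
        if current == blocked or current in found:
            continue
        found.add(current)
        stack.extend(graph[current])
    return found

def common_causes(graph, treatment, outcome):
    """Nodes that reach the treatment and also reach the outcome by a path avoiding the treatment."""
    return {
        node for node in graph
        if node not in (treatment, outcome)
        and treatment in descendants(graph, node)
        and outcome in descendants(graph, node, blocked=treatment)
    }

def mediators(graph, treatment, outcome):
    """Nodes lying on a directed path from treatment to outcome."""
    return {
        node for node in descendants(graph, treatment)
        if node != outcome and outcome in descendants(graph, node)
    }

print(common_causes(dag, "Inventory", "Sales"))  # {'StoreQuality'}
print(mediators(dag, "Inventory", "Sales"))      # set()
```

The `blocked` argument is there so that a variable affecting the outcome only through the treatment is not mistaken for a confounder.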

Regression Adjustment

One of the simplest causal methods.

Example:

import statsmodels.formula.api as smf

# df is assumed to hold one row per store with the columns
# Sales, Inventory, and StoreQuality.
model = smf.ols(
    "Sales ~ Inventory + StoreQuality",
    data=df
).fit()

print(model.summary())

By controlling for:

StoreQuality

we attempt to isolate:

Inventory Effect

This works only if every important confounder is measured and included in the model.
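
The effect of adjustment is easiest to check on data where the true answer is known. The sketch below simulates stores with a true inventory effect of 2.0 and fits the regression with and without store quality. It uses plain NumPy least squares rather than statsmodels to keep the example dependency-light; all coefficients and the helper `ols_coefficients` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Synthetic stores: quality raises both inventory and sales.
store_quality = rng.normal(0, 1, n)
inventory = 50 + 10 * store_quality + rng.normal(0, 5, n)
sales = 200 + 2.0 * inventory + 30 * store_quality + rng.normal(0, 10, n)  # true effect = 2.0

def ols_coefficients(y, *columns):
    """Least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients

# Index 1 is the inventory coefficient in both fits (index 0 is the intercept).
unadjusted_effect = ols_coefficients(sales, inventory)[1]
adjusted_effect = ols_coefficients(sales, inventory, store_quality)[1]

print(f"Sales ~ Inventory:                {unadjusted_effect:.2f}")
print(f"Sales ~ Inventory + StoreQuality: {adjusted_effect:.2f}")
```

With these settings the unadjusted coefficient should come out well above 2.0, and the adjusted one should be close to it.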

Propensity Scores

A powerful observational method.

Question:

How likely was someone
to receive treatment?

Estimate:

P(Treatment=1|X)

using logistic regression.


Example

from sklearn.linear_model import LogisticRegression

# X holds the confounders; treatment is a 0/1 array.
ps_model = LogisticRegression()
ps_model.fit(X, treatment)

# Column 1 of predict_proba is P(Treatment=1|X).
propensity_scores = ps_model.predict_proba(X)[:, 1]

These probabilities help create balanced groups through matching, weighting, or stratification.
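
One common way to use propensity scores is inverse probability weighting (IPW), where each observation is weighted by one over the probability of the treatment it actually received. The sketch below is an illustration on synthetic patients, not a full analysis: the single confounder (age), the true effect of -2, and the weighting scheme are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20_000

# Synthetic patients: older patients are treated more often and have worse outcomes.
age = rng.normal(60, 10, n)
X = ((age - 60) / 10).reshape(-1, 1)           # standardized age as the only confounder
true_propensity = 1 / (1 + np.exp(-X[:, 0]))
treatment = rng.binomial(1, true_propensity)
outcome = 10 + 3 * X[:, 0] - 2 * treatment + rng.normal(0, 1, n)  # true effect = -2

ps_model = LogisticRegression()
ps_model.fit(X, treatment)
propensity_scores = ps_model.predict_proba(X)[:, 1]

# Weight each unit by the inverse probability of the treatment it actually received.
weights = np.where(treatment == 1, 1 / propensity_scores, 1 / (1 - propensity_scores))

treated = treatment == 1
ipw_estimate = (
    np.average(outcome[treated], weights=weights[treated])
    - np.average(outcome[~treated], weights=weights[~treated])
)
naive_estimate = outcome[treated].mean() - outcome[~treated].mean()

print(f"naive difference: {naive_estimate:.2f}")
print(f"IPW estimate:     {ipw_estimate:.2f}")
```

In practice it is worth inspecting the scores before weighting: values very close to 0 or 1 produce extreme weights and unstable estimates.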


Matching

Match similar observations.

Example:

Patient A:

Age = 60
BMI = 30

Treatment:

Yes

Find similar patient:

Age = 61
BMI = 29

Treatment:

No

Compare outcomes.
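
The idea can be sketched as one-to-one nearest-neighbor matching on age and BMI. Everything below is synthetic and illustrative: the treatment rule, the true effect of -2, and the choice to match with replacement on standardized covariates are assumptions, and real analyses usually add a caliper and balance checks.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000

age = rng.normal(60, 10, n)
bmi = rng.normal(28, 4, n)
# Older, heavier patients are more likely to be treated.
p_treat = 1 / (1 + np.exp(-((age - 60) / 10 + (bmi - 28) / 4)))
treatment = rng.binomial(1, p_treat)
outcome = 0.2 * age + 0.5 * bmi - 2 * treatment + rng.normal(0, 1, n)  # true effect = -2

# Standardize so that age and BMI contribute on the same scale.
covariates = np.column_stack([age, bmi])
covariates = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)

treated_idx = np.flatnonzero(treatment == 1)
control_idx = np.flatnonzero(treatment == 0)

# For each treated patient, find the closest control (matching with replacement).
distances = np.linalg.norm(
    covariates[treated_idx, None, :] - covariates[None, control_idx, :], axis=2
)
nearest_control = control_idx[distances.argmin(axis=1)]

# Average outcome gap within matched pairs: an estimate of the effect on the treated.
matched_estimate = (outcome[treated_idx] - outcome[nearest_control]).mean()
naive_estimate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

print(f"naive difference: {naive_estimate:.2f}")
print(f"matched estimate: {matched_estimate:.2f}")
```

Because every treated patient is compared with a control of similar age and BMI, the matched estimate should sit much closer to -2 than the naive difference does.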


Difference-in-Differences

Widely used in business analytics.

Suppose:

Store        Before    After
Treatment    100       140
Control      100       110

Treatment change:

+40

Control change:

+10

Causal estimate:

40-10=30

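Interpretation:

Estimated causal effect
= +30

This assumes parallel trends: without the treatment, the treatment store would have changed by the same +10 as the control store.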
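
Written as a formula, the calculation above is the following, where the bar notation and the subscripts T (treatment) and C (control) are introduced here only for illustration:

```latex
\hat{\delta}_{\text{DiD}}
  = \left(\bar{Y}_{T,\text{after}} - \bar{Y}_{T,\text{before}}\right)
  - \left(\bar{Y}_{C,\text{after}} - \bar{Y}_{C,\text{before}}\right)
  = (140 - 100) - (110 - 100)
  = 30
```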

Difference-in-Differences in Python

import statsmodels.formula.api as smf

# df is assumed to hold one row per store and period;
# Treatment and Post are 0/1 indicators.
model = smf.ols(
    "Sales ~ Treatment + Post + Treatment:Post",
    data=df
).fit()

print(model.summary())

The coefficient on the interaction term Treatment:Post is the difference-in-differences estimate of the treatment effect.
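
As a quick check, the same regression can be fit to the four numbers from the table in the previous section. The sketch below builds the design matrix by hand with NumPy instead of using the statsmodels formula interface, purely to keep the example self-contained.

```python
import numpy as np

# The four cells of the table: (Treatment, Post, Sales).
rows = [
    (1, 0, 100),  # treatment store, before
    (1, 1, 140),  # treatment store, after
    (0, 0, 100),  # control store, before
    (0, 1, 110),  # control store, after
]
treatment, post, sales = (np.array(column, dtype=float) for column in zip(*rows))

# Design matrix for Sales ~ Treatment + Post + Treatment:Post (plus an intercept).
design = np.column_stack([np.ones(4), treatment, post, treatment * post])
intercept, b_treatment, b_post, b_interaction = np.linalg.lstsq(design, sales, rcond=None)[0]

print(round(float(b_interaction), 6))  # 30.0
```

The interaction coefficient matches the hand calculation of 40 - 10 = 30, and the coefficient on Post picks up the +10 change shared by both stores.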


Instrumental Variables

Used when important confounders cannot be measured.

Requirements:

Instrument must:

  1. Affect treatment
  2. Affect the outcome only through treatment
  3. Share no unmeasured causes with the outcome

Example:

Healthcare:

Distance to Hospital

may influence:

Treatment Choice

but not directly:

Patient Outcome
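
A minimal sketch of the instrumental-variable idea, on synthetic data: `frailty` plays the role of an unmeasured confounder, `distance` is the instrument, and the treatment is modeled as a continuous intensity for simplicity. These variable names, the coefficients, and the true effect of -2 are all assumptions for illustration. The estimator shown is the simple ratio (Wald) form, which equals two-stage least squares when there is one instrument and no other covariates.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

distance = rng.normal(0, 1, n)   # instrument: distance to hospital (standardized)
frailty = rng.normal(0, 1, n)    # unmeasured confounder
treatment = -1.0 * distance + frailty + rng.normal(0, 1, n)
outcome = -2.0 * treatment + 3.0 * frailty + rng.normal(0, 1, n)  # true effect = -2

# Naive slope of outcome on treatment: biased because frailty drives both.
naive_estimate = np.cov(treatment, outcome)[0, 1] / np.var(treatment, ddof=1)

# IV estimate: effect of the instrument on the outcome,
# divided by its effect on the treatment.
iv_estimate = np.cov(distance, outcome)[0, 1] / np.cov(distance, treatment)[0, 1]

print(f"naive estimate: {naive_estimate:.2f}")
print(f"IV estimate:    {iv_estimate:.2f}")
```

The naive slope should come out near -1 in this setup, while the IV estimate recovers a value near -2 because distance is unrelated to frailty by construction.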

Causal Forests

Modern machine learning approach.

Extension of:

Random Forests

Designed to estimate:

Heterogeneous Treatment Effects

Question:

Which customers benefit most?

instead of:

What is the average effect?
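
The difference between the two questions can be illustrated without a causal forest. The sketch below is not a causal forest: it is a plain subgroup comparison in a synthetic randomized campaign, where the subgroup (`is_loyal`) is known in advance. A causal forest aims to discover such subgroups from many covariates automatically. All names and effect sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Synthetic randomized campaign: the offer helps loyal customers more.
is_loyal = rng.integers(0, 2, n).astype(bool)
treatment = rng.integers(0, 2, n)
true_effect = np.where(is_loyal, 10.0, 2.0)
spend = 50 + true_effect * treatment + rng.normal(0, 5, n)

def subgroup_effect(mask):
    """Difference in mean spend, treated minus control, within one subgroup."""
    return spend[mask & (treatment == 1)].mean() - spend[mask & (treatment == 0)].mean()

effect_loyal = subgroup_effect(is_loyal)
effect_other = subgroup_effect(~is_loyal)
average_effect = spend[treatment == 1].mean() - spend[treatment == 0].mean()

print(f"average effect:         {average_effect:.2f}")
print(f"effect, loyal:          {effect_loyal:.2f}")
print(f"effect, other customers: {effect_other:.2f}")
```

The average effect hides the fact that, in this setup, most of the benefit comes from one group of customers.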

Healthcare Example

Question:

Does a medication
reduce mortality?

Potential confounders:

  • Age
  • BMI
  • Smoking
  • Diabetes

Methods:

  • Randomized Trial
  • Propensity Scores
  • Matching
  • Instrumental Variables

Supply Chain Example

Question:

Does increasing deployment
increase sales?

Potential confounders:

  • Store Quality
  • Customer Base
  • Local Market Conditions

Methods:

  • Regression Adjustment
  • Matching
  • Difference-in-Differences

Why Causal Inference Matters

Machine Learning answers:

Who will buy?

Causal Inference answers:

How can we increase sales?

Machine Learning answers:

Who will be readmitted?

Causal Inference answers:

How can we prevent readmission?

Prediction and causation are different.


Typical Analyst Workflow

Step 1

Define:

Treatment

and

Outcome

Step 2

Identify confounders.


Step 3

Draw a DAG.


Step 4

Choose a method:

  • Randomization
  • Regression
  • Matching
  • Propensity Scores
  • Difference-in-Differences
  • Instrumental Variables

Step 5

Estimate treatment effect.


Step 6

Interpret causal impact.


Practical Healthcare Exercise

Question:

Does a new medication
reduce readmission?

Treatment:

Medication

Outcome:

Readmission

Confounders:

  • Age
  • BMI
  • Diabetes
  • Prior Admissions

Estimate:

Average Treatment Effect

Practical Supply Chain Exercise

Question:

Does increasing deployment
increase sales?

Treatment:

Deployment Increase

Outcome:

Sales

Confounders:

  • Customer Turn
  • Market Size
  • Historical Sales

Estimate:

Causal Impact

Lesson Summary

In this lesson we learned:

  • Correlation vs Causation
  • Potential Outcomes
  • Average Treatment Effect
  • Randomized Controlled Trials
  • Confounding
  • DAGs
  • Regression Adjustment
  • Propensity Scores
  • Matching
  • Difference-in-Differences
  • Instrumental Variables
  • Causal Forests

Causal Inference is the culmination of data analysis because it moves beyond prediction and asks the most important business and scientific question:

What intervention
will actually change outcomes?

Course Summary: The Essential Data Analyst → Data Scientist Toolkit

You have now covered:

  1. Pandas Foundations
  2. Data Cleaning
  3. GroupBy and Aggregation
  4. Data Integration and Joins
  5. Data Visualization
  6. Statistical Testing
  7. Linear Regression
  8. Logistic Regression
  9. Poisson Regression
  10. Negative Binomial Regression
  11. Mixed Models
  12. Random Forests
  13. XGBoost
  14. Clustering
  15. Time Series Forecasting
  16. Survival Analysis
  17. Bayesian Modeling
  18. Causal Inference

This covers a large portion of the statistical and machine learning toolkit used by practicing data analysts and data scientists in healthcare, supply chain, retail, finance, and technology.

