Lesson 18: Causal Inference: Determining Whether X Actually Causes Y

pj316

Introduction

Throughout this course, we have built increasingly sophisticated predictive models.

We learned:

Linear Regression
Logistic Regression
Random Forests
XGBoost
Survival Analysis
Bayesian Modeling

All of these models answer questions like:

What is likely to happen?

How strongly are two variables associated?

But there is a deeper question.

Did X actually cause Y?

This is the central question of causal inference.

Why Correlation Is Not Causation

Suppose we observe:

Ice Cream Sales	Drowning Incidents
High	High
Low	Low

A strong correlation exists.

Should we conclude:

Ice cream causes drowning?

Of course not.

A hidden variable exists:

Temperature

Hot weather causes:

More ice cream sales
More swimming
More drowning incidents

This hidden variable is called a:

Confounder
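
A small simulation can make this concrete. The sketch below is only an illustration: the temperature, sales, and drowning numbers are invented, and the helper names (correlation, residuals) are my own. Temperature drives both series and neither series affects the other, yet they end up strongly correlated until temperature is accounted for.

```python
import random
from statistics import fmean

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its straight-line dependence on xs."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
n = 5000

# Hypothetical daily data: temperature is the only common driver.
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw_corr = correlation(ice_cream, drownings)
adjusted_corr = correlation(residuals(ice_cream, temperature),
                            residuals(drownings, temperature))

print(f"Raw correlation:                 {raw_corr:.2f}")
print(f"After adjusting for temperature: {adjusted_corr:.2f}")
```

The raw correlation is large, while the correlation that remains after removing temperature from both series is close to zero.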

The Fundamental Problem

For each person:

We observe:

What happened

We do NOT observe:

			
What would have happened otherwise

Example:

Patient receives a treatment.

Outcome:

Recovered

Question:

			
Would they have recovered without treatment?

We never observe both realities simultaneously.

This is called:

			
The Fundamental Problem of Causal Inference

Potential Outcomes Framework

Suppose:

Treatment = 1

means:

Received Treatment

and

Treatment = 0

means:

No Treatment

Each individual has two potential outcomes:

Y(1)

Outcome if treated.

Y(0)

Outcome if untreated.

The causal effect is:

Y(1)-Y(0)

The problem:

We only observe one.
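
To see what "we only observe one" looks like in data, here is a minimal sketch with six made-up patients. The outcome (days in hospital) and all values are hypothetical. In a simulation we can generate both Y(1) and Y(0), but the printed table hides the one that would be missing in real data.

```python
import random

rng = random.Random(1)

patients = []
for patient_id in range(1, 7):
    y0 = rng.randint(4, 10)          # days in hospital if untreated: Y(0)
    y1 = y0 - rng.randint(0, 4)      # days in hospital if treated: Y(1)
    treated = rng.randint(0, 1)      # which of the two worlds actually happens
    observed = y1 if treated else y0
    patients.append({"id": patient_id, "treated": treated,
                     "y0": y0, "y1": y1, "observed": observed})

print("id  treated  Y(1)  Y(0)  observed")
for p in patients:
    # The potential outcome we did not observe is shown as "?".
    shown_y1 = p["y1"] if p["treated"] else "?"
    shown_y0 = "?" if p["treated"] else p["y0"]
    print(f'{p["id"]:>2}  {p["treated"]:>7}  {shown_y1!s:>4}  {shown_y0!s:>4}  {p["observed"]:>8}')
```

Every row has exactly one "?", so the individual effect Y(1)-Y(0) can never be computed directly from observed data.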

Average Treatment Effect

The most common causal quantity.

ATE=E[Y(1)-Y(0)]

Interpretation:

			
Average effect of treatment

Example:

Treatment reduces the average hospital stay by:

2 days

Then:

ATE = -2 days

Randomized Controlled Trials

The gold standard.

Example:

Group	Treatment
A	Yes
B	No

Random assignment ensures:

Groups are comparable on average, in both measured and unmeasured characteristics

Therefore:

Differences in average outcomes between the groups can be interpreted causally
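
The sketch below illustrates why randomization works, under invented assumptions: a hypothetical "severity" variable affects the length of stay, and treatment shortens every stay by exactly 2 days. Because treatment is assigned by coin flip, severity is balanced across groups and the simple difference in means lands close to the true ATE.

```python
import random
from statistics import fmean

rng = random.Random(2)
n = 20000

rows = []
for _ in range(n):
    severity = rng.gauss(0, 1)                  # hypothetical patient characteristic
    y0 = 8 + 2 * severity + rng.gauss(0, 1)     # stay in days if untreated
    y1 = y0 - 2                                 # stay in days if treated
    treated = rng.random() < 0.5                # coin-flip assignment
    observed = y1 if treated else y0
    rows.append((severity, treated, observed, y1 - y0))

# Only knowable in a simulation: the average of the individual effects.
true_ate = fmean(r[3] for r in rows)

treated_rows = [r for r in rows if r[1]]
control_rows = [r for r in rows if not r[1]]

# What an analyst can compute from observed data.
diff_in_means = fmean(r[2] for r in treated_rows) - fmean(r[2] for r in control_rows)
severity_gap = fmean(r[0] for r in treated_rows) - fmean(r[0] for r in control_rows)

print(f"True ATE:            {true_ate:.2f}")
print(f"Difference in means: {diff_in_means:.2f}")
print(f"Severity imbalance:  {severity_gap:.2f}")
```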

Healthcare Example

Question:

			
Does a new medication reduce mortality?

Randomly assign:

Treatment

Placebo

Compare outcomes.

Supply Chain Example

Question:

			
Does a new replenishment strategy increase sales?

Randomly assign stores:

New Strategy

Old Strategy

Compare performance.

Observational Data Problems

Most analysts do NOT have randomized experiments.

Examples:

Patients choose treatment

Stores choose inventory levels

This introduces:

Selection Bias

Confounding Variables

Suppose:

Higher Inventory

is associated with:

Higher Sales

Can we conclude:

Inventory causes sales?

Maybe not.

Possible confounder:

Store Quality

Better stores:

Hold more inventory
Generate more sales

Without controlling for store quality:

The inventory effect is mixed with the store quality effect, so causal conclusions are invalid

Directed Acyclic Graphs (DAGs)

DAGs help visualize causal relationships.

Example:

			
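Store Quality → Inventory
Store Quality → Sales
Inventory → Sales

Store Quality points into both Inventory and Sales, which makes it a confounder of the Inventory → Sales relationship.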

These diagrams help identify:

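Confounders (common causes of treatment and outcome)
Mediators (variables on the path from treatment to outcome)
Colliders (common effects of two other variables)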
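
A DAG can also be written down as plain data. The sketch below encodes the store example (better stores hold more inventory and generate more sales) as a dictionary and looks for variables that are upstream of both the treatment and the outcome. The function names are my own, and the "shared ancestor" rule is a simplification of the full graphical criteria, so treat it as a first-pass check only.

```python
# Each key is a cause; each value lists its direct effects.
dag = {
    "StoreQuality": ["Inventory", "Sales"],
    "Inventory": ["Sales"],
    "Sales": [],
}

def ancestors(graph, node):
    """Return every variable with a directed path into `node`."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent, children in graph.items():
            if current in children and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def candidate_confounders(graph, treatment, outcome):
    """Variables that are upstream of both the treatment and the outcome."""
    return ancestors(graph, treatment) & ancestors(graph, outcome)

print(candidate_confounders(dag, "Inventory", "Sales"))
```

For this graph the only candidate is StoreQuality, which matches the intuition from the inventory example.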

Regression Adjustment

One of the simplest causal methods.

Example:

			
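import statsmodels.formula.api as smf

# df is assumed to be a pandas DataFrame with one row per store and
# numeric columns Sales, Inventory, and StoreQuality.
model = smf.ols(
    "Sales ~ Inventory + StoreQuality",
    data=df
).fit()

# The Inventory coefficient is the association with Sales,
# holding StoreQuality fixed.
print(model.summary())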

		

By controlling for:

StoreQuality

we attempt to isolate:

Inventory Effect
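
To check that adjustment actually recovers the inventory effect, here is a simulated sketch where the true effect is known (2 sales units per inventory unit, a number I made up). It uses only the standard library rather than statsmodels: the adjusted slope is computed by removing StoreQuality from both Sales and Inventory and regressing the leftovers on each other, which gives the same Inventory coefficient as the two-variable regression (the Frisch-Waugh-Lovell result).

```python
import random
from statistics import fmean

def slope(ys, xs):
    """Slope of the least-squares line of ys on xs."""
    mx, my = fmean(xs), fmean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def residuals(ys, xs):
    """Part of ys not explained by a straight line in xs."""
    mx, my = fmean(xs), fmean(ys)
    b = slope(ys, xs)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(3)
n = 5000
TRUE_EFFECT = 2.0   # hypothetical true effect of inventory on sales

# Better stores hold more inventory AND sell more for other reasons.
store_quality = [rng.gauss(0, 1) for _ in range(n)]
inventory = [50 + 10 * q + rng.gauss(0, 5) for q in store_quality]
sales = [100 + TRUE_EFFECT * inv + 30 * q + rng.gauss(0, 10)
         for inv, q in zip(inventory, store_quality)]

naive = slope(sales, inventory)
adjusted = slope(residuals(sales, store_quality),
                 residuals(inventory, store_quality))

print(f"True effect:      {TRUE_EFFECT:.2f}")
print(f"Naive slope:      {naive:.2f}")
print(f"Adjusted slope:   {adjusted:.2f}")
```

The naive slope is far too large because it absorbs the store quality effect. The adjusted slope is close to the truth, but only because the confounder was measured and the relationship really is linear in this simulation.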

Propensity Scores

A powerful observational method.

Question:

			
How likely was someone to receive treatment?

Estimate:

P(Treatment=1|X)

using logistic regression.

Example

			
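from sklearn.linear_model import LogisticRegression

# X: covariate matrix (the confounders); treatment: 0/1 array, one entry per row of X.
ps_model = LogisticRegression()
ps_model.fit(X, treatment)

# Column 1 of predict_proba is P(Treatment=1 | X).
propensity_scores = ps_model.predict_proba(X)[:, 1]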

		

These probabilities help create balanced groups, for example by matching, stratifying, or weighting on the score.
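
One way to use the scores is inverse probability weighting. The sketch below is a simplified illustration with invented data: there is a single binary covariate (a hypothetical "high risk" flag), so the propensity score is estimated as the share treated in each group instead of with logistic regression. High-risk patients are treated more often and also stay longer, which biases the naive comparison.

```python
import random
from statistics import fmean

rng = random.Random(4)
n = 20000

rows = []
for _ in range(n):
    high_risk = int(rng.random() < 0.5)
    treated = int(rng.random() < (0.8 if high_risk else 0.2))
    # Hypothetical outcome in days; the true treatment effect is -2.
    outcome = 10 + 5 * high_risk - 2 * treated + rng.gauss(0, 1)
    rows.append((high_risk, treated, outcome))

# Propensity score: estimated P(treated = 1 | high_risk) for each group.
propensity = {}
for group in (0, 1):
    propensity[group] = fmean(t for x, t, y in rows if x == group)

naive = (fmean(y for x, t, y in rows if t == 1)
         - fmean(y for x, t, y in rows if t == 0))

# Weight each unit by 1 / P(receiving the treatment it actually got).
treated_mean = (sum(y / propensity[x] for x, t, y in rows if t == 1)
                / sum(1 / propensity[x] for x, t, y in rows if t == 1))
control_mean = (sum(y / (1 - propensity[x]) for x, t, y in rows if t == 0)
                / sum(1 / (1 - propensity[x]) for x, t, y in rows if t == 0))
ipw_ate = treated_mean - control_mean

print(f"Naive difference: {naive:.2f}")
print(f"IPW estimate:     {ipw_ate:.2f}")
```

The naive difference even has the wrong sign here, while the weighted estimate is close to -2. This only works when every confounder is included in the propensity model.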

Matching

Pair each treated observation with an untreated observation that has similar covariates.

Example:

Patient A:

			
Age = 60
BMI = 30

Treatment:

Yes

Find similar patient:

			
Age = 61
BMI = 29

Treatment:

No

Compare outcomes.
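
A minimal nearest-neighbor version of this idea is sketched below. The first treated patient and the first untreated patient use the ages and BMIs from the example above; the other patients and all outcome values (stay_days) are invented for illustration. Distance is plain Euclidean distance on age and BMI, which is an assumption; real analyses usually rescale the covariates first.

```python
import math

# Hypothetical patients; stay_days is the outcome.
treated_patients = [
    {"age": 60, "bmi": 30, "stay_days": 5},
    {"age": 45, "bmi": 24, "stay_days": 3},
]
untreated_patients = [
    {"age": 61, "bmi": 29, "stay_days": 7},
    {"age": 44, "bmi": 25, "stay_days": 4},
    {"age": 30, "bmi": 22, "stay_days": 2},
]

def distance(a, b):
    """Euclidean distance on the raw covariates (unscaled, for simplicity)."""
    return math.hypot(a["age"] - b["age"], a["bmi"] - b["bmi"])

# For each treated patient, pick the closest untreated patient.
# The same untreated patient may be reused (matching with replacement).
pairs = []
for patient in treated_patients:
    match = min(untreated_patients, key=lambda other: distance(patient, other))
    pairs.append((patient, match))

# Average outcome difference within matched pairs.
effect_on_treated = sum(p["stay_days"] - m["stay_days"] for p, m in pairs) / len(pairs)
print(f"Estimated effect for treated patients: {effect_on_treated:.1f} days")
```

Because matches are found for treated patients only, this estimates the effect among the treated rather than the ATE for everyone.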

Difference-in-Differences

Widely used in business analytics.

Suppose:

Group	Sales Before	Sales After
Treatment	100	140
Control	100	110

Treatment change:

+40

Control change:

+10

Causal estimate:

40-10=30

Interpretation:

			
Estimated causal effect = +30

Difference-in-Differences in Python

			
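import statsmodels.formula.api as smf

# df is assumed to hold one row per store and period, with 0/1 columns
# Treatment (treated group) and Post (after the change).
model = smf.ols(
    "Sales ~ Treatment + Post + Treatment:Post",
    data=df
).fit()

# The Treatment:Post coefficient is the difference-in-differences estimate.
print(model.summary())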

		

The interaction term estimates the treatment effect, assuming both groups would have followed parallel trends without the treatment.
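
To connect the regression to the arithmetic, the sketch below works out what each coefficient equals when the model is fit to the four group averages from the table (treatment 100 and 140, control 100 and 110). With only those four numbers the regression fits them exactly, so the coefficients can be read off by hand. The variable names are my own.

```python
# Average sales for each group and period, taken from the table.
sales = {
    ("treatment", "before"): 100,
    ("treatment", "after"): 140,
    ("control", "before"): 100,
    ("control", "after"): 110,
}

# Intercept: control group, before period.
intercept = sales[("control", "before")]
# Treatment: gap between groups before anything changed.
treatment_coef = sales[("treatment", "before")] - sales[("control", "before")]
# Post: change over time in the control group.
post_coef = sales[("control", "after")] - sales[("control", "before")]
# Treatment:Post: extra change in the treatment group beyond the control change.
interaction_coef = ((sales[("treatment", "after")] - sales[("treatment", "before")])
                    - (sales[("control", "after")] - sales[("control", "before")]))

def predict(treatment, post):
    """Fitted value of Sales ~ Treatment + Post + Treatment:Post."""
    return (intercept + treatment_coef * treatment
            + post_coef * post + interaction_coef * treatment * post)

print("Interaction coefficient:", interaction_coef)   # 30
print("Fitted treatment/after: ", predict(1, 1))      # 140
```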

Instrumental Variables

Used when important confounders cannot be measured.

Requirements:

Instrument must:

Affect treatment
Not directly affect outcome
Share no unmeasured causes with the outcome

Example:

Healthcare:

Distance to Hospital

may influence:

Treatment Choice

but not directly:

Patient Outcome
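
The simplest instrumental variable estimate divides the instrument's effect on the outcome by its effect on the treatment. The sketch below simulates this under invented assumptions: "near" is a hypothetical flag for living close to the hospital, "health" is a confounder the analyst cannot see, and the true treatment effect is fixed at 1.5.

```python
import random
from statistics import fmean

rng = random.Random(5)
n = 40000
TRUE_EFFECT = 1.5   # hypothetical true effect of treatment on the outcome

rows = []
for _ in range(n):
    health = rng.gauss(0, 1)             # unmeasured confounder
    near = int(rng.random() < 0.5)       # instrument: unrelated to health
    # Living near the hospital and better health both raise the chance of treatment.
    treated = int(2 * near + health + rng.gauss(0, 1) > 1)
    # The instrument does not appear here: it affects the outcome only via treatment.
    outcome = TRUE_EFFECT * treated + 2 * health + rng.gauss(0, 1)
    rows.append((near, treated, outcome))

def mean_where(index, condition):
    """Average of column `index` over rows that satisfy `condition`."""
    return fmean(r[index] for r in rows if condition(r))

# Naive comparison of treated vs untreated (confounded by health).
naive = mean_where(2, lambda r: r[1] == 1) - mean_where(2, lambda r: r[1] == 0)

# Effect of the instrument on the outcome and on the treatment.
effect_on_outcome = mean_where(2, lambda r: r[0] == 1) - mean_where(2, lambda r: r[0] == 0)
effect_on_treatment = mean_where(1, lambda r: r[0] == 1) - mean_where(1, lambda r: r[0] == 0)

iv_estimate = effect_on_outcome / effect_on_treatment

print(f"True effect:  {TRUE_EFFECT:.2f}")
print(f"Naive:        {naive:.2f}")
print(f"IV estimate:  {iv_estimate:.2f}")
```

The naive estimate is badly inflated, while the ratio recovers a value near 1.5 without ever using the health variable. The result depends entirely on the instrument requirements holding, which cannot be fully verified from data.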

Causal Forests

Modern machine learning approach.

Extension of:

Random Forests

Designed to estimate:

Heterogeneous Treatment Effects

Question:

Which customers benefit most?

instead of:

What is the average effect?
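
The sketch below is not a causal forest. It is a deliberately simpler stand-in that shows what a heterogeneous treatment effect is, using invented data: a randomized promotion that helps hypothetical "new" customers much more than "loyal" ones. A causal forest searches for such subgroups automatically across many covariates; packages such as grf (R) and EconML (Python) provide implementations.

```python
import random
from statistics import fmean

rng = random.Random(6)
n = 20000

# Hypothetical true effects of a promotion on spend, by customer segment.
TRUE_EFFECTS = {"new": 5.0, "loyal": 1.0}

rows = []
for _ in range(n):
    segment = "new" if rng.random() < 0.5 else "loyal"
    treated = int(rng.random() < 0.5)            # randomized promotion
    spend = 20 + TRUE_EFFECTS[segment] * treated + rng.gauss(0, 2)
    rows.append((segment, treated, spend))

def diff_in_means(subset):
    """Treated minus untreated average spend within a set of rows."""
    return (fmean(y for s, t, y in subset if t == 1)
            - fmean(y for s, t, y in subset if t == 0))

average_effect = diff_in_means(rows)
effect_by_segment = {
    segment: diff_in_means([r for r in rows if r[0] == segment])
    for segment in TRUE_EFFECTS
}

print(f"Average effect: {average_effect:.2f}")
for segment, effect in effect_by_segment.items():
    print(f"Effect for {segment} customers: {effect:.2f}")
```

The average effect of about 3 hides the fact that almost all of the benefit comes from one segment.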

Healthcare Example

Question:

			
Does a medication reduce mortality?

Potential confounders:

Age
BMI
Smoking
Diabetes

Methods:

Randomized Trial
Propensity Scores
Matching
Instrumental Variables

Supply Chain Example

Question:

			
Does increasing deployment increase sales?

Potential confounders:

Store Quality
Customer Base
Local Market Conditions

Methods:

Regression Adjustment
Matching
Difference-in-Differences

Why Causal Inference Matters

Machine Learning answers:

Who will buy?

Causal Inference answers:

How can we increase sales?

Machine Learning answers:

Who will be readmitted?

Causal Inference answers:

How can we prevent readmission?

Prediction and causation are different.

Typical Analyst Workflow

Step 1

Define:

Treatment

and

Outcome

Step 2

Identify confounders.

Step 3

Draw a DAG.

Step 4

Choose a method:

Randomization
Regression
Matching
Propensity Scores
Difference-in-Differences
Instrumental Variables

Step 5

Estimate treatment effect.

Step 6

Interpret causal impact.

Practical Healthcare Exercise

Question:

			
Does a new medication reduce readmission?

Treatment:

Medication

Outcome:

Readmission

Confounders:

Age
BMI
Diabetes
Prior Admissions

Estimate:

Average Treatment Effect

Practical Supply Chain Exercise

Question:

			
Does increasing deployment increase sales?

Treatment:

Deployment Increase

Outcome:

Sales

Confounders:

Customer Turn
Market Size
Historical Sales

Estimate:

Causal Impact

Lesson Summary

In this lesson we learned:

Correlation vs Causation
Potential Outcomes
Average Treatment Effect
Randomized Controlled Trials
Confounding
DAGs
Regression Adjustment
Propensity Scores
Matching
Difference-in-Differences
Instrumental Variables
Causal Forests

Causal Inference is the culmination of data analysis because it moves beyond prediction and asks the most important business and scientific question:

			
What intervention will actually change outcomes?

Course Summary: The Essential Data Analyst → Data Scientist Toolkit

You have now covered:

Pandas Foundations
Data Cleaning
GroupBy and Aggregation
Data Integration and Joins
Data Visualization
Statistical Testing
Linear Regression
Logistic Regression
Poisson Regression
Negative Binomial Regression
Mixed Models
Random Forests
XGBoost
Clustering
Time Series Forecasting
Survival Analysis
Bayesian Modeling
Causal Inference

This covers a large portion of the statistical and machine learning toolkit used by practicing data analysts and data scientists in healthcare, supply chain, retail, finance, and technology.
