Why A/B Testing Exists

Suppose a company launches a new advertising campaign.

After the campaign:

$$Sales\ Before=1,000,000$$

$$Sales\ After=1,200,000$$

Management concludes:

“The campaign increased sales by $200,000.”

But did it?

Not necessarily.

Sales might have increased because:

Seasonality
Economic growth
Competitor actions
Random variation
Product improvements

One of the most common and costly mistakes in analytics is confusing:

Correlation

with

Causation

A/B testing exists to solve this problem.

Learning Objectives

By the end of this lesson, you should be able to:

Understand causality
Understand randomized experiments
Design A/B tests
Calculate lift
Perform hypothesis testing
Understand p-values
Understand statistical power
Measure incrementality
Apply experimentation in healthcare and supply chains

Correlation vs Causation

Suppose ice cream sales and drowning incidents both increase during summer.

The data may show:

$$Correlation(IceCream,Drowning)>0$$

Does ice cream cause drowning?

No.

A hidden variable (a confounder) drives both:

Temperature.

This illustrates why observational data alone is often misleading.
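The ice cream example can be reproduced with a small simulation. The sketch below uses invented numbers (the temperature distribution, slopes, and noise levels are all assumptions) in which temperature drives both series and neither affects the other. The raw correlation comes out clearly positive, while the correlation left after removing the temperature-driven part of each series should be close to zero.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)
# Hypothetical daily data: temperature drives both series.
temperature = [rng.gauss(20, 8) for _ in range(2000)]
ice_cream = [5.0 * t + rng.gauss(0, 20) for t in temperature]
drowning = [0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, drowning)
# Remove the part of each series explained by temperature, then correlate the rest.
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drowning, temperature))

print(f"Raw correlation: {raw_corr:.2f}")
print(f"Correlation after adjusting for temperature: {adjusted_corr:.2f}")
```

Adjusting works here only because the confounder was measured. In real observational data the hidden variable is often unknown, which is the motivation for experiments.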

The Goal of A/B Testing

We want to estimate:

$$Treatment\ Effect$$

which is:

$$Treatment\ Effect=Outcome_{Treatment}-Outcome_{Control}$$

The challenge is that we cannot observe both outcomes for the same person simultaneously.

The Fundamental Problem of Causal Inference

Suppose Customer A receives an advertisement.

We observe:

$$Sales_A=500$$

What would Customer A have spent without the advertisement?

We never observe that value.

This missing outcome is called the:

Counterfactual

The Solution: Randomization

Instead of studying one customer, we study many.

Randomly assign customers into:

Control Group

No treatment.

Treatment Group

Receives treatment.

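Randomization helps ensure the groups are statistically similar on average, in both measured and unmeasured characteristics, so a systematic difference in outcomes can be attributed to the treatment.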
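A short simulation can make the value of random assignment concrete. The sketch below assumes a made-up population in which every customer has a baseline spend and the treatment adds exactly 10 to it (`TRUE_EFFECT` and the spend distribution are hypothetical). It compares a targeted assignment, where high spenders get the treatment, with a coin-flip assignment.

```python
import random

rng = random.Random(1)
TRUE_EFFECT = 10.0  # hypothetical increase in spend caused by the treatment

# Baseline spend: what each customer would spend without treatment.
baseline = [rng.gauss(100, 30) for _ in range(20000)]

def mean(xs):
    return sum(xs) / len(xs)

def difference_in_means(assignments):
    """Observed treated-minus-control difference in mean spend."""
    treated = [b + TRUE_EFFECT for b, a in zip(baseline, assignments) if a]
    control = [b for b, a in zip(baseline, assignments) if not a]
    return mean(treated) - mean(control)

# Non-random assignment: only customers with above-average spend are treated.
targeted = [b > 100 for b in baseline]
# Random assignment: a coin flip that ignores every customer attribute.
randomized = [rng.random() < 0.5 for _ in baseline]

biased_estimate = difference_in_means(targeted)
randomized_estimate = difference_in_means(randomized)

print(f"True effect:          {TRUE_EFFECT:.1f}")
print(f"Targeted assignment:  {biased_estimate:.1f}")
print(f"Random assignment:    {randomized_estimate:.1f}")
```

Under targeting, the estimate mixes the treatment effect with the pre-existing gap between high and low spenders. Under randomization, the estimate should land close to the true effect.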

Example

Suppose:

Group	Customers
Control	5,000
Treatment	5,000

Treatment group receives an email campaign.

Control group does not.

After one month:

Control sales:

$$500,000$$

Treatment sales:

$$600,000$$

Estimating Lift

Lift measures the improvement generated by treatment.

Formula:

$$Lift=\frac{Treatment-Control}{Control}$$

Using the example:

$$Lift=\frac{600000-500000}{500000}=0.20$$

Therefore:

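$$Lift=20\%$$

Because both groups contain 5,000 customers, total sales can be compared directly; with unequal group sizes, compare sales per customer instead. The campaign is estimated to have increased sales by 20%.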

Absolute Lift

Sometimes we report:

$$Absolute\ Lift=Treatment-Control$$

Example:

$$Absolute\ Lift=600000-500000=100000$$

Meaning:

The campaign generated:

$$\$100,000$$

of additional sales.

Randomized Controlled Trials

A/B tests are essentially randomized controlled trials.

Widely used in:

Marketing
Medicine
Technology
Product Analytics

Healthcare Example

Suppose researchers test a new treatment.

Control Group:

Current treatment.

Treatment Group:

New treatment.

Recovery rates:

$$Recovery_C=70\%$$

$$Recovery_T=80\%$$

Lift:

$$Lift=\frac{0.80-0.70}{0.70}\approx14.3\%$$

Null Hypothesis

Statistics begins with a skeptical position.

Null hypothesis:

$$H_0:\mu_T=\mu_C$$

Meaning:

No treatment effect exists.

Alternative Hypothesis

Alternative hypothesis:

$$H_A:\mu_T\neq\mu_C$$

Meaning:

A treatment effect exists.

Why We Need Hypothesis Testing

Suppose:

Control conversion rate:

$$5\%$$

Treatment conversion rate:

$$5.2\%$$

Is the increase real?

Or random noise?

Hypothesis testing helps answer this.

Test Statistic

A simple z-statistic is:

$$Z=\frac{\hat p_T-\hat p_C}{SE}$$

where:

$$\hat p_T$$ = treatment conversion rate
$$\hat p_C$$ = control conversion rate
$$SE$$ = standard error
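The lesson does not say how the standard error is computed. One common choice for conversion rates, shown here as an assumption rather than the only option, is the pooled standard error. The symbols $$x_T$$, $$x_C$$ (conversions) and $$n_T$$, $$n_C$$ (group sizes) are introduced for this illustration:

$$\hat p=\frac{x_T+x_C}{n_T+n_C}$$

$$SE=\sqrt{\hat p(1-\hat p)\left(\frac{1}{n_T}+\frac{1}{n_C}\right)}$$

Applied to the 5% versus 5.2% example, and assuming 10,000 customers per group (a figure not given in that example):

$$\hat p=\frac{520+500}{20000}=0.051$$

$$SE=\sqrt{0.051\times0.949\times\frac{2}{10000}}\approx0.00311$$

$$Z=\frac{0.052-0.050}{0.00311}\approx0.64$$

A value of 0.64 is well below the 1.96 needed for significance at the 5% level in a two-sided test, so under these assumptions the difference could easily be noise.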

P-Values

The p-value measures how surprising the observed result would be if:

$$H_0$$

were true.

Small p-values suggest evidence against the null hypothesis.

Typical threshold:

$$p<0.05$$

Common Misunderstanding

A p-value is not:

$$P(H_0\ True)$$

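It is the probability of seeing data at least as extreme as what was observed, assuming the null hypothesis is true:

$$P(Data\ at\ least\ as\ extreme\ as\ observed\ |\ H_0)$$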

This distinction is important.
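One way to see what a p-value does and does not say is to run many A/A tests, where both groups share the same true conversion rate and the null hypothesis is therefore true by construction. The sketch below is a minimal simulation with assumed settings (`N`, `RATE`, and `RUNS` are hypothetical); it uses a pooled two-proportion z-test, which is one common choice.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(x_t, n_t, x_c, n_c):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x_t + x_c) / (n_t + n_c)
    se = (pooled * (1 - pooled) * (1 / n_t + 1 / n_c)) ** 0.5
    if se == 0:
        return 1.0  # degenerate data: no evidence either way
    z = (x_t / n_t - x_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2)
N, RATE, RUNS = 500, 0.10, 2000  # hypothetical group size, rate, repetitions

def conversions(n, rate):
    """Number of converting customers among n, each converting with prob. rate."""
    return sum(rng.random() < rate for _ in range(n))

# A/A tests: identical true rates, so every "significant" result is a false alarm.
p_values = [
    two_proportion_p_value(conversions(N, RATE), N, conversions(N, RATE), N)
    for _ in range(RUNS)
]
false_positive_rate = sum(p < 0.05 for p in p_values) / RUNS
print(f"Share of A/A tests with p < 0.05: {false_positive_rate:.3f}")
```

The share should come out near 0.05: with no effect at all, roughly one test in twenty still crosses the threshold. That is a statement about the data given the null hypothesis, not about the probability that the null hypothesis is true.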

Confidence Intervals

Instead of a single estimate, we often compute an interval.

Example:

$$Lift=5\%$$

95% confidence interval:

$$[2\%,8\%]$$

Interpretation:

The true lift is plausibly between:

$$2\%$$

and

$$8\%$$
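As a rough illustration of how such an interval can be computed, the sketch below builds a Wald (normal-approximation) interval for the absolute difference in conversion rates, which is simpler than an interval for relative lift. The counts (520 and 500 conversions out of 10,000 each) and the function name are assumptions chosen for the example.

```python
from statistics import NormalDist

def diff_confidence_interval(x_t, n_t, x_c, n_c, level=0.95):
    """Wald interval for the difference in conversion rates (treatment - control)."""
    p_t, p_c = x_t / n_t, x_c / n_c
    # Unpooled standard error of the difference between two proportions.
    se = (p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    diff = p_t - p_c
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(520, 10_000, 500, 10_000)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

With these counts the interval runs from roughly -0.4 to +0.8 percentage points. Because it includes zero, the data are compatible with no effect at all.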

Statistical Power

Power measures:

The probability of detecting a real effect.

Formula:

$$Power=P(Reject\ H_0\ |\ H_A\ True)$$

Typical target:

$$80\%$$

to

$$90\%$$

Why Power Matters

Suppose:

Actual lift:

$$5\%$$

Sample size:

$$n=50$$

The experiment may miss the effect entirely.

Low-powered studies often produce unreliable conclusions.
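Power can be estimated by simulation: generate many experiments in which the effect is real and count how often the test detects it. The sketch below assumes a 20% baseline conversion rate and a 5% relative lift (so 21% in treatment), with 50 customers per group. The baseline rate, the test, and the function names are assumptions made for the example.

```python
import random
from statistics import NormalDist

def is_significant(x_t, x_c, n, alpha=0.05):
    """Pooled two-proportion z-test for two groups of equal size n."""
    pooled = (x_t + x_c) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return False
    z = (x_t - x_c) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def simulated_power(p_control, p_treatment, n, runs=2000, seed=3):
    """Share of simulated experiments that reject H0 when the effect is real."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        x_c = sum(rng.random() < p_control for _ in range(n))
        x_t = sum(rng.random() < p_treatment for _ in range(n))
        hits += is_significant(x_t, x_c, n)
    return hits / runs

power_small = simulated_power(0.20, 0.21, n=50)
print(f"Estimated power with n=50 per group: {power_small:.2f}")
```

Under these assumptions the estimated power should be far below the usual 80% target, so an experiment of this size would miss the real effect most of the time.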

Sample Size Calculation

Larger sample sizes produce:

Smaller standard errors
Narrower confidence intervals
More reliable conclusions

Approximate relationship:

$$SE\propto\frac{1}{\sqrt n}$$

Doubling sample size does not halve uncertainty.

Instead:

The standard error shrinks in proportion to $$1/\sqrt n$$, so halving it requires four times the sample size.
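A common textbook approximation turns this relationship into a required sample size for comparing two conversion rates. The sketch below is one such approximation for a two-sided z-test, not the only formula in use; the function name, the 4% and 5% rates, and the defaults of 5% significance and 80% power are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate customers per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

n_needed = sample_size_per_group(0.04, 0.05)
print(f"Customers needed per group: {n_needed}")
```

Under these assumptions the answer is several thousand customers per group. Because the effect size is squared in the denominator, detecting an effect half as large needs roughly four times as many customers.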

Incrementality

Incrementality is one of the most important concepts in marketing analytics.

Incrementality asks:

What would have happened anyway?

Suppose:

Observed sales:

$$1,500,000$$

Estimated baseline sales:

$$1,200,000$$

Incremental sales:

$$Incremental=1,500,000-1,200,000$$

$$Incremental=300,000$$

Only the incremental portion is credited to marketing.

Why Incrementality Matters

Suppose:

Revenue attributed to the campaign:

$$500,000$$

Incremental revenue:

$$100,000$$

The campaign did not generate:

$$500,000$$

of value.

It generated:

$$100,000$$

of value.

This distinction keeps companies from paying for sales that would have happened anyway.

Geo Experiments

Sometimes individual randomization is impossible.

Alternative:

Randomize regions.

Example:

Region	Treatment
Alberta	Yes
Ontario	No
BC	Yes
Manitoba	No

Compare outcomes across regions.

This is called:

Geo Testing
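The assignment of regions should itself be random rather than hand-picked. The sketch below shows one minimal way to do that for the four regions in the table; the function name and the seed are hypothetical, and a real geo test would usually involve many more regions.

```python
import random

regions = ["Alberta", "Ontario", "BC", "Manitoba"]

def assign_regions(regions, seed):
    """Randomly split regions into equal-sized treatment and control sets."""
    rng = random.Random(seed)
    shuffled = rng.sample(regions, k=len(regions))  # shuffled copy; input untouched
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

assignment = assign_regions(regions, seed=42)
print(assignment)
```

Fixing the seed makes the assignment reproducible, so the split can be audited later.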

Holdout Groups

A holdout group never receives treatment.

Example:

Group	Customers
Treatment	90,000
Holdout	10,000

The holdout provides an estimate of the counterfactual.

Many large technology companies use holdout experiments continuously.
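Because the two groups differ in size, totals cannot be compared directly; the comparison has to be made per customer. The sketch below uses the group sizes from the table together with revenue totals that are invented purely for illustration.

```python
treatment_customers, holdout_customers = 90_000, 10_000
# Hypothetical revenue totals, invented for illustration only.
treatment_revenue, holdout_revenue = 4_950_000, 500_000

treatment_per_customer = treatment_revenue / treatment_customers
holdout_per_customer = holdout_revenue / holdout_customers

# Holdout revenue per customer estimates what treated customers would have
# spent without the campaign (the counterfactual baseline).
baseline_revenue = holdout_per_customer * treatment_customers
incremental_revenue = treatment_revenue - baseline_revenue
lift = incremental_revenue / baseline_revenue

print(f"Revenue per customer: {treatment_per_customer:.2f} vs {holdout_per_customer:.2f}")
print(f"Incremental revenue: {incremental_revenue:,.0f}")
print(f"Lift: {lift:.1%}")
```

With these made-up figures, the treated group earns 55 per customer against 50 in the holdout, which scales to 450,000 of incremental revenue, a 10% lift.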

Supply Chain Example

Suppose a wholesaler tests a new inventory recommendation system.

Treatment retailers:

Receive recommendations.

Control retailers:

Business as usual.

Results:

Control sales:

$$400,000$$

Treatment sales:

$$500,000$$

Lift:

$$Lift=\frac{500000-400000}{400000}=25\%$$

The recommendation system appears effective.

Bayesian A/B Testing

Instead of p-values, Bayesian methods estimate:

$$P(Treatment\ Better\ Than\ Control\ |\ Data)$$

Example:

$$P(T>C|Data)=0.97$$

Interpretation:

There is a 97% probability treatment outperforms control.

Many practitioners find this interpretation more intuitive.
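One simple way to estimate this probability for conversion rates is a Beta-Binomial model with Monte Carlo sampling. The sketch below assumes uniform Beta(1, 1) priors and uses counts of 520 and 500 conversions out of 10,000 each; the prior, the counts, and the function name are illustrative assumptions, not a prescribed method.

```python
import random

def prob_treatment_better(x_t, n_t, x_c, n_c, draws=20_000, seed=4):
    """Monte Carlo estimate of P(p_T > p_C | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        p_t = rng.betavariate(1 + x_t, 1 + n_t - x_t)
        p_c = rng.betavariate(1 + x_c, 1 + n_c - x_c)
        wins += p_t > p_c
    return wins / draws

probability = prob_treatment_better(520, 10_000, 500, 10_000)
print(f"P(treatment > control | data) = {probability:.2f}")
```

With these counts the probability should land around three in four: treatment is more likely better than not, but far from the 97% in the example above. The result depends on the chosen prior, which matters most when data are scarce.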

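Python Example: Lift Calculation

# Total sales in each group (equal group sizes assumed)
control = 500000
treatment = 600000
# Relative lift: change in sales as a fraction of the control baseline
lift = (treatment - control) / control
print(f"Lift: {lift:.2%}")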

Output:

Lift: 20.00%

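Python Example: Two-Proportion Z-Test

from statsmodels.stats.proportion import proportions_ztest

# Conversions and group sizes, listed as [treatment, control]
successes = [520, 500]
observations = [10000, 10000]
# Two-sided test of H0: both groups share the same conversion rate
z_stat, p_value = proportions_ztest(successes, observations)
print(f"z-statistic: {z_stat:.3f}")
print(f"p-value: {p_value:.3f}")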

Complete Business Example

Suppose:

Control Group:

$$10000$$

customers

Conversion rate:

$$4\%$$

Treatment Group:

$$10000$$

customers

Conversion rate:

$$5\%$$

Absolute lift:

$$5\%-4\%=1\ percentage\ point$$

Relative lift:

$$\frac{5\%-4\%}{4\%}=25\%$$

This means the campaign improved conversions by:

$$25\%$$

relative to baseline.
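The example stops at the lift, so it is worth checking whether a difference of this size could be noise. The sketch below applies a pooled two-proportion z-test using only the standard library; the conversion counts of 400 and 500 follow from the stated rates and group sizes, and the choice of test is an assumption.

```python
from statistics import NormalDist

n_control = n_treatment = 10_000
conv_control = 400    # 4% of 10,000
conv_treatment = 500  # 5% of 10,000

p_c = conv_control / n_control
p_t = conv_treatment / n_treatment
# Pooled rate under H0 (no difference between groups).
pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
se = (pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment)) ** 0.5

z = (p_t - p_c) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Here z is about 3.4 and the p-value is well below 0.05, so with 10,000 customers per group a one-point difference is unlikely to be random variation alone.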

Key Takeaways

A/B testing is the foundation of causal inference in business.

The most important formulas introduced today are:

$$Treatment\ Effect=Outcome_T-Outcome_C$$

$$Lift=\frac{Treatment-Control}{Control}$$

$$H_0:\mu_T=\mu_C$$

$$H_A:\mu_T\neq\mu_C$$

$$Z=\frac{\hat p_T-\hat p_C}{SE}$$

$$Power=P(Reject\ H_0\ |\ H_A\ True)$$

$$SE\propto\frac{1}{\sqrt n}$$

Modern experimentation helps organizations measure:

Advertising effectiveness
Product improvements
Website optimization
Medical interventions
Inventory strategies

Most importantly, it helps distinguish:

Correlation

from

Causation

Exercises

Calculate lift when treatment sales are $750,000 and control sales are $600,000.
Explain why randomization is important.
What is the difference between correlation and causation?
Why do we need holdout groups?
Explain incrementality to a marketing manager.


Next Lesson

Lesson 8: Causal Inference for Data Scientists — Potential Outcomes, Propensity Scores, Difference-in-Differences, and Synthetic Controls, where we learn how to estimate causal effects when randomized A/B tests are impossible.

