Why A/B Testing Exists
Suppose a company launches a new advertising campaign.
After the campaign:
$$Sales\ Before=1,000,000$$
$$Sales\ After=1,200,000$$
Management concludes:
“The campaign increased sales by $200,000.”
But did it?
Not necessarily.
Sales might have increased because:
- Seasonality
- Economic growth
- Competitor actions
- Random variation
- Product improvements
One of the most common and costly mistakes in analytics is confusing:
Correlation
with
Causation
A/B testing exists to solve this problem.
Learning Objectives
By the end of this lesson, you should be able to:
- Understand causality
- Understand randomized experiments
- Design A/B tests
- Calculate lift
- Perform hypothesis testing
- Understand p-values
- Understand statistical power
- Measure incrementality
- Apply experimentation in healthcare and supply chains
Correlation vs Causation
Suppose ice cream sales and drowning incidents both increase during summer.
The data may show:
$$Correlation(IceCream,Drowning)>0$$
Does ice cream cause drowning?
No.
A hidden variable exists:
Temperature.
This illustrates why observational data alone is often misleading.
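The hidden-variable effect can be reproduced with a small simulation. This is only a sketch: the temperature range, the coefficients, and the noise levels are made-up values, and `pearson` is a helper written for this example. Ice cream sales and drownings are both generated from temperature and never from each other, yet they come out positively correlated.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical daily temperatures (degrees Celsius) over one year.
temperature = [rng.uniform(0, 35) for _ in range(365)]

# Both outcomes depend on temperature plus independent noise.
# Neither outcome depends on the other.
ice_cream = [20 * t + rng.gauss(0, 100) for t in temperature]
drownings = [0.1 * t + rng.gauss(0, 1) for t in temperature]

print(f"corr(ice cream, drownings) = {pearson(ice_cream, drownings):.2f}")
```

The correlation is clearly positive even though the code contains no causal link between the two series.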
The Goal of A/B Testing
We want to estimate:
$$Treatment\ Effect$$
which is:
$$Treatment\ Effect=Outcome_{Treatment}-Outcome_{Control}$$
The challenge is that we cannot observe both outcomes for the same person simultaneously.
The Fundamental Problem of Causal Inference
Suppose Customer A receives an advertisement.
We observe:
$$Sales_A=500$$
What would Customer A have spent without the advertisement?
We never observe that value.
This missing outcome is called the:
Counterfactual
The Solution: Randomization
Instead of studying one customer, we study many.
Randomly assign customers into:
Control Group
No treatment.
Treatment Group
Receives treatment.
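Randomization helps ensure the groups are statistically similar on average, in both measured and unmeasured characteristics, so a difference in outcomes can be attributed to the treatment.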
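A minimal sketch of random assignment, assuming 10,000 hypothetical customers with a made-up `prior_spend` value each. After a shuffle and an even split, the two groups should have nearly the same average prior spend, which is the kind of balance randomization is meant to deliver.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical customers, each with a pre-experiment monthly spend.
customers = [{"id": i, "prior_spend": rng.gauss(100, 30)} for i in range(10_000)]

# Shuffle, then split in half: every customer has the same chance
# of landing in either group.
shuffled = customers[:]
rng.shuffle(shuffled)
control = shuffled[:5_000]
treatment = shuffled[5_000:]

def mean_prior_spend(group):
    """Average pre-experiment spend of a group of customers."""
    return sum(c["prior_spend"] for c in group) / len(group)

print(f"control mean prior spend:   {mean_prior_spend(control):.2f}")
print(f"treatment mean prior spend: {mean_prior_spend(treatment):.2f}")
```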
Example
Suppose:
| Group | Customers |
|---|---|
| Control | 5,000 |
| Treatment | 5,000 |
Treatment group receives an email campaign.
Control group does not.
After one month:
Control sales:
$$500,000$$
Treatment sales:
$$600,000$$
Estimating Lift
Lift measures the improvement generated by treatment.
Formula:
$$Lift=\frac{Treatment-Control}{Control}$$
Using the example:
$$Lift=\frac{600000-500000}{500000}=0.20$$
Therefore:
$$Lift=20\%$$
The campaign increased sales by 20%.
Absolute Lift
Sometimes we report:
$$Absolute\ Lift=Treatment-Control$$
Example:
$$Absolute\ Lift=600000-500000=100000$$
Meaning:
The campaign generated:
$$\$100,000$$
of additional sales.
Randomized Controlled Trials
A/B tests are essentially randomized controlled trials.
Widely used in:
- Marketing
- Medicine
- Technology
- Product Analytics
Healthcare Example
Suppose researchers test a new treatment.
Control Group:
Current treatment.
Treatment Group:
New treatment.
Recovery rates:
$$Recovery_C=70\%$$
$$Recovery_T=80\%$$
Lift:
$$Lift=\frac{0.80-0.70}{0.70}\approx 14.3\%$$
Null Hypothesis
Statistics begins with a skeptical position.
Null hypothesis:
$$H_0:\mu_T=\mu_C$$
Meaning:
No treatment effect exists.
Alternative Hypothesis
Alternative hypothesis:
$$H_A:\mu_T\neq\mu_C$$
Meaning:
A treatment effect exists.
Why We Need Hypothesis Testing
Suppose:
Control conversion rate:
$$5\%$$
Treatment conversion rate:
$$5.2\%$$
Is the increase real?
Or random noise?
Hypothesis testing helps answer this.
Test Statistic
A simple z-statistic is:
$$Z=\frac{\hat p_T-\hat p_C}{SE}$$
where:
- $$\hat p_T$$ = treatment conversion rate
- $$\hat p_C$$ = control conversion rate
- $$SE$$ = standard error
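The lesson does not say how the standard error is computed. One common choice for comparing two conversion rates is the pooled standard error, which is what the sketch below assumes. The function name `two_proportion_z` and the sample size of 10,000 per group are assumptions made for illustration; the rates are the 5% and 5.2% from the example above.

```python
from math import sqrt

def two_proportion_z(conv_t, n_t, conv_c, n_c):
    """z-statistic for H0: p_T = p_C, using the pooled standard error."""
    p_t = conv_t / n_t
    p_c = conv_c / n_c
    # Under H0 both groups share one rate, so pool the conversions.
    p_pool = (conv_t + conv_c) / (n_t + n_c)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se

# 5.2% vs 5.0% conversion, assuming 10,000 customers per group.
z = two_proportion_z(conv_t=520, n_t=10_000, conv_c=500, n_c=10_000)
print(f"z = {z:.3f}")
```

With these assumed sample sizes the statistic is well below 1.96, so a difference of 0.2 percentage points would not be distinguishable from noise at the usual 5% level.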
P-Values
The p-value measures how surprising the observed result would be if:
$$H_0$$
were true.
Small p-values suggest evidence against the null hypothesis.
Typical threshold:
$$p<0.05$$
Common Misunderstanding
A p-value is not:
$$P(H_0\ True)$$
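It is the probability of seeing data at least as extreme as the observed result, assuming the null hypothesis is true:
$$P(Data\ at\ least\ this\ extreme\ |\ H_0)$$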
This distinction is important.
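For a z-statistic, the two-sided p-value is the probability that a standard normal variable lands at least as far from zero as the observed value. A minimal sketch using only the Python standard library; `two_sided_p_value` is a name chosen for this example, and the two z-values are illustrative inputs.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A z-statistic of 1.96 sits right at the conventional 0.05 threshold.
print(f"p-value for z = 1.96: {two_sided_p_value(1.96):.3f}")

# A small z-statistic gives a large p-value: weak evidence against H0.
print(f"p-value for z = 0.64: {two_sided_p_value(0.64):.3f}")
```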
Confidence Intervals
Instead of a single estimate, we often compute an interval.
Example:
$$Lift=5\%$$
95% confidence interval:
$$[2\%,8\%]$$
Interpretation:
The true lift is plausibly between:
$$2\%$$
and
$$8\%$$
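One way to obtain such an interval is the normal-approximation (Wald) interval for the difference between two conversion rates. The sketch below computes an interval for the absolute difference rather than the relative lift, and the counts (500 of 10,000 versus 400 of 10,000) are assumed for illustration. `diff_confidence_interval` is a helper written for this example.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_t, n_t, conv_c, n_c, level=0.95):
    """Normal-approximation (Wald) interval for p_T - p_C."""
    p_t = conv_t / n_t
    p_c = conv_c / n_c
    # Unpooled standard error: each group keeps its own rate.
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    diff = p_t - p_c
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(500, 10_000, 400, 10_000)
print(f"95% CI for the absolute lift: [{low:.2%}, {high:.2%}]")
```

For these assumed counts the interval runs from roughly 0.4 to 1.6 percentage points and excludes zero.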
Statistical Power
Power measures:
The probability of detecting a real effect.
Formula:
$$Power=P(Reject\ H_0\ |\ H_A\ True)$$
Typical target:
$$80\%$$
or
$$90\%$$
Why Power Matters
Suppose:
Actual lift:
$$5\%$$
Sample size:
$$n=50$$
The experiment may miss the effect entirely.
Low-powered studies often produce unreliable conclusions.
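Power can be approximated with the normal distribution. The sketch below assumes a two-sided two-proportion z-test, a baseline conversion rate of 10%, and a 5% relative lift (10% to 10.5%). The baseline is not given in the lesson, so it is an assumption, as is the helper name `approx_power`.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_c, p_t, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sqrt(p_c * (1 - p_c) / n_per_group + p_t * (1 - p_t) / n_per_group)
    shift = (p_t - p_c) / se  # expected z-statistic when the effect is real
    # Probability that the statistic falls in either rejection region.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

small = approx_power(p_c=0.10, p_t=0.105, n_per_group=50)
large = approx_power(p_c=0.10, p_t=0.105, n_per_group=100_000)
print(f"power with n = 50 per group:      {small:.1%}")
print(f"power with n = 100,000 per group: {large:.1%}")
```

Under these assumptions, 50 customers per group gives power barely above the 5% false-positive rate, while 100,000 per group exceeds the usual 80% to 90% targets.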
Sample Size Calculation
Larger sample sizes produce:
- Smaller variance
- Narrower confidence intervals
- More reliable conclusions
Approximate relationship:
$$SE\propto\frac{1}{\sqrt n}$$
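Doubling the sample size does not halve uncertainty.
Instead:
The standard error shrinks with the square root of the sample size, so doubling n reduces it by a factor of about 0.71, and halving it requires four times as many observations.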
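The square-root relationship can be checked numerically, and the same ingredients give a rough sample size estimate. This is a sketch under assumptions: the standard normal-approximation formula for two proportions, a 5% significance level, 80% power, and the 5% versus 5.2% conversion rates used earlier. Both helper names are invented for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist

def standard_error(p, n):
    """Standard error of a conversion rate p estimated from n customers."""
    return sqrt(p * (1 - p) / n)

def required_n_per_group(p_c, p_t, alpha=0.05, power=0.80):
    """Approximate customers needed per group to detect p_t vs p_c."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    variance = p_c * (1 - p_c) + p_t * (1 - p_t)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_t - p_c) ** 2)

# Doubling n shrinks the standard error by 1/sqrt(2); quadrupling halves it.
ratio_double = standard_error(0.05, 2_000) / standard_error(0.05, 1_000)
ratio_quadruple = standard_error(0.05, 4_000) / standard_error(0.05, 1_000)
print(f"SE ratio after doubling n:    {ratio_double:.3f}")
print(f"SE ratio after quadrupling n: {ratio_quadruple:.3f}")

n_needed = required_n_per_group(p_c=0.05, p_t=0.052)
print(f"customers needed per group: {n_needed:,}")
```

Small differences need large samples: under these assumptions, detecting a move from 5% to 5.2% takes on the order of a couple of hundred thousand customers per group.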
Incrementality
One of the most important concepts in marketing analytics.
Incrementality asks:
What would have happened anyway?
Suppose:
Observed sales:
$$1,500,000$$
Estimated baseline sales:
$$1,200,000$$
Incremental sales:
$$Incremental=1,500,000-1,200,000$$
$$Incremental=300,000$$
Only the incremental portion is credited to marketing.
Why Incrementality Matters
Suppose:
Campaign revenue:
$$500,000$$
Incremental revenue:
$$100,000$$
The campaign did not generate:
$$500,000$$
of value.
It generated:
$$100,000$$
of value.
This distinction keeps companies from over-investing in campaigns that mostly capture sales that would have happened anyway.
Geo Experiments
Sometimes individual randomization is impossible.
Alternative:
Randomize regions.
Example:
| Region | Treatment |
|---|---|
| Alberta | Yes |
| Ontario | No |
| BC | Yes |
| Manitoba | No |
Compare outcomes across regions.
This is called:
Geo Testing
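Region assignment should itself be random rather than hand-picked. A minimal sketch, assuming the four regions from the table and an even split; the seed is arbitrary, so the regions selected here will not necessarily match the table.

```python
import random

regions = ["Alberta", "Ontario", "BC", "Manitoba"]

rng = random.Random(7)  # fixed seed so the assignment can be reproduced

# Pick half of the regions at random to receive the treatment.
treated = set(rng.sample(regions, k=len(regions) // 2))
assignment = {region: region in treated for region in regions}

for region, is_treated in assignment.items():
    print(f"{region}: {'Yes' if is_treated else 'No'}")
```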
Holdout Groups
A holdout group never receives treatment.
Example:
| Group | Customers |
|---|---|
| Treatment | 90,000 |
| Holdout | 10,000 |
The holdout acts as the counterfactual.
Many large technology companies use holdout experiments continuously.
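Because the holdout is smaller than the treatment group, its sales have to be scaled before they can serve as the counterfactual. The sketch below uses the group sizes from the table; the sales totals are hypothetical, and `incremental_from_holdout` is a helper written for this example.

```python
def incremental_from_holdout(treated_sales, n_treated, holdout_sales, n_holdout):
    """Estimate incremental sales by scaling holdout behaviour to the treated group."""
    baseline_per_customer = holdout_sales / n_holdout
    # What the treated customers would have spent without the treatment,
    # if they had behaved like the holdout.
    expected_without_treatment = baseline_per_customer * n_treated
    return treated_sales - expected_without_treatment

# Hypothetical totals: 90,000 treated customers and 10,000 holdout customers.
incremental = incremental_from_holdout(
    treated_sales=1_080_000, n_treated=90_000,
    holdout_sales=100_000, n_holdout=10_000,
)
print(f"Estimated incremental sales: {incremental:,.0f}")
```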
Supply Chain Example
Suppose a wholesaler tests a new inventory recommendation system.
Treatment retailers:
Receive recommendations.
Control retailers:
Business as usual.
Results:
Control sales:
$$400,000$$
Treatment sales:
$$500,000$$
Lift:
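$$Lift=\frac{500000-400000}{400000}=25\%$$
The recommendation system appears effective, provided the two groups contain comparable numbers of retailers and the difference is larger than what random variation would produce.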
Bayesian A/B Testing
Instead of p-values, Bayesian methods estimate:
$$P(Treatment\ Better\ Than\ Control\ |\ Data)$$
Example:
$$P(T>C|Data)=0.97$$
Interpretation:
There is a 97% probability treatment outperforms control.
Many practitioners find this interpretation more intuitive.
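For conversion rates, one simple way to estimate this probability is a Beta-Binomial model with Monte Carlo sampling. The sketch below assumes a uniform Beta(1, 1) prior for each group and uses illustrative counts of 520 and 500 conversions out of 10,000. The function name, the number of draws, and the seed are choices made for this example.

```python
import random

def prob_treatment_better(conv_t, n_t, conv_c, n_c, draws=20_000, seed=0):
    """Monte Carlo estimate of P(p_T > p_C | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        if p_t > p_c:
            wins += 1
    return wins / draws

prob = prob_treatment_better(conv_t=520, n_t=10_000, conv_c=500, n_c=10_000)
print(f"P(T > C | data) is approximately {prob:.2f}")
```

With these counts the estimate is well short of 0.97, which matches the intuition that 5.2% versus 5% on 10,000 customers per group is weak evidence.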
Python Example: Lift Calculation
```python
control = 500000
treatment = 600000

# Relative lift: improvement as a fraction of the control value
lift = (treatment - control) / control
print(f"Lift: {lift:.2%}")
```
Output:
Lift: 20.00%
Python Example: Two-Proportion Z-Test
```python
from statsmodels.stats.proportion import proportions_ztest

# Conversions and group sizes, assuming the order [treatment, control]
successes = [520, 500]
observations = [10000, 10000]

z_stat, p_value = proportions_ztest(successes, observations)
print(z_stat)
print(p_value)
```
Complete Business Example
Suppose:
Control Group:
$$10000$$
customers
Conversion rate:
$$4\%$$
Treatment Group:
$$10000$$
customers
Conversion rate:
$$5\%$$
Absolute lift:
$$5\%-4\%=1\%$$
Relative lift:
$$\frac{5\%-4\%}{4\%}=25\%$$
This means the campaign improved conversions by:
$$25\%$$
relative to baseline.
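The example reports lift but not whether the difference could be noise. A sketch of that check, assuming a two-sided two-proportion z-test with a pooled standard error and using only the standard library. The conversion counts of 400 and 500 follow from 4% and 5% of 10,000 customers.

```python
from math import sqrt
from statistics import NormalDist

n_c, n_t = 10_000, 10_000
conv_c, conv_t = 400, 500  # 4% and 5% of 10,000 customers

p_c, p_t = conv_c / n_c, conv_t / n_t
absolute_lift = p_t - p_c
relative_lift = absolute_lift / p_c

# Pooled standard error under H0: both groups share one conversion rate.
p_pool = (conv_c + conv_t) / (n_c + n_t)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

z = absolute_lift / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"absolute lift: {absolute_lift:.2%}")
print(f"relative lift: {relative_lift:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Here the p-value falls well below 0.05, so under these assumptions the one-point improvement would be hard to explain by chance alone.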
Key Takeaways
A/B testing is the foundation of causal inference in business.
The most important formulas introduced today are:
$$Treatment\ Effect=Outcome_T-Outcome_C$$
$$Lift=\frac{Treatment-Control}{Control}$$
$$H_0:\mu_T=\mu_C$$
$$H_A:\mu_T\neq\mu_C$$
$$Z=\frac{\hat p_T-\hat p_C}{SE}$$
$$Power=P(Reject\ H_0\ |\ H_A\ True)$$
$$SE\propto\frac{1}{\sqrt n}$$
Modern experimentation helps organizations measure:
- Advertising effectiveness
- Product improvements
- Website optimization
- Medical interventions
- Inventory strategies
Most importantly, it helps distinguish:
Correlation
from
Causation
Exercises
- Calculate lift when treatment sales are $750,000 and control sales are $600,000.
- Explain why randomization is important.
- What is the difference between correlation and causation?
- Why do we need holdout groups?
- Explain incrementality to a marketing manager.
Next Lesson
Lesson 8: Causal Inference for Data Scientists — Potential Outcomes, Propensity Scores, Difference-in-Differences, and Synthetic Controls, where we learn how to estimate causal effects when randomized A/B tests are impossible.
