Lesson 7: A/B Testing and Incrementality Measurement — Proving That Marketing Actually Works

Why A/B Testing Exists

Suppose a company launches a new advertising campaign.

After the campaign:

$$Sales\ Before=1,000,000$$

$$Sales\ After=1,200,000$$

Management concludes:

“The campaign increased sales by $200,000.”

But did it?

Not necessarily.

Sales might have increased because:

  • Seasonality
  • Economic growth
  • Competitor actions
  • Random variation
  • Product improvements

The biggest mistake in analytics is confusing:

Correlation

with

Causation

A/B testing exists to solve this problem.


Learning Objectives

By the end of this lesson, you should be able to:

  • Understand causality
  • Understand randomized experiments
  • Design A/B tests
  • Calculate lift
  • Perform hypothesis testing
  • Understand p-values
  • Understand statistical power
  • Measure incrementality
  • Apply experimentation in healthcare and supply chains

Correlation vs Causation

Suppose ice cream sales and drowning incidents both increase during summer.

The data may show:

$$Correlation(IceCream,Drowning)>0$$

Does ice cream cause drowning?

No.

A hidden variable exists:

Temperature.

This illustrates why observational data alone is often misleading.
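The hidden-variable story can be sketched with a small simulation. All numbers below are made up for illustration: temperature drives both series, and neither series affects the other. The helper names `pearson` and `residuals` are hypothetical, written here only to keep the example free of external libraries.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of ys after a simple linear regression on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

# Hypothetical daily data: temperature is the only common cause.
temperature = [random.gauss(20, 8) for _ in range(5000)]
ice_cream = [2.0 * t + random.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 4) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))

print(f"Raw correlation: {raw_corr:.2f}")
print(f"Correlation after removing temperature: {adjusted_corr:.2f}")
```

In this simulated setting the raw correlation is clearly positive, while the correlation that remains after accounting for temperature is close to zero. Real data rarely lets us measure every hidden variable, which is why experiments are needed.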


The Goal of A/B Testing

We want to estimate:

$$Treatment\ Effect$$

which is:

$$Treatment\ Effect=Outcome_{Treatment}-Outcome_{Control}$$

The challenge is that we cannot observe both outcomes for the same person simultaneously.


The Fundamental Problem of Causal Inference

Suppose Customer A receives an advertisement.

We observe:

$$Sales_A=500$$

What would Customer A have spent without the advertisement?

We never observe that value.

This missing outcome is called the:

Counterfactual


The Solution: Randomization

Instead of studying one customer, we study many.

Randomly assign customers into:

Control Group

No treatment.

Treatment Group

Receives treatment.

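Because assignment is random, the two groups are similar on average in both observed and unobserved characteristics, so the control group's outcome estimates what the treatment group would have done without treatment.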
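A minimal simulation can show why this works. Here every simulated customer has both potential outcomes (something we never have in practice), and the true effect of 50 per customer is an assumption chosen purely for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 50.0  # assumed extra spend per customer, for illustration only
n_customers = 10_000

# Each simulated customer has two potential outcomes; in reality only one is seen.
spend_without_ad = [random.gauss(450, 100) for _ in range(n_customers)]
spend_with_ad = [y0 + TRUE_EFFECT for y0 in spend_without_ad]

# Random assignment: shuffle customer indices and treat the first half.
indices = list(range(n_customers))
random.shuffle(indices)
treated = set(indices[: n_customers // 2])

# We only observe the outcome that matches each customer's assignment.
treatment_obs = [spend_with_ad[i] for i in range(n_customers) if i in treated]
control_obs = [spend_without_ad[i] for i in range(n_customers) if i not in treated]

estimate = (sum(treatment_obs) / len(treatment_obs)
            - sum(control_obs) / len(control_obs))
print(f"True effect: {TRUE_EFFECT:.1f}, estimated effect: {estimate:.1f}")
```

The difference in group means lands close to the true effect even though no customer's counterfactual was ever observed.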


Example

Suppose:

| Group | Customers |
|---|---|
| Control | 5,000 |
| Treatment | 5,000 |

Treatment group receives an email campaign.

Control group does not.

After one month:

Control sales:

$$500,000$$

Treatment sales:

$$600,000$$


Estimating Lift

Lift measures the improvement generated by treatment.

Formula:

$$Lift=\frac{Treatment-Control}{Control}$$

Using the example:

$$Lift=\frac{600000-500000}{500000}=0.20$$

Therefore:

$$Lift=20\%$$

Because the two groups are the same size and were assigned at random, the campaign is estimated to have increased sales by 20%.


Absolute Lift

Sometimes we report:

$$Absolute\ Lift=Treatment-Control$$

Example:

$$Absolute\ Lift=600000-500000=100000$$

Meaning:

The campaign generated:

$$\$100,000$$

of additional sales.


Randomized Controlled Trials

A/B tests are essentially randomized controlled trials.

Widely used in:

  • Marketing
  • Medicine
  • Technology
  • Product Analytics

Healthcare Example

Suppose researchers test a new treatment.

Control Group:

Current treatment.

Treatment Group:

New treatment.

Recovery rates:

$$Recovery_C=70\%$$

$$Recovery_T=80\%$$

Lift:

$$Lift=\frac{0.80-0.70}{0.70}\approx 14.3\%$$


Null Hypothesis

Statistics begins with a skeptical position.

Null hypothesis:

$$H_0:\mu_T=\mu_C$$

Meaning:

No treatment effect exists.


Alternative Hypothesis

Alternative hypothesis:

$$H_A:\mu_T\neq\mu_C$$

Meaning:

A treatment effect exists.


Why We Need Hypothesis Testing

Suppose:

Control conversion rate:

$$5\%$$

Treatment conversion rate:

$$5.2\%$$

Is the increase real?

Or random noise?

Hypothesis testing helps answer this.


Test Statistic

A simple z-statistic is:

$$Z=\frac{\hat p_T-\hat p_C}{SE}$$

where:

  • $$\hat p_T$$ = treatment conversion rate
  • $$\hat p_C$$ = control conversion rate
  • $$SE$$ = standard error
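The formula leaves the standard error unspecified. One common choice when testing whether two conversion rates are equal is the pooled standard error, sketched below. The group size of 10,000 customers each is an assumption added for illustration, and `pooled_z` is a hypothetical helper name.

```python
import math

def pooled_z(x_t, n_t, x_c, n_c):
    """z-statistic for a difference in conversion rates, using a pooled SE."""
    p_t, p_c = x_t / n_t, x_c / n_c
    p_pool = (x_t + x_c) / (n_t + n_c)  # shared rate implied by H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se, se

# 5.2% vs 5.0% conversion, assuming 10,000 customers per group.
z, se = pooled_z(x_t=520, n_t=10_000, x_c=500, n_c=10_000)
print(f"SE = {se:.5f}, z = {z:.2f}")
```

With these assumed counts, z comes out at about 0.64, well below the 1.96 needed for significance at the 5% level in a two-sided test.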

P-Values

The p-value measures how surprising the observed result would be if:

$$H_0$$

were true.

Small p-values suggest evidence against the null hypothesis.

Typical threshold:

$$p<0.05$$


Common Misunderstanding

A p-value is not:

$$P(H_0\ True)$$

It is:

$$P(Data\ at\ least\ as\ extreme\ as\ observed\ |\ H_0)$$

This distinction is important.
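One way to make the definition concrete is to simulate many experiments in a world where the null hypothesis is true and count how often the gap between groups is at least as large as the one observed. The counts below (50 and 65 conversions out of 1,000 customers each) are hypothetical.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

n = 1_000                                # customers per group (hypothetical)
conv_control, conv_treatment = 50, 65    # assumed conversions: 5.0% vs 6.5%
observed_gap = abs(conv_treatment - conv_control)

# Under H0 both groups share one conversion rate; estimate it by pooling.
p_null = (conv_control + conv_treatment) / (2 * n)

def simulate_conversions(n, p):
    """Number of conversions among n customers, each converting with probability p."""
    return sum(random.random() < p for _ in range(n))

n_sims = 2_000
at_least_as_extreme = 0
for _ in range(n_sims):
    gap = abs(simulate_conversions(n, p_null) - simulate_conversions(n, p_null))
    if gap >= observed_gap:
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_sims
print(f"Simulated two-sided p-value: {p_value:.3f}")
```

The result is the share of null-world experiments that look at least as extreme as the real one. It says nothing directly about the probability that the null hypothesis itself is true.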


Confidence Intervals

Instead of a single estimate, we often compute an interval.

Example:

$$Lift=5\%$$

95% confidence interval:

$$[2\%,8\%]$$

Interpretation:

The true lift is plausibly between:

$$2\%$$

and

$$8\%$$
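As a rough sketch of how such an interval can be computed, the function below builds a normal-approximation (Wald) interval for the absolute difference in conversion rates. Note that this is an interval for the difference, not for relative lift, and the counts (400 and 500 conversions out of 10,000) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(x_t, n_t, x_c, n_c, level=0.95):
    """Wald confidence interval for p_T - p_C (normal approximation)."""
    p_t, p_c = x_t / n_t, x_c / n_c
    # Unpooled standard error: each group keeps its own variance.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_t - p_c
    return diff - z_crit * se, diff + z_crit * se

low, high = diff_confidence_interval(x_t=500, n_t=10_000, x_c=400, n_c=10_000)
print(f"95% CI for absolute lift: [{low:.4f}, {high:.4f}]")
```

For these assumed counts the interval runs from roughly 0.4 to 1.6 percentage points. Because it excludes zero, it agrees with a two-sided test at the 5% level.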


Statistical Power

Power measures:

The probability of detecting a real effect.

Formula:

$$Power=P(Reject\ H_0\ |\ H_A\ True)$$

Typical target:

$$80\%$$

or

$$90\%$$


Why Power Matters

Suppose:

Actual lift:

$$5\%$$

Sample size:

$$n=50$$

The experiment may miss the effect entirely.

Low-powered studies often produce unreliable conclusions.
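To see how little power such a test has, here is an approximate power calculation for a two-sided two-proportion z-test. The 5% baseline conversion rate is an assumption (the example above gives only the lift), and the normal approximation is crude at n = 50, so treat the small-sample figure as indicative only.

```python
import math
from statistics import NormalDist

def approx_power(p_c, p_t, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    se = math.sqrt(p_c * (1 - p_c) / n_per_group + p_t * (1 - p_t) / n_per_group)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(p_t - p_c) / se  # true effect measured in standard errors
    # Probability the test statistic lands in either rejection region.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Assumed baseline of 5% conversion with a 5% relative lift (5.00% -> 5.25%).
power_small = approx_power(0.05, 0.0525, n_per_group=50)
power_large = approx_power(0.05, 0.0525, n_per_group=200_000)
print(f"n=50: {power_small:.3f}   n=200,000: {power_large:.3f}")
```

Under these assumptions, power at n = 50 is barely above the 5% false-positive rate, while 200,000 customers per group gives power of roughly 95%.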


Sample Size Calculation

Larger sample sizes produce:

  • Smaller variance in the estimated effect
  • Narrower confidence intervals
  • More reliable conclusions

Approximate relationship:

$$SE\propto\frac{1}{\sqrt n}$$

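Doubling sample size does not halve uncertainty.

Instead:

The standard error shrinks in proportion to one over the square root of the sample size, so doubling the sample reduces it by only about 29%, and halving it requires four times as many observations.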
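The square-root relationship is what makes small effects expensive to detect. The sketch below uses a standard normal-approximation formula for the sample size per group in a two-proportion test. The 5% baseline, 5% relative lift, 5% significance level, and 80% power are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_c, p_t, alpha=0.05, power=0.80):
    """Approximate customers needed per group for a two-sided two-proportion test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    variance_sum = p_c * (1 - p_c) + p_t * (1 - p_t)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_t - p_c) ** 2)

# Assumed scenario: 5.00% baseline conversion, 5.25% with treatment.
n_needed = sample_size_per_group(0.05, 0.0525)
print(f"Customers needed per group: {n_needed:,}")

# How the standard error scales when the sample grows by a given factor.
for factor in (2, 4):
    print(f"{factor}x the sample -> SE multiplied by {1 / math.sqrt(factor):.2f}")
```

Under these assumptions the requirement is on the order of 120,000 customers per group, which is why tests of small lifts need large audiences.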


Incrementality

One of the most important concepts in marketing analytics.

Incrementality asks:

What would have happened anyway?

Suppose:

Observed sales:

$$1,500,000$$

Estimated baseline sales:

$$1,200,000$$

Incremental sales:

$$Incremental=1,500,000-1,200,000$$

$$Incremental=300,000$$

Only the incremental portion is credited to marketing.
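The same arithmetic as a small helper, which also reports the incremental share of observed sales. The function name `incremental_sales` is hypothetical.

```python
def incremental_sales(observed, baseline):
    """Return incremental sales and their share of observed sales."""
    incremental = observed - baseline
    return incremental, incremental / observed

incremental, share = incremental_sales(observed=1_500_000, baseline=1_200_000)
print(f"Incremental: {incremental:,} ({share:.0%} of observed sales)")
```

In this example only a fifth of observed sales is attributable to marketing. The quality of that number depends entirely on how well the baseline is estimated.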


Why Incrementality Matters

Suppose:

Campaign revenue:

$$500,000$$

Incremental revenue:

$$100,000$$

The campaign did not generate:

$$500,000$$

of value.

It generated:

$$100,000$$

of value.

Crediting the campaign with the full $500,000 would overstate its value fivefold, which is why this distinction matters when budgets are allocated.


Geo Experiments

Sometimes individual randomization is impossible.

Alternative:

Randomize regions.

Example:

| Region | Treatment |
|---|---|
| Alberta | Yes |
| Ontario | No |
| BC | Yes |
| Manitoba | No |

Compare outcomes across regions.

This is called:

Geo Testing
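A minimal sketch of region-level assignment, assuming an even split and a seeded random number generator so the assignment can be reproduced and audited. The helper `assign_regions` is hypothetical.

```python
import random

def assign_regions(regions, seed=None):
    """Randomly split regions into equally sized treatment and control sets."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(regions)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

assignment = assign_regions(["Alberta", "Ontario", "BC", "Manitoba"], seed=42)
print(assignment)
```

With only a handful of regions, random assignment cannot be relied on to balance the groups, so geo tests usually need many regions or careful matching on pre-period sales.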


Holdout Groups

A holdout group never receives treatment.

Example:

| Group | Customers |
|---|---|
| Treatment | 90,000 |
| Holdout | 10,000 |

The holdout's outcome, scaled to the size of the treatment group, serves as an estimate of the counterfactual.

Many large technology companies use holdout experiments continuously.
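Because the two groups differ in size, totals cannot be compared directly. The sketch below scales the holdout's per-customer sales up to the treatment group's size. The sales totals are hypothetical, and only the group sizes come from the example above.

```python
def incrementality_from_holdout(treat_sales, treat_n, holdout_sales, holdout_n):
    """Estimate incremental sales and lift using a holdout as the baseline."""
    per_holdout = holdout_sales / holdout_n
    # Counterfactual estimate: what the treated customers would have spent untreated.
    baseline = per_holdout * treat_n
    incremental = treat_sales - baseline
    return incremental, incremental / baseline

# Sales totals are made up for illustration.
incremental, lift = incrementality_from_holdout(
    treat_sales=4_950_000, treat_n=90_000,
    holdout_sales=500_000, holdout_n=10_000,
)
print(f"Incremental sales: {incremental:,.0f}, lift: {lift:.1%}")
```

Here the holdout spends 50 per customer, which implies a baseline of 4,500,000 for the treated group, incremental sales of 450,000, and a lift of 10%.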


Supply Chain Example

Suppose a wholesaler tests a new inventory recommendation system.

Treatment retailers:

Receive recommendations.

Control retailers:

Business as usual.

Results:

Control sales:

$$400,000$$

Treatment sales:

$$500,000$$

Lift:

$$Lift=\frac{500000-400000}{400000}=25\%$$

The recommendation system appears effective.


Bayesian A/B Testing

Instead of p-values, Bayesian methods estimate:

$$P(Treatment\ Better\ Than\ Control\ |\ Data)$$

Example:

$$P(T>C|Data)=0.97$$

Interpretation:

Given the model and prior, there is a 97% probability that treatment outperforms control.

Many practitioners find this interpretation more intuitive.
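One simple way to obtain such a probability for conversion rates is a Beta-Binomial model. The sketch below assumes uniform Beta(1, 1) priors and estimates the probability by Monte Carlo sampling. The counts (520 and 500 conversions out of 10,000) are hypothetical, and other priors or models would give different answers.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

def prob_treatment_better(x_t, n_t, x_c, n_c, draws=20_000):
    """Monte Carlo estimate of P(p_T > p_C | data) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + successes, 1 + failures).
        p_t = random.betavariate(1 + x_t, 1 + n_t - x_t)
        p_c = random.betavariate(1 + x_c, 1 + n_c - x_c)
        wins += p_t > p_c
    return wins / draws

prob = prob_treatment_better(x_t=520, n_t=10_000, x_c=500, n_c=10_000)
print(f"P(treatment better | data) ~ {prob:.2f}")
```

For these assumed counts the probability is around three in four: suggestive, but far from the 97% in the example above.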


Python Example: Lift Calculation

control = 500000
treatment = 600000
lift = (treatment - control) / control
print(f"Lift: {lift:.2%}")

Output:

Lift: 20.00%

Python Example: Two-Proportion Z-Test

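from statsmodels.stats.proportion import proportions_ztest

# Conversions and sample sizes, assumed to be listed as [treatment, control]
successes = [520, 500]
observations = [10000, 10000]

# Two-sided z-test of H0: the two conversion rates are equal
z_stat, p_value = proportions_ztest(successes, observations)

print(f"z-statistic: {z_stat:.3f}")
print(f"p-value: {p_value:.3f}")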

Complete Business Example

Suppose:

Control Group:

$$10000$$

customers

Conversion rate:

$$4\%$$

Treatment Group:

$$10000$$

customers

Conversion rate:

$$5\%$$

Absolute lift:

$$5\%-4\%=1\ percentage\ point$$

Relative lift:

$$\frac{5\%-4\%}{4\%}=25\%$$

This means the campaign improved conversions by:

$$25\%$$

relative to baseline.


Key Takeaways

A/B testing is the foundation of causal inference in business.

The most important formulas introduced today are:

$$Treatment\ Effect=Outcome_T-Outcome_C$$

$$Lift=\frac{Treatment-Control}{Control}$$

$$H_0:\mu_T=\mu_C$$

$$H_A:\mu_T\neq\mu_C$$

$$Z=\frac{\hat p_T-\hat p_C}{SE}$$

$$Power=P(Reject\ H_0\ |\ H_A\ True)$$

$$SE\propto\frac{1}{\sqrt n}$$

Modern experimentation helps organizations measure:

  • Advertising effectiveness
  • Product improvements
  • Website optimization
  • Medical interventions
  • Inventory strategies

Most importantly, it helps distinguish:

Correlation

from

Causation


Exercises

  1. Calculate lift when treatment sales are $750,000 and control sales are $600,000.
  2. Explain why randomization is important.
  3. What is the difference between correlation and causation?
  4. Why do we need holdout groups?
  5. Explain incrementality to a marketing manager.

References

  1. Kohavi, Ron, Tang, Diane, Xu, Ya. Trustworthy Online Controlled Experiments.
  2. Montgomery, Douglas C. Design and Analysis of Experiments.
  3. Imbens, Guido W., Rubin, Donald B. Causal Inference for Statistics, Social, and Biomedical Sciences.
  4. Gelman, Andrew et al. Bayesian Data Analysis.
  5. Hernán, Miguel A., Robins, James M. Causal Inference: What If.

Next Lesson

Lesson 8: Causal Inference for Data Scientists — Potential Outcomes, Propensity Scores, Difference-in-Differences, and Synthetic Controls, where we learn how to estimate causal effects when randomized A/B tests are impossible.
