Lesson 9: Poisson Regression: Modeling Counts and Event Frequencies

Introduction

In the previous lesson, we studied Logistic Regression.

Logistic Regression is used when the outcome is binary:

0 or 1

Examples:

  • Sold vs Not Sold
  • Disease vs No Disease
  • Readmitted vs Not Readmitted

However, many business and healthcare problems involve counts.

Examples:

Healthcare

  • Number of hospital visits
  • Number of emergency admissions
  • Number of infections

Supply Chain

  • Number of sales per SKU
  • Number of orders
  • Number of returns

Retail

  • Number of purchases
  • Number of transactions
  • Number of customer visits

Count data behaves differently from continuous data.

A count:

0
1
2
3
4
...

can never be:

-3

or

2.7

Poisson Regression was specifically designed for modeling count outcomes.


When to Use Poisson Regression

Use Poisson Regression when:

Y is a count

Examples:

Outcome            Poisson?
Number of Sales    Yes
Number of Visits   Yes
Number of Claims   Yes
Revenue            No
Inventory Value    No
Length of Stay     Usually No

The Poisson Distribution

Poisson Regression assumes:

Y ~ Poisson(λ)

where:

λ

represents the expected number of events.

Example:

λ = 5

means:

Expected count = 5
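To make λ concrete, the sketch below computes Poisson probabilities for λ = 5 from the standard formula P(Y = k) = exp(-λ) λ^k / k! and then checks by simulation that the average count lands near 5. The helper name poisson_pmf, the sample size, and the random seed are illustrative choices, not part of the lesson's own code.

```python
import math
import numpy as np

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5
for k in range(0, 11):
    print(k, round(poisson_pmf(k, lam), 4))

# Simulate many Poisson draws; the sample mean should sit near lam.
rng = np.random.default_rng(0)
draws = rng.poisson(lam, size=100_000)
print("simulated mean:", draws.mean())
```

Counts near 5 are the most likely, but any non-negative integer is possible.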

Example

Suppose a SKU sells:

Month   Sales Count
Jan     4
Feb     7
Mar     5
Apr     6

The outcome is:

Number of Sales

This is a count variable.


Why Not Linear Regression?

Suppose we use:

SalesCount = β₀ + β₁ × Inventory

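Predictions might become:

-2.5

This is an impossible count.

A fractional prediction such as 3.7 is not a problem in itself, because a prediction is an expected value and Poisson Regression also produces fractional expected counts. The real problems are that a straight line can go below zero and that Linear Regression assumes the same spread at every level of the predictors, whereas counts tend to vary more as their mean grows.

Poisson Regression avoids these issues.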
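As a quick illustration, the sketch below fits an ordinary least squares line to the small inventory dataset used later in this lesson and evaluates it at a low inventory level. The value 10 is a hypothetical input chosen to sit below the observed range.

```python
import numpy as np

inventory = np.array([50, 100, 150, 200, 250])
sales = np.array([2, 5, 8, 12, 15])

# Ordinary least squares line: sales = intercept + slope * inventory
slope, intercept = np.polyfit(inventory, sales, deg=1)

low_inventory = 10  # hypothetical value below the observed range
linear_prediction = intercept + slope * low_inventory

print("slope:", slope)
print("intercept:", intercept)
print("predicted sales at inventory 10:", linear_prediction)
```

The fitted intercept is negative, so the line predicts a negative sales count for small inventory values.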


The Poisson Regression Model

Poisson Regression models:

log(λ) = β₀ + β₁x₁ + ... + βₖxₖ

where:

λ

is the expected count.

Notice:

log(λ)

instead of:

λ

This guarantees:

λ > 0

because λ = exp(β₀ + β₁x₁ + ... + βₖxₖ) and the exponential of any number is positive. An expected count must be positive.


Understanding the Log Link

Suppose:

β₀ = 1

Then:

log(λ) = 1

Taking the exponential:

λ = exp(1)

which equals:

2.718

Expected count:

Approximately 2.7 events
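The same calculation can be sketched for several values of the linear predictor at once. The values in eta are arbitrary and only meant to show that exponentiating always gives a positive expected count, even when log(λ) is negative.

```python
import numpy as np

# A few arbitrary values of the linear predictor log(lambda), including negative ones
eta = np.array([-3.0, -1.0, 0.0, 1.0, 2.5])

# Inverse of the log link: lambda = exp(log(lambda))
lam = np.exp(eta)

for e, l in zip(eta, lam):
    print(f"log(lambda) = {e:5.1f}  ->  lambda = {l:.3f}")
```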

Example Dataset

Suppose we want to predict annual SKU sales.

import pandas as pd

# Small illustrative dataset: inventory level and sales count per SKU
data = pd.DataFrame({
    "Inventory": [50, 100, 150, 200, 250],
    "SalesCount": [2, 5, 8, 12, 15],
})
print(data)

Visualizing the Data

import matplotlib.pyplot as plt

# Scatter plot of sales count against inventory
plt.scatter(data["Inventory"], data["SalesCount"])
plt.xlabel("Inventory")
plt.ylabel("Sales Count")
plt.title("Inventory vs Sales Count")
plt.show()

Question:

Does higher inventory
lead to more sales?

Fitting a Poisson Regression

Import Statsmodels.

import statsmodels.api as sm

Create predictors:

X = data["Inventory"]

Add intercept:

X = sm.add_constant(X)

Outcome:

y = data["SalesCount"]

Fit model:

# GLM with a Poisson family uses the log link by default
model = sm.GLM(
    y,
    X,
    family=sm.families.Poisson(),
).fit()
print(model.summary())

Understanding the Output

Focus on:

coef

and

P>|z|

These tell us:

  • Direction of relationship
  • Statistical significance

Interpreting Coefficients

Suppose coefficient for Inventory is:

0.01

Interpretation:

Inventory affects:

log(Expected Sales Count)

which is not very intuitive.

Instead we exponentiate.


Incidence Rate Ratios

Exponentiate coefficients.

import numpy as np

# exp(coefficient) is the multiplicative change in the expected count
print(np.exp(model.params))

Suppose:

exp(0.01) ≈ 1.010

Interpretation:

Each additional unit of inventory multiplies the expected sales count by about 1.010, an increase of roughly 1.0%, holding any other predictors fixed.

This is much easier to understand.


Making Predictions

Predict expected sales count.

Suppose:

Inventory = 180

Prediction:

# Each row is [constant, Inventory], matching the columns of X
prediction = model.predict([[1, 180]])
print(prediction)

Output (approximately, for the five-row dataset above):

9.1

Interpretation:

Expected sales count ≈ 9

Healthcare Example

Suppose we study emergency room visits.

Dataset:

# Small illustrative dataset: patient age and number of emergency room visits
patients = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70],
    "Visits": [1, 2, 3, 5, 7],
})

Fit model:

X = sm.add_constant(patients["Age"])
y = patients["Visits"]

model = sm.GLM(
    y,
    X,
    family=sm.families.Poisson(),
).fit()
print(model.summary())

Question:

Does age increase
emergency visits?

Supply Chain Example

Suppose we model annual sales count.

# Same illustrative SKU data as in the earlier example
inventory = pd.DataFrame({
    "Inventory": [50, 100, 150, 200, 250],
    "SalesCount": [2, 5, 8, 12, 15],
})

Fit model:

X = sm.add_constant(inventory["Inventory"])
y = inventory["SalesCount"]

model = sm.GLM(
    y,
    X,
    family=sm.families.Poisson(),
).fit()
print(model.summary())

Question:

How does inventory
affect sales count?

Exposure Variables

Sometimes observations have different exposure periods.

Example:

Store   Sales   Days Open
A       100     365
B       50      180

Store B had less time to generate sales.

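Poisson models can account for exposure by adding log(exposure) as an offset, so the model describes a rate (events per unit of exposure) rather than a raw count.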


Example

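# exposure holds one positive value per observation (for example, days open)
model = sm.GLM(
    y,
    X,
    family=sm.families.Poisson(),
    offset=np.log(exposure),  # enters the model with its coefficient fixed at 1
).fit()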

This is common in:

  • Insurance
  • Healthcare
  • Operations research

A Key Assumption

Poisson Regression assumes:

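Mean = Variance

More precisely, for any given values of the predictors, the variance of the count is assumed to equal its expected value λ.

This assumption is often violated.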

Example:

Mean Sales Count = 5
Variance = 40

This is called:

Overdispersion

and is extremely common.


Checking for Overdispersion

Calculate:

mean_count = y.mean()
variance_count = y.var()
print(mean_count)
print(variance_count)

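If:

Variance >> Mean

Poisson Regression may not be appropriate.

This raw comparison is only a rough first screen, because the assumption concerns the variance around the fitted values, not the variance of the outcome overall. After fitting, a more direct signal is the Pearson chi-square statistic divided by the residual degrees of freedom: values well above 1 suggest overdispersion.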


Why Overdispersion Matters

Suppose:

Mean = 5
Variance = 50

Poisson assumes:

Variance = 5

The model underestimates uncertainty.

This leads to:

  • Incorrect p-values
  • Overconfidence
  • Misleading conclusions

Real-World Example

SKU sales often look like:

0
0
1
0
3
0
12
0
0
7

Notice:

  • Many zeros
  • Large variation

Poisson frequently struggles here.
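The numbers above can be checked directly. The sketch below computes the mean, the variance, and the share of zeros for this series, and compares the observed share of zeros with what a Poisson distribution with the same mean would imply, exp(-mean). The variable names are illustrative.

```python
import numpy as np

sales = np.array([0, 0, 1, 0, 3, 0, 12, 0, 0, 7])

mean = sales.mean()
variance = sales.var(ddof=1)        # sample variance
zero_share = (sales == 0).mean()    # observed fraction of zeros

# Fraction of zeros a Poisson distribution with this mean would imply
poisson_zero_share = np.exp(-mean)

print("mean:", mean)
print("variance:", variance)
print("observed share of zeros:", zero_share)
print("Poisson-implied share of zeros:", poisson_zero_share)
```

The variance is several times the mean, and zeros are far more frequent than a Poisson distribution with this mean would suggest.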


Enter Negative Binomial Regression

Negative Binomial Regression extends Poisson by allowing:

Variance > Mean

This is one of the most important count models in:

  • Retail analytics
  • Healthcare analytics
  • Insurance
  • Supply chain forecasting

We will study it in the next lesson.


Analyst Workflow

When modeling counts:

Visualize:

plt.hist(y, bins=20)
plt.show()

Check:

print(y.mean())
print(y.var())

Fit model:

model = sm.GLM(
    y,
    X,
    family=sm.families.Poisson(),
).fit()

Interpret:

print(np.exp(model.params))

Evaluate:

  • Coefficients
  • p-values
  • Overdispersion

Healthcare Exercise

Predict:

Number of Hospital Visits

using:

Age
BMI
Smoking Status

Questions:

  • Which variables increase visit frequency?
  • Which variables are statistically significant?

Supply Chain Exercise

Predict:

Annual SKU Sales Count

using:

Inventory
Price
Customer Turn
Deployment Value

Questions:

  • Which factors drive sales frequency?
  • What is the expected sales count?

Lesson Summary

In this lesson we learned:

  • When to use Poisson Regression
  • The Poisson distribution
  • Count data modeling
  • Log-link functions
  • Incidence rate ratios
  • Exposure variables
  • Predictions
  • Overdispersion

Poisson Regression is the standard starting point for count data, but real-world count data often exhibits overdispersion.

In the next lesson we will study Negative Binomial Regression, one of the most important models for SKU demand, healthcare utilization, claims frequency, and other real-world count outcomes.
