Introduction
In the previous lesson, we studied Logistic Regression.
Logistic Regression is used when the outcome is binary:
0 or 1
Examples:
- Sold vs Not Sold
- Disease vs No Disease
- Readmitted vs Not Readmitted
However, many business and healthcare problems involve counts.
Examples:
Healthcare
- Number of hospital visits
- Number of emergency admissions
- Number of infections
Supply Chain
- Number of sales per SKU
- Number of orders
- Number of returns
Retail
- Number of purchases
- Number of transactions
- Number of customer visits
Count data behaves differently from continuous data.
A count:
0, 1, 2, 3, 4, ...
can never be:
-3
or
2.7
Poisson Regression was specifically designed for modeling count outcomes.
When to Use Poisson Regression
Use Poisson Regression when:
Y is a count
Examples:
| Outcome | Poisson? |
|---|---|
| Number of Sales | Yes |
| Number of Visits | Yes |
| Number of Claims | Yes |
| Revenue | No |
| Inventory Value | No |
| Length of Stay | Usually No |
The Poisson Distribution
Poisson Regression assumes:
Y ~ Poisson(λ)
where:
λ
represents the expected number of events.
Example:
λ = 5
means:
Expected count = 5
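To make λ concrete, the Poisson probability of seeing exactly k events is e^(-λ) · λ^k / k!. The sketch below evaluates that formula for λ = 5 using only the standard library. The helper name `poisson_pmf` and the cutoff at 30 events are our own choices for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5
# Counts above 30 are so unlikely for lam = 5 that we stop there
probs = [poisson_pmf(k, lam) for k in range(0, 31)]

# Probabilities for a few counts around the expected value
for k in (0, 3, 5, 8):
    print(k, round(probs[k], 4))

# The mean and variance implied by these probabilities are both close to lam
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(round(mean, 3), round(variance, 3))
```

Both the mean and the variance come out at about 5, a property of the Poisson distribution that matters later when we discuss overdispersion.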
Example
Suppose a SKU sells:
| Month | Sales Count |
|---|---|
| Jan | 4 |
| Feb | 7 |
| Mar | 5 |
| Apr | 6 |
The outcome is:
Number of Sales
This is a count variable.
Why Not Linear Regression?
Suppose we use:
SalesCount = β₀ + β₁ × Inventory
Predictions might become:
-2.5
or
3.7
These are impossible counts.
Poisson Regression avoids this issue.
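As a quick illustration, the sketch below fits an ordinary least squares line by hand to the small inventory dataset used later in this lesson, then predicts at a low inventory level. The value Inventory = 10 is a hypothetical input chosen to show the problem.

```python
inventory = [50, 100, 150, 200, 250]
sales = [2, 5, 8, 12, 15]

n = len(inventory)
x_mean = sum(inventory) / n
y_mean = sum(sales) / n

# Ordinary least squares slope and intercept for a single predictor
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(inventory, sales))
sxx = sum((x - x_mean) ** 2 for x in inventory)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Predict at a low inventory level (hypothetical input)
low_inventory_prediction = intercept + slope * 10
print(slope, intercept, low_inventory_prediction)
```

The fitted intercept is negative, so the straight line predicts a negative sales count for small inventories.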
The Poisson Regression Model
Poisson Regression models:
\log(\lambda)=\beta_0+\beta_1x_1+\cdots+\beta_kx_k
where:
λ
is the expected count.
Notice:
log(λ)
instead of:
λ
This guarantees:
λ > 0
which is required for counts.
Understanding the Log Link
Suppose:
β₀ = 1
Then:
log(λ) = 1
Taking the exponential:
λ = exp(1)
which equals:
2.718
Expected count:
Approximately 2.7 events
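The same calculation works for any value of the linear predictor. The sketch below uses a few hypothetical values, including negative ones, to show that exponentiating always gives a positive expected count.

```python
import math

# Hypothetical values of the linear predictor beta_0 + beta_1 * x
linear_predictors = [-5.0, -1.0, 0.0, 1.0, 3.0]

# Exponentiating maps any real number to a positive expected count
expected_counts = [math.exp(eta) for eta in linear_predictors]

for eta, expected_count in zip(linear_predictors, expected_counts):
    print(eta, round(expected_count, 3))
```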
Example Dataset
Suppose we want to predict annual SKU sales.
import pandas as pd
data = pd.DataFrame({
    "Inventory": [50, 100, 150, 200, 250],
    "SalesCount": [2, 5, 8, 12, 15]
})
print(data)
Visualizing the Data
import matplotlib.pyplot as plt
plt.scatter(data["Inventory"], data["SalesCount"])
plt.xlabel("Inventory")
plt.ylabel("Sales Count")
plt.title("Inventory vs Sales Count")
plt.show()
Question:
Does higher inventory lead to more sales?
Fitting a Poisson Regression
Import Statsmodels.
import statsmodels.api as sm
Create predictors:
X = data["Inventory"]
Add intercept:
X = sm.add_constant(X)
Outcome:
y = data["SalesCount"]
Fit model:
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())
Understanding the Output
Focus on:
coef
and
P>|z|
These tell us:
- Direction of relationship
- Statistical significance
Interpreting Coefficients
Suppose the coefficient for Inventory is:
0.01
Interpretation:
Inventory affects:
log(Expected Sales Count)
which is not very intuitive.
Instead we exponentiate.
Incidence Rate Ratios
Exponentiate coefficients.
import numpy as np
np.exp(model.params)
Suppose:
exp(0.01)=1.010
Interpretation:
Each additional inventory unit multiplies the expected sales count by about 1.010, an increase of roughly 1.0%.
This is much easier to understand.
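Because the effect is multiplicative, it compounds across units rather than adding up. The sketch below uses the hypothetical coefficient of 0.01 from above and compares a one-unit change with a 50-unit change. The 50-unit step is our own illustrative choice.

```python
import math

coef = 0.01  # hypothetical Inventory coefficient from the text

# Incidence rate ratio for one extra unit of inventory
irr = math.exp(coef)
percent_change_per_unit = (irr - 1) * 100

# Effects multiply across units: 50 extra units scale the
# expected count by exp(50 * coef), not by 50 * 1%
irr_50_units = math.exp(50 * coef)
percent_change_50_units = (irr_50_units - 1) * 100

print(round(irr, 4), round(percent_change_per_unit, 2))
print(round(irr_50_units, 4), round(percent_change_50_units, 1))
```

With this coefficient, 50 extra units raise the expected count by noticeably more than 50%, which is why the ratio should be read as a multiplier.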
Making Predictions
Predict expected sales count.
Suppose:
Inventory = 180
Prediction:
# The first value (1) is the intercept term, the second is Inventory
prediction = model.predict([[1, 180]])
print(prediction)
Suppose the output is:
9.8
Interpretation:
Expected sales count ≈ 10
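It can help to see what the prediction does under the hood: the model adds up the linear predictor and then exponentiates it. The sketch below does this by hand. The coefficients 0.5 and 0.01 are hypothetical values chosen for illustration, not the output of the fitted model above.

```python
import math

# Hypothetical coefficients, chosen only for illustration
intercept = 0.5
inventory_coef = 0.01

inventory = 180

# Step 1: linear predictor on the log scale
log_expected = intercept + inventory_coef * inventory

# Step 2: undo the log link to get the expected count
expected_sales = math.exp(log_expected)
print(round(expected_sales, 2))
```

With these assumed coefficients the expected count is close to 10.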
Healthcare Example
Suppose we study emergency room visits.
Dataset:
patients = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70],
    "Visits": [1, 2, 3, 5, 7]
})
Fit model:
X = sm.add_constant(patients["Age"])
y = patients["Visits"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())
Question:
Does age increase emergency visits?
Supply Chain Example
Suppose we model annual sales count.
inventory = pd.DataFrame({
    "Inventory": [50, 100, 150, 200, 250],
    "SalesCount": [2, 5, 8, 12, 15]
})
Fit model:
X = sm.add_constant(inventory["Inventory"])
y = inventory["SalesCount"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())
Question:
How does inventory affect sales count?
Exposure Variables
Sometimes observations have different exposure periods.
Example:
| Store | Sales | Days Open |
|---|---|---|
| A | 100 | 365 |
| B | 50 | 180 |
Store B had less time to generate sales.
Poisson models can account for exposure.
Example
# exposure holds each observation's exposure (for example, Days Open) and must be positive
model = sm.GLM(
    y, X,
    family=sm.families.Poisson(),
    offset=np.log(exposure)
).fit()
This is common in:
- Insurance
- Healthcare
- Operations research
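To see why exposure matters, the sketch below converts the two stores' sales into daily rates, then shows how a log offset turns a rate back into an expected count. The 180-day projection and the variable names are hypothetical, for illustration only.

```python
import math

stores = {
    "A": {"sales": 100, "days_open": 365},
    "B": {"sales": 50, "days_open": 180},
}

# Sales per day open: the quantity an offset model effectively describes
rates = {}
for name, s in stores.items():
    rates[name] = s["sales"] / s["days_open"]
    print(name, round(rates[name], 3))

# With an offset, log(expected sales) = log(days_open) + linear predictor,
# so exp(linear predictor) is a rate per day rather than a raw count.
hypothetical_log_rate = math.log(rates["A"])
expected_sales_180_days = math.exp(math.log(180) + hypothetical_log_rate)
print(round(expected_sales_180_days, 1))
```

Store B sold half as much in total, yet its daily rate is slightly higher than Store A's, which a model without an offset would miss.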
A Key Assumption
Poisson Regression assumes:
Mean = Variance (conditional on the predictors)
This assumption is often violated.
Example:
Mean Sales Count = 5
Variance = 40
This is called:
Overdispersion
and is extremely common.
Checking for Overdispersion
Calculate:
mean_count = y.mean()
variance_count = y.var()
print(mean_count)
print(variance_count)
If:
Variance >> Mean
Poisson Regression may not be appropriate.
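Comparing the raw mean and variance ignores the predictors. A common complementary check is the Pearson dispersion statistic, computed from the fitted model: the sum of (observed - fitted)² / fitted, divided by the residual degrees of freedom. The sketch below computes it by hand on hypothetical observed counts and fitted means. With statsmodels, the same ratio should be available from a fitted GLM as `model.pearson_chi2 / model.df_resid`.

```python
# Hypothetical observed counts and fitted means from a Poisson model
observed = [0, 2, 1, 9, 3, 0, 14, 4]
fitted = [2.0, 3.0, 3.0, 5.0, 4.0, 3.0, 7.0, 6.0]
n_params = 2  # intercept plus one predictor

# Pearson chi-square: squared residuals scaled by the Poisson variance (the mean)
pearson_chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))

# Residual degrees of freedom: observations minus estimated parameters
df_resid = len(observed) - n_params

dispersion = pearson_chi2 / df_resid
print(round(dispersion, 2))
```

A value near 1 is consistent with the Poisson assumption. Values well above 1, as in this made-up example, point to overdispersion.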
Why Overdispersion Matters
Suppose:
Mean = 5
Variance = 50
Poisson assumes:
Variance = 5
The model underestimates uncertainty.
This leads to:
- Incorrect p-values
- Overconfidence
- Misleading conclusions
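A rough way to see the size of the problem is to compare the standard error of the mean under the Poisson assumption with the one implied by the actual variance. The sketch below uses the mean and variance from the example above. The sample size of 100 is a hypothetical value.

```python
import math

mean, variance, n = 5, 50, 100  # n is a hypothetical sample size

# Standard error of the mean if Variance = Mean held
se_poisson = math.sqrt(mean / n)

# Standard error of the mean using the variance actually observed
se_actual = math.sqrt(variance / n)

# How many times too small the Poisson-based standard error is
understatement = se_actual / se_poisson
print(round(se_poisson, 3), round(se_actual, 3), round(understatement, 2))
```

Here the Poisson-based standard error is about three times too small, so confidence intervals would be too narrow and p-values too optimistic.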
Real-World Example
SKU sales often look like:
0, 0, 1, 0, 3, 0, 12, 0, 0, 7
Notice:
- Many zeros
- Large variation
Poisson frequently struggles here.
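One way to quantify "many zeros" is to compare the observed share of zeros with the share a Poisson distribution with the same mean would expect, which is exp(-mean). The sketch below does this for a short, illustrative list of sales counts. The numbers are made up to resemble sparse SKU demand.

```python
import math

# Illustrative sales counts with many zeros and one large value
counts = [0, 0, 1, 0, 3, 0, 12, 0, 0, 7]

mean = sum(counts) / len(counts)

# Share of zeros actually observed
observed_zero_share = counts.count(0) / len(counts)

# P(Y = 0) for a Poisson distribution with this mean
poisson_zero_share = math.exp(-mean)

print(mean, observed_zero_share, round(poisson_zero_share, 3))
```

In this example the data contain far more zeros than a Poisson with the same mean would produce.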
Enter Negative Binomial Regression
Negative Binomial Regression extends Poisson by allowing:
Variance > Mean
This is one of the most important count models in:
- Retail analytics
- Healthcare analytics
- Insurance
- Supply chain forecasting
We will study it in the next lesson.
Analyst Workflow
When modeling counts:
Visualize:
plt.hist(y, bins=20)
plt.show()
Check:
print(y.mean())
print(y.var())
Fit model:
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
Interpret:
np.exp(model.params)
Evaluate:
- Coefficients
- p-values
- Overdispersion
Healthcare Exercise
Predict:
Number of Hospital Visits
using:
- Age
- BMI
- Smoking Status
Questions:
- Which variables increase visit frequency?
- Which variables are statistically significant?
Supply Chain Exercise
Predict:
Annual SKU Sales Count
using:
- Inventory
- Price
- Customer Turn
- Deployment Value
Questions:
- Which factors drive sales frequency?
- What is the expected sales count?
Lesson Summary
In this lesson we learned:
- When to use Poisson Regression
- The Poisson distribution
- Count data modeling
- Log-link functions
- Incidence rate ratios
- Exposure variables
- Predictions
- Overdispersion
Poisson Regression is the standard starting point for count data, but real-world count data often exhibits overdispersion.
In the next lesson we will study Negative Binomial Regression, one of the most important models for SKU demand, healthcare utilization, claims frequency, and other real-world count outcomes.
