Lesson 8: Logistic Regression: Predicting Probabilities and Binary Outcomes

Introduction

In the previous lesson, we studied Linear Regression.

Linear Regression works well when the outcome is continuous.

Examples:

Sales Revenue
Length of Stay
Blood Pressure
Diamond Price

However, many real-world problems involve outcomes with only two possibilities.

Examples:

Healthcare

Readmitted or Not Readmitted
Disease or No Disease
Survived or Died

Supply Chain

Sold or Not Sold
Customer Churned or Stayed
Returned or Not Returned

Marketing

Purchased or Did Not Purchase
Clicked or Did Not Click

Linear regression is not appropriate for these situations because its predictions are unbounded, while probabilities must lie between:

0 and 1

Logistic Regression solves this problem.

It predicts probabilities.

The Goal of Logistic Regression

Suppose we want to predict:

Will a deployed diamond sell?

Outcome:

Sold
1
0
1
1
0

where:

1 = Sold

0 = Not Sold

Instead of predicting:

0 or 1 directly

Logistic Regression predicts:

Probability of Sale

For example:

0.82

meaning:

82% chance of sale

Why Linear Regression Fails

Suppose we fit:

Sold = β₀ + β₁ × Age

Predictions could become:

1.4

−0.3

These are impossible probabilities.

Probabilities must satisfy:

0 ≤ p ≤ 1

Logistic Regression guarantees this.
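To make the problem concrete, here is a minimal sketch that fits a straight line by ordinary least squares to a small toy dataset and prints the fitted values. The data values and the helper name `fit_line` are assumptions made for illustration, and the predictor is days in inventory rather than any particular column from a real dataset.

```python
# Toy data (assumed for illustration): days in inventory and whether the item sold.
days_out = [50, 100, 150, 200, 300, 400]
sold = [1, 1, 1, 0, 0, 0]

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(days_out, sold)
linear_predictions = [intercept + slope * d for d in days_out]

# Flag any fitted value that cannot be read as a probability.
for d, p in zip(days_out, linear_predictions):
    flag = "" if 0 <= p <= 1 else "  <-- not a valid probability"
    print(f"DaysOut={d:3d}  predicted={p:6.3f}{flag}")
```

On this toy data the fitted value at 50 days is slightly above 1 and the fitted value at 400 days is below 0, which is exactly the failure described above.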

The Logistic Function

Logistic Regression uses the sigmoid curve.

p=\frac{1}{1+e^{-z}}

where:

z = β₀ + β₁x

Properties:

Always between 0 and 1
Smooth
Interpretable as probability

Understanding the Sigmoid Curve

Suppose:

z = -10

Probability becomes:

Almost 0

Suppose:

z = 0

Probability becomes:

0.5

Suppose:

z = 10

Probability becomes:

Almost 1

This allows us to model probabilities naturally.
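The three cases above can be checked directly. This is a small sketch using only the standard library; the function name `sigmoid` is our own choice.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the curve at the three values discussed in the text.
for z in (-10, 0, 10):
    print(f"z = {z:3d}  ->  p = {sigmoid(z):.6f}")
```

The printed values are roughly 0.00005 for z = -10, exactly 0.5 for z = 0, and roughly 0.99995 for z = 10.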

Example Dataset

Suppose we want to predict whether a deployed diamond sells.

			
import pandas as pd
data = pd.DataFrame({
    "DaysOut":[
        50,
        100,
        150,
        200,
        300,
        400
    ],
    "Sold":[
        1,
        1,
        1,
        0,
        0,
        0
    ]
})

		

Visualizing the Data

			
import matplotlib.pyplot as plt
plt.scatter(
    data["DaysOut"],
    data["Sold"]
)
plt.xlabel("Days Out")
plt.ylabel("Sold")
plt.show()

		

Notice:

Sold

is binary.

Fitting a Logistic Regression Model

We will use Statsmodels.

import statsmodels.api as sm

Define predictors:

X = data["DaysOut"]

Add intercept:

X = sm.add_constant(X)

Define outcome:

y = data["Sold"]

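Fit model:

model = sm.Logit(
    y,
    X
).fit()
print(model.summary())

Note: in this toy dataset every diamond with DaysOut of 150 or less sold and every diamond with DaysOut of 200 or more did not, so DaysOut separates the two classes perfectly. In that situation the maximum likelihood estimates are not finite, and statsmodels will either raise a perfect separation error or issue a warning (depending on the version) instead of returning stable coefficients. Real data, where the classes overlap, does not usually have this problem.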

		

Understanding the Output

The summary includes:

Coefficients
Standard errors
z-statistics
p-values

Initially focus on:

coef

and

P>|z|

Interpreting Coefficients

Suppose the coefficient for DaysOut is:

−0.03

Interpretation:

Each additional day out lowers the log-odds of sale by 0.03,
so as DaysOut increases, the probability of sale decreases.

Negative coefficient:

Probability decreases.

Positive coefficient:

Probability increases.

Making Predictions

Suppose we want the probability of sale after:

180 days

Predict:

# Each row must match the columns of X: the constant (1) first, then DaysOut
prediction = model.predict(
    [[1,180]]
)
print(prediction)

Illustrative output:

0.62

Interpretation:

62% probability of sale
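It can help to see what a prediction amounts to by hand: plug the predictor into the linear part, then pass the result through the sigmoid. The sketch below uses the slope of −0.03 from the coefficient example together with a hypothetical intercept of 5.89, which is an assumption chosen so that the numbers line up with the 62% figure; neither value is a fitted result.

```python
import math

# Hypothetical coefficients, chosen only for illustration
# (the slope matches the -0.03 example; the intercept is an assumption).
beta_0 = 5.89    # intercept
beta_1 = -0.03   # change in log-odds per additional day out

def predict_probability(days_out):
    """Apply the sigmoid to the linear predictor z = beta_0 + beta_1 * days_out."""
    z = beta_0 + beta_1 * days_out
    return 1.0 / (1.0 + math.exp(-z))

# Probability of sale falls as the diamond stays out longer.
for days in (100, 180, 260):
    print(f"DaysOut={days}: p = {predict_probability(days):.2f}")
```

With these assumed coefficients the probability at 180 days rounds to 0.62, and it is higher at 100 days and lower at 260 days, as the negative coefficient implies.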

Classification

Probabilities can be converted into classes.

Common rule:

Probability > 0.5

predict:

Sold (1)

otherwise:

Not Sold (0)
Example

Probability:

0.80

Prediction:

Sold

Probability:

0.20

Prediction:

Not Sold
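The rule can be written as a one-line function. This is a sketch; the function name `classify` and the `threshold` parameter are our own, and 0.5 is only the common default, not a requirement.

```python
def classify(probability, threshold=0.5):
    """Convert a predicted probability into a class label using a cutoff."""
    return "Sold" if probability > threshold else "Not Sold"

for p in (0.80, 0.20):
    print(f"Probability {p:.2f} -> {classify(p)}")
```

Raising or lowering the threshold trades one kind of error for the other, which is why the cutoff is often tuned to the business problem rather than left at 0.5.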

Odds

Logistic Regression models the log of the odds (the log-odds) as a linear function of the predictors.

Odds are:

\text{Odds}=\frac{p}{1-p}

Example:

Probability:

0.80

Odds:

0.80 / 0.20 = 4

Meaning:

4 to 1

in favor of sale.
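Converting between probability and odds is a pair of simple formulas. The helper names below are our own, offered as a small sketch.

```python
def odds_from_probability(p):
    """Odds = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

print(round(odds_from_probability(0.80), 2))   # odds in favor of sale
print(round(probability_from_odds(4.0), 2))    # back to a probability
```

The first line prints 4.0 and the second prints 0.8, matching the 4 to 1 example.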

Odds Ratios

One of the most important concepts.

Coefficient:

β

becomes odds ratio:

e^{\beta}

Example

Suppose:

β = 0.5

Odds ratio:

			
import numpy as np
np.exp(0.5)

Output (rounded):

1.65

Interpretation:

			
A one-unit increase in the predictor
multiplies the odds by 1.65
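A quick numerical check shows why e^β is the multiplier. The sketch below computes the odds at some value x and at x + 1 and compares their ratio with exp(β). The intercept of −1.0 and the starting value x = 2 are hypothetical; any other choice gives the same ratio.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds_at(x, beta_0, beta_1):
    """Odds p / (1 - p) implied by the model at predictor value x."""
    p = sigmoid(beta_0 + beta_1 * x)
    return p / (1.0 - p)

beta_0 = -1.0   # hypothetical intercept (assumption)
beta_1 = 0.5    # coefficient from the example

x = 2.0
ratio = odds_at(x + 1, beta_0, beta_1) / odds_at(x, beta_0, beta_1)
print(round(ratio, 4), round(math.exp(beta_1), 4))

# A negative coefficient gives an odds ratio below 1.
print(round(math.exp(-0.03), 4))
```

Both numbers on the first line are 1.6487, which rounds to 1.65. For the DaysOut coefficient of −0.03 the odds ratio is about 0.9704, meaning each additional day multiplies the odds of sale by roughly 0.97.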

Healthcare Example

Predict hospital readmission.

Dataset:

			
patients = pd.DataFrame({
    "Age":[
        40,
        55,
        70,
        80,
        90
    ],
    "Readmitted":[
        0,
        0,
        1,
        1,
        1
    ]
})

		

Fit model:

			
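# Note: Age separates the classes perfectly in this toy data (55 and under: 0,
# 70 and over: 1), so expect a perfect separation warning or error.
X = sm.add_constant(
    patients["Age"]
)
y = patients["Readmitted"]
model = sm.Logit(
    y,
    X
).fit()
print(model.summary())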

		

Question:

			
Does age increase
readmission risk?

Supply Chain Example

Predict whether inventory sells.

			
inventory = pd.DataFrame({
    "DaysOut":[
        50,
        100,
        150,
        200,
        300,
        400
    ],
    "Sold":[
        1,
        1,
        1,
        0,
        0,
        0
    ]
})

		

Fit model:

			
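# Note: DaysOut separates the classes perfectly in this toy data (150 and under: 1,
# 200 and over: 0), so expect a perfect separation warning or error.
X = sm.add_constant(
    inventory["DaysOut"]
)
y = inventory["Sold"]
model = sm.Logit(
    y,
    X
).fit()
print(model.summary())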

		

Question:

			
How does inventory age
affect probability of sale?

Multiple Logistic Regression

Real-world models usually use multiple predictors.

Example:

Inventory Age
Price
Carat Weight
Color Grade
Clarity Grade

The model becomes:

\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k
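Read from left to right, the equation says: add up the weighted predictors to get the log-odds, then convert the log-odds back to a probability. The sketch below does this for two predictors. The coefficient values are hypothetical and chosen only to keep the arithmetic easy; they are not fitted estimates.

```python
import math

# Hypothetical coefficients for illustration only (not fitted values).
coefficients = {"intercept": 4.0, "DaysOut": -0.02, "Price": -0.001}

def log_odds(features):
    """Linear predictor: beta_0 + beta_1 * x_1 + ... + beta_k * x_k."""
    return coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in features.items()
    )

def predict_probability(features):
    """Convert the log-odds to a probability with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))

diamond = {"DaysOut": 100, "Price": 1500}
z = log_odds(diamond)
p = predict_probability(diamond)
print(f"log-odds = {z:.2f}, probability = {p:.4f}")
```

Here the log-odds are 4.0 − 2.0 − 1.5 = 0.5, and taking log(p / (1 − p)) of the resulting probability recovers that same 0.5.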

Example

			
data = pd.DataFrame({
    "DaysOut":[
        50,
        100,
        150,
        200,
        300
    ],
    "Price":[
        1000,
        1500,
        1200,
        2000,
        2500
    ],
    "Sold":[
        1,
        1,
        1,
        0,
        0
    ]
})
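# Note: with only five rows, both DaysOut and Price separate the classes
# perfectly, so expect a perfect separation warning or error on this toy data.
X = data[
    ["DaysOut","Price"]
]
X = sm.add_constant(X)
y = data["Sold"]
model = sm.Logit(
    y,
    X
).fit()
print(model.summary())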

		

Measuring Classification Accuracy

Suppose predictions are:

Actual	Predicted
1	1
1	1
0	0
1	0

Accuracy:

3 correct / 4 total

75%
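The same calculation in code, using the four rows from the table above (the variable names are our own):

```python
# Actual and predicted classes from the table.
actual = [1, 1, 0, 1]
predicted = [1, 1, 0, 0]

# Count the rows where the prediction matches the actual class.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct} correct / {len(actual)} total = {accuracy:.0%}")
```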

Confusion Matrix

Very important for classification.

			
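from sklearn.metrics import confusion_matrix
# Actual and predicted classes from the accuracy example
y_true = [1, 1, 0, 1]
y_pred = [1, 1, 0, 0]
print(confusion_matrix(
    y_true,
    y_pred
))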

		

The output is a 2 × 2 array laid out as:

	Predicted 0	Predicted 1
Actual 0	TN	FP
Actual 1	FN	TP
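To see where the four cells come from, here is a sketch that counts them by hand for the four-row accuracy example, without scikit-learn. The variable names are our own.

```python
# Actual and predicted classes from the accuracy example.
y_true = [1, 1, 0, 1]
y_pred = [1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives

# Same layout as the table: rows are actual classes, columns are predicted.
print([[tn, fp], [fn, tp]])
```

For this example the counts are TN = 1, FP = 0, FN = 1, TP = 2.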

Precision

Measures:

			
Of predicted positives,
how many were correct?

Recall

Measures:

			
Of actual positives,
how many were found?
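Both measures follow directly from the confusion matrix cells: precision is TP / (TP + FP) and recall is TP / (TP + FN). A short sketch, using the cell counts from the four-row accuracy example (TP = 2, FP = 0, FN = 1):

```python
# Cell counts from the four-row example: actual [1, 1, 0, 1], predicted [1, 1, 0, 0].
tp, fp, fn = 2, 0, 1

precision = tp / (tp + fp)   # of predicted positives, how many were correct
recall = tp / (tp + fn)      # of actual positives, how many were found

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Every predicted sale was a real sale, so precision is 1.00, but one actual sale was missed, so recall is 0.67.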

ROC Curve

ROC curves evaluate classification models across different probability thresholds.

from sklearn.metrics import roc_curve

We will study this more deeply in future machine learning lessons.
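As a preview, each point on a ROC curve is a pair (false positive rate, true positive rate) computed at one threshold. The sketch below computes three such points by hand. The classes and predicted probabilities are hypothetical, and `roc_point` is our own helper, not part of scikit-learn.

```python
# Hypothetical actual classes and predicted probabilities (illustration only).
actual = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.2]

def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives

# A lower threshold finds more positives but also lets in more false alarms.
for threshold in (0.3, 0.5, 0.7):
    fpr, tpr = roc_point(threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Plotting these pairs across many thresholds traces out the ROC curve.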

Analyst Workflow

When building a logistic regression:

Visualize:

			
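# Plot one predictor column against the outcome; X also holds the constant,
# and "DaysOut" stands in for whichever predictor you want to inspect.
plt.scatter(
    X["DaysOut"],
    y
)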

Fit model:

			
model = sm.Logit(
    y,
    X
).fit()

Inspect:

print(model.summary())

Predict:

model.predict()

Evaluate:

Accuracy
Precision
Recall
ROC-AUC

Healthcare Exercise

Predict:

Readmitted

using:

			
Age
Length of Stay
Blood Pressure

Questions:

Which factors matter most?
Which factors increase risk?

Supply Chain Exercise

Predict:

Sold

using:

			
Days Out
Price
Inventory Level
Customer Turn

Questions:

Which variables increase probability of sale?
Which variables reduce probability of sale?

Lesson Summary

In this lesson we learned:

Why Logistic Regression exists
The sigmoid function
Probability prediction
Odds
Odds ratios
Logistic coefficients
Classification
Confusion matrices
Accuracy
Multiple logistic regression

Logistic Regression is one of the most important models in analytics because many real-world outcomes are binary. It is widely used in healthcare, finance, operations research, customer analytics, and supply chain optimization.

In the next lesson we will study Poisson Regression, the standard model for count data such as hospital visits, claims, transactions, and SKU sales counts.
