Lesson 8: Logistic Regression: Predicting Probabilities and Binary Outcomes

Introduction

In the previous lesson, we studied Linear Regression.

Linear Regression works well when the outcome is continuous.

Examples:

  • Sales Revenue
  • Length of Stay
  • Blood Pressure
  • Diamond Price

However, many real-world problems involve outcomes with only two possibilities.

Examples:

Healthcare

  • Readmitted or Not Readmitted
  • Disease or No Disease
  • Survived or Died

Supply Chain

  • Sold or Not Sold
  • Customer Churned or Stayed
  • Returned or Not Returned

Marketing

  • Purchased or Did Not Purchase
  • Clicked or Did Not Click

Linear regression is not appropriate for these situations because its predictions are unbounded, while probabilities must lie between:

0 and 1

Logistic Regression solves this problem.

It predicts probabilities.


The Goal of Logistic Regression

Suppose we want to predict:

Will a deployed diamond sell?

Outcome:

Sold
1
0
1
1
0

where:

1 = Sold
0 = Not Sold

Instead of predicting:

0 or 1 directly

Logistic Regression predicts:

Probability of Sale

For example:

0.82

meaning:

82% chance of sale

Why Linear Regression Fails

Suppose we fit:

Sold = β₀ + β₁ × Age

Predictions could become:

1.4

or

−0.3

These are impossible probabilities.

Probabilities must satisfy:

0 ≤ p ≤ 1

Logistic Regression guarantees this.
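
To see the problem concretely, here is a minimal sketch that fits a straight line to a binary outcome. The six data points and the variable names are made up for illustration, and `np.polyfit` stands in for a full linear regression.

```python
import numpy as np

# Hypothetical toy data: inventory age in days and whether the item sold (1) or not (0)
age = np.array([50, 100, 150, 200, 300, 400], dtype=float)
sold = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Least-squares straight line: sold ≈ b0 + b1 * age
# (polyfit returns the highest-degree coefficient first)
b1, b0 = np.polyfit(age, sold, deg=1)

# Predictions of the straight line at the observed ages
linear_pred = b0 + b1 * age
print(np.round(linear_pred, 2))

# The line overshoots 1 for the youngest item and drops below 0 for the oldest
print(linear_pred.max() > 1, linear_pred.min() < 0)
```

Nothing in the straight-line model stops it from leaving the 0 to 1 range, which is why its output cannot be read as a probability.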


The Logistic Function

Logistic Regression uses the sigmoid curve.

p=\frac{1}{1+e^{-z}}

where:

z = β₀ + β₁x

Properties:

  • Always between 0 and 1
  • Smooth
  • Interpretable as probability

Understanding the Sigmoid Curve

Suppose:

z = -10

Probability becomes:

Almost 0

Suppose:

z = 0

Probability becomes:

0.5

Suppose:

z = 10

Probability becomes:

Almost 1

This allows us to model probabilities naturally.
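
The three cases above can be checked directly. This is a small sketch; the helper name `sigmoid` is our own choice, not something defined earlier in the lesson.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10, 0, 10):
    # z = -10 gives a value near 0, z = 0 gives exactly 0.5, z = 10 gives a value near 1
    print(z, float(sigmoid(z)))
```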


Example Dataset

Suppose we want to predict whether a deployed diamond sells.

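import pandas as pd

# Toy data. The outcomes overlap around 150-200 days on purpose:
# if a single cutoff split every 1 from every 0, statsmodels would
# report perfect separation instead of returning usable estimates.
data = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 300, 400],
    "Sold":    [1, 1, 0, 1, 0, 0],
})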

Visualizing the Data

import matplotlib.pyplot as plt
plt.scatter(
data["DaysOut"],
data["Sold"]
)
plt.xlabel("Days Out")
plt.ylabel("Sold")
plt.show()

Notice:

Sold

is binary.


Fitting a Logistic Regression Model

We will use Statsmodels.

import statsmodels.api as sm

Define predictors:

X = data["DaysOut"]

Add intercept:

X = sm.add_constant(X)

Define outcome:

y = data["Sold"]

Fit model:

model = sm.Logit(
y,
X
).fit()
print(model.summary())

Understanding the Output

The summary includes:

  • Coefficients
  • Standard errors
  • z-statistics
  • p-values

Initially focus on:

coef

and

P>|z|

Interpreting Coefficients

Suppose coefficient for DaysOut is:

−0.03

Interpretation:

As DaysOut increases,
probability of sale decreases.

Negative coefficient:

Probability decreases.

Positive coefficient:

Probability increases.
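
A short sketch of how the sign of the coefficient shapes the predicted probabilities. The intercepts and slopes below are hypothetical values picked only for illustration, not the result of a real fit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

days_out = np.array([50, 100, 200, 300])

# Hypothetical coefficients chosen only for illustration
b0_neg, b1_neg = 5.89, -0.03   # negative DaysOut coefficient
b0_pos, b1_pos = -5.89, 0.03   # positive DaysOut coefficient

p_neg = sigmoid(b0_neg + b1_neg * days_out)
p_pos = sigmoid(b0_pos + b1_pos * days_out)

print(np.round(p_neg, 3))  # falls as DaysOut grows
print(np.round(p_pos, 3))  # rises as DaysOut grows
```

Note that the sign gives the direction of the effect; the size of the change in probability depends on where on the sigmoid curve you start.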

Making Predictions

Suppose we want the probability of sale after:

180 days

Predict:

# Each row is [const, DaysOut]: the leading 1 matches the intercept column
prediction = model.predict([[1, 180]])
print(prediction)

Suppose the output is:

0.62

Interpretation:

62% probability of sale
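
A prediction is the sigmoid applied to β₀ + β₁ × DaysOut. Here is a minimal sketch of that arithmetic, assuming hypothetical fitted values β₀ = 5.89 and β₁ = −0.03. These were picked for illustration so the result lands near 0.62; a real fit will give different numbers.

```python
import math

b0 = 5.89    # hypothetical intercept
b1 = -0.03   # hypothetical DaysOut coefficient

days_out = 180
z = b0 + b1 * days_out          # linear predictor
p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns z into a probability

print(round(z, 2), round(p, 2))  # 0.49 0.62
```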

Classification

Probabilities can be converted into classes.

Common rule:

Probability > 0.5

predict:

1

otherwise:

0

Example

Probability:

0.80

Prediction:

Sold

Probability:

0.20

Prediction:

Not Sold
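
The 0.5 rule can be sketched in a couple of lines. The probabilities here are hypothetical, and the threshold is a choice you can change to suit the problem.

```python
import numpy as np

# Hypothetical predicted probabilities of sale
probs = np.array([0.80, 0.20, 0.55, 0.50])

threshold = 0.5
# Strictly greater than the threshold -> 1 (Sold), otherwise 0 (Not Sold)
classes = (probs > threshold).astype(int)
labels = np.where(classes == 1, "Sold", "Not Sold")

print(classes)  # [1 0 1 0]
print(labels)
```

A probability of exactly 0.50 falls on the "otherwise" side of the rule, so it is classified as Not Sold.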

Odds

Logistic Regression models the log of the odds as a linear function of the predictors.

Odds are:

\text{Odds}=\frac{p}{1-p}

Example:

Probability:

0.80

Odds:

0.80 / 0.20 = 4

Meaning:

4 to 1

in favor of sale.
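
Converting between probability and odds is simple arithmetic. A small sketch, with helper names that are our own:

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Inverse conversion: probability from odds."""
    return o / (1.0 + o)

print(round(odds(0.80), 2))           # 4.0
print(round(prob_from_odds(4.0), 2))  # 0.8
print(round(odds(0.50), 2))           # 1.0, an even chance
```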


Odds Ratios

One of the most important concepts.

Coefficient:

β

becomes odds ratio:

e^{\beta}


Example

Suppose:

β = 0.5

Odds ratio:

import numpy as np
np.exp(0.5)

Output:

1.65

Interpretation:

A one-unit increase
multiplies the odds by 1.65
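
Because p = 1 / (1 + e^(−z)), the odds p / (1 − p) equal e^z, so adding one unit to x multiplies the odds by e^β no matter where x starts. The sketch below checks this, using β = 0.5 from the example and a hypothetical intercept of −1.0.

```python
import math

b0 = -1.0   # hypothetical intercept
b1 = 0.5    # coefficient from the example

def odds_at(x):
    # odds = e^z with z = b0 + b1 * x
    return math.exp(b0 + b1 * x)

for x in (0, 3, 10):
    ratio = odds_at(x + 1) / odds_at(x)
    print(x, round(ratio, 2))  # the ratio is 1.65 at every starting x
```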

Healthcare Example

Predict hospital readmission.

Dataset:

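# Toy data. The outcomes overlap (a 55-year-old was readmitted,
# a 70-year-old was not) so the model has a finite solution.
patients = pd.DataFrame({
    "Age":        [40, 55, 70, 80, 90],
    "Readmitted": [0, 1, 0, 1, 1],
})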

Fit model:

X = sm.add_constant(
patients["Age"]
)
y = patients["Readmitted"]
model = sm.Logit(
y,
X
).fit()
print(model.summary())

Question:

Does age increase
readmission risk?

Supply Chain Example

Predict whether inventory sells.

# Toy data with overlapping outcomes, so no single cutoff separates the classes
inventory = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 300, 400],
    "Sold":    [1, 1, 0, 1, 0, 0],
})

Fit model:

X = sm.add_constant(
inventory["DaysOut"]
)
y = inventory["Sold"]
model = sm.Logit(
y,
X
).fit()
print(model.summary())

Question:

How does inventory age
affect probability of sale?

Multiple Logistic Regression

Real-world models usually use multiple predictors.

Example:

  • Inventory Age
  • Price
  • Carat Weight
  • Color Grade
  • Clarity Grade

The model becomes:

\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k
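
Solving this equation for p recovers the sigmoid from earlier, now with several predictors. A short derivation, writing z for the right-hand side:

```latex
z = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k

\log\left(\frac{p}{1-p}\right) = z
\;\Rightarrow\; \frac{p}{1-p} = e^{z}
\;\Rightarrow\; p = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
```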


Example

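# Toy data. The classes overlap on purpose so the fit has a finite solution.
data = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 250, 300, 350, 400],
    "Price":   [1000, 1500, 1200, 2000, 1100, 2500, 1800, 2200],
    "Sold":    [1, 1, 0, 1, 1, 0, 0, 0],
})

# Predictors plus an intercept column
X = sm.add_constant(data[["DaysOut", "Price"]])
y = data["Sold"]

model = sm.Logit(y, X).fit()
print(model.summary())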

Measuring Classification Accuracy

Suppose predictions are:

Actual   Predicted
1        1
1        1
0        0
1        0

Accuracy:

3 correct / 4 total = 75%
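
The same calculation in code, using the four rows from the example:

```python
import numpy as np

actual = np.array([1, 1, 0, 1])
predicted = np.array([1, 1, 0, 0])

# A prediction is correct when it equals the actual label
correct = int((actual == predicted).sum())
accuracy = correct / len(actual)

print(correct, accuracy)  # 3 0.75
```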

Confusion Matrix

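A confusion matrix counts the correct and incorrect predictions for each class.

from sklearn.metrics import confusion_matrix

# Labels from the accuracy example above
y_true = [1, 1, 0, 1]
y_pred = [1, 1, 0, 0]
print(confusion_matrix(y_true, y_pred))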

The output is a 2 × 2 array laid out as:

           Predicted 0   Predicted 1
Actual 0   TN            FP
Actual 1   FN            TP

Precision

Measures:

Of predicted positives,
how many were correct?

Recall

Measures:

Of actual positives,
how many were found?
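
In terms of the confusion matrix cells, precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal sketch that computes both by hand, reusing the four rows from the accuracy example:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # predicted 1, actually 1
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # predicted 1, actually 0
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn)                             # 2 0 1
print(round(precision, 2), round(recall, 2))  # 1.0 0.67
```

Here every predicted sale was a real sale (precision 1.0), but one of the three real sales was missed (recall about 0.67).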

ROC Curve

ROC curves evaluate classification models across different probability thresholds.

from sklearn.metrics import roc_curve

We will study this more deeply in future machine learning lessons.
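
Each point on a ROC curve is a pair (false positive rate, true positive rate) at one threshold. The sketch below computes a few such points by hand on hypothetical labels and scores. It uses the strict "probability > threshold" rule from the Classification section, which may differ at ties from the convention a library uses.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.3, 0.1])

def tpr_fpr(threshold):
    """True positive rate and false positive rate at one threshold."""
    y_pred = (scores > threshold).astype(int)
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    tpr = tp / (y_true == 1).sum()
    fpr = fp / (y_true == 0).sum()
    return float(tpr), float(fpr)

for t in (0.7, 0.5, 0.2):
    tpr, fpr = tpr_fpr(t)
    print(t, round(tpr, 2), round(fpr, 2))
```

Lowering the threshold finds more of the actual positives but also lets in more false positives, which is the trade-off the curve displays.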


Analyst Workflow

When building a logistic regression:

Visualize:

# Here X is a single predictor column, before sm.add_constant is applied
plt.scatter(
X,
y
)

Fit model:

model = sm.Logit(
y,
X
).fit()

Inspect:

print(model.summary())

Predict:

model.predict()

Evaluate:

  • Accuracy
  • Precision
  • Recall
  • ROC-AUC

Healthcare Exercise

Predict:

Readmitted

using:

Age
Length of Stay
Blood Pressure

Questions:

  • Which factors matter most?
  • Which factors increase risk?

Supply Chain Exercise

Predict:

Sold

using:

Days Out
Price
Inventory Level
Customer Turn

Questions:

  • Which variables increase probability of sale?
  • Which variables reduce probability of sale?

Lesson Summary

In this lesson we learned:

  • Why Logistic Regression exists
  • The sigmoid function
  • Probability prediction
  • Odds
  • Odds ratios
  • Logistic coefficients
  • Classification
  • Confusion matrices
  • Accuracy
  • Multiple logistic regression

Logistic Regression is one of the most important models in analytics because many real-world outcomes are binary. It is widely used in healthcare, finance, operations research, customer analytics, and supply chain optimization.

In the next lesson we will study Poisson Regression, the standard model for count data such as hospital visits, claims, transactions, and SKU sales counts.
