Introduction
In the previous lesson, we studied Linear Regression.
Linear Regression works well when the outcome is continuous.
Examples:
- Sales Revenue
- Length of Stay
- Blood Pressure
- Diamond Price
However, many real-world problems involve outcomes with only two possibilities.
Examples:
Healthcare
- Readmitted or Not Readmitted
- Disease or No Disease
- Survived or Died
Supply Chain
- Sold or Not Sold
- Customer Churned or Stayed
- Returned or Not Returned
Marketing
- Purchased or Did Not Purchase
- Clicked or Did Not Click
Linear regression is not appropriate for these situations because its predictions are unbounded, while probabilities must lie between 0 and 1.
Logistic Regression solves this problem.
It predicts probabilities.
The Goal of Logistic Regression
Suppose we want to predict:
Will a deployed diamond sell?
Outcome:
| Sold |
|---|
| 1 |
| 0 |
| 1 |
| 1 |
| 0 |
where:
1 = Sold
0 = Not Sold
Instead of predicting:
0 or 1 directly
Logistic Regression predicts:
Probability of Sale
For example:
0.82
meaning:
82% chance of sale
Why Linear Regression Fails
Suppose we fit:
Sold = β₀ + β₁ × Age
Predictions could become:
1.4
or
−0.3
These are impossible probabilities.
Probabilities must satisfy:
0 ≤ p ≤ 1
Logistic Regression guarantees this.
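The out-of-range problem is easy to reproduce. The sketch below fits an ordinary least-squares line by hand to a small 0/1 outcome. The `days_out` and `sold` values mirror the toy diamond data used later in this lesson, and the variable and function names are illustrative assumptions.

```python
# Fit a straight line to a 0/1 outcome with ordinary least squares,
# then check whether its predictions stay inside [0, 1].
days_out = [50, 100, 150, 200, 300, 400]
sold = [1, 1, 1, 0, 0, 0]

n = len(days_out)
mean_x = sum(days_out) / n
mean_y = sum(sold) / n

# Closed-form simple linear regression: slope = Sxy / Sxx
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days_out, sold))
s_xx = sum((x - mean_x) ** 2 for x in days_out)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def linear_prediction(x):
    """Predicted 'probability' from the straight-line fit."""
    return intercept + slope * x

print(round(linear_prediction(0), 2))    # greater than 1
print(round(linear_prediction(400), 2))  # less than 0
```

Both printed values fall outside the valid range for a probability, which is the failure described above.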
The Logistic Function
Logistic Regression uses the sigmoid curve.
p=\frac{1}{1+e^{-z}}
where:
z = β₀ + β₁x
Properties:
- Always between 0 and 1
- Smooth
- Interpretable as probability
Understanding the Sigmoid Curve
Suppose:
z = -10
Probability becomes:
Almost 0
Suppose:
z = 0
Probability becomes:
0.5
Suppose:
z = 10
Probability becomes:
Almost 1
This allows us to model probabilities naturally.
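The three cases above can be checked directly. This is a minimal sketch using only the standard library; the function name `sigmoid` is an illustrative choice.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the curve at the three values discussed above.
for z in (-10, 0, 10):
    print(z, sigmoid(z))
```

The printed values are close to 0, exactly 0.5, and close to 1. For very large negative `z`, `math.exp(-z)` can overflow, so production code typically uses a numerically stable variant.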
Example Dataset
Suppose we want to predict whether a deployed diamond sells.
import pandas as pd

data = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 300, 400],
    "Sold": [1, 1, 1, 0, 0, 0],
})
Visualizing the Data
import matplotlib.pyplot as plt

plt.scatter(data["DaysOut"], data["Sold"])
plt.xlabel("Days Out")
plt.ylabel("Sold")
plt.show()
Notice:
Sold
is binary.
Fitting a Logistic Regression Model
We will use Statsmodels.
import statsmodels.api as sm
Define predictors:
X = data["DaysOut"]
Add intercept:
X = sm.add_constant(X)
Define outcome:
y = data["Sold"]
Fit model:
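model = sm.Logit(y, X).fit()
print(model.summary())
Note: in this toy dataset every diamond out 150 days or fewer sold and every diamond out 200 days or more did not, so the two classes are perfectly separated. In that case the maximum-likelihood estimates are not finite, and the fit typically stops with a perfect-separation error or warning (depending on the statsmodels version) or returns very large, unreliable coefficients. Real data with overlapping outcomes does not have this problem.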
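To give a feel for what fitting by maximum likelihood involves, here is a from-scratch sketch for a single predictor. It uses Newton-Raphson as an illustrative choice and is not statsmodels' actual implementation. The data are hypothetical and deliberately overlapping (not perfectly separated) so that finite estimates exist; the function and variable names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(x, y, iterations=25):
    """Estimate (intercept, slope) of a one-predictor logistic model
    by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix (negative Hessian)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Hypothetical data: sales become less likely as days out grows,
# but the outcomes overlap.
x = [50, 100, 150, 200, 250, 300, 350, 400]
y = [1, 1, 0, 1, 0, 1, 0, 0]

b0, b1 = fit_logistic_newton(x, y)
print(b0, b1)  # the slope b1 is negative for this data
```

At the solution, the predicted probabilities sum to the number of observed sales, which is one of the conditions the maximum-likelihood estimates satisfy.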
Understanding the Output
The summary includes:
- Coefficients
- Standard errors
- z-statistics
- p-values
Initially focus on:
coef
and
P>|z|
Interpreting Coefficients
Suppose the coefficient for DaysOut is:
−0.03
Interpretation:
As DaysOut increases, the probability of sale decreases. Each additional day lowers the log-odds of sale by 0.03.
Negative coefficient:
Probability decreases.
Positive coefficient:
Probability increases.
Making Predictions
Suppose we want the probability of sale after:
180 days
Predict:
# The row is [constant, DaysOut]: 1 for the intercept, 180 for the days out
prediction = model.predict([[1, 180]])
print(prediction)
Illustrative output:
0.62
Interpretation:
62% probability of sale
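The same prediction can be computed by hand from the coefficients. In this sketch the slope −0.03 is the example value from the coefficient discussion above, while the intercept 5.9 is a made-up value chosen purely for illustration; neither is output from a real fit.

```python
import math

# Hypothetical fitted coefficients (assumptions for illustration only).
intercept = 5.9
slope_days_out = -0.03

days_out = 180
z = intercept + slope_days_out * days_out  # linear predictor (log-odds)
probability = 1.0 / (1.0 + math.exp(-z))   # sigmoid converts log-odds to probability

print(round(probability, 2))  # 0.62
```

This is what `model.predict` does conceptually: build the linear predictor, then pass it through the sigmoid.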
Classification
Probabilities can be converted into classes.
Common rule:
Probability > 0.5
predict:
1
otherwise:
0
Example
Probability:
0.80
Prediction:
Sold
Probability:
0.20
Prediction:
Not Sold
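The threshold rule is a one-line function. This sketch assumes the 0.5 cutoff described above; the function name `classify` is illustrative.

```python
def classify(probability, threshold=0.5):
    """Return 1 (Sold) if the probability exceeds the threshold, else 0 (Not Sold)."""
    return 1 if probability > threshold else 0

print(classify(0.80))  # 1 -> Sold
print(classify(0.20))  # 0 -> Not Sold
```

The 0.5 cutoff is a convention. When false positives and false negatives have different costs, a different threshold may be more appropriate.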
Odds
Logistic Regression models the log of the odds as a linear function of the predictors.
Odds are:
\text{Odds}=\frac{p}{1-p}
Example:
Probability:
0.80
Odds:
0.80 / 0.20 = 4
Meaning:
4 to 1
in favor of sale.
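Converting between probability and odds takes one line in each direction. A minimal sketch, with illustrative function names:

```python
def odds(probability):
    """Convert a probability p (0 <= p < 1) to odds p / (1 - p)."""
    return probability / (1.0 - probability)

def probability_from_odds(odds_value):
    """Invert the conversion: p = odds / (1 + odds)."""
    return odds_value / (1.0 + odds_value)

print(round(odds(0.80), 2))                # 4.0
print(round(probability_from_odds(4), 2))  # 0.8
```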
Odds Ratios
One of the most important concepts.
Coefficient:
β
becomes odds ratio:
e^{\beta}
Example
Suppose:
β = 0.5
Odds ratio:
import numpy as np

np.exp(0.5)
Output:
1.65
Interpretation:
A one-unit increase in the predictor multiplies the odds by 1.65
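To see why the coefficient turns into a multiplier, compare the model's odds at two values of x that differ by one unit. The intercept −1.0 here is a hypothetical value; the slope 0.5 is the example value above.

```python
import math

beta0, beta1 = -1.0, 0.5  # hypothetical intercept; slope from the example

def odds_at(x):
    """Odds implied by the model: exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)

ratio = odds_at(3) / odds_at(2)   # odds after a one-unit increase vs. before
print(round(ratio, 2))            # 1.65
print(round(math.exp(beta1), 2))  # 1.65
```

The ratio is the same for any starting x, because the intercept and the shared part of the linear predictor cancel, leaving only e^β.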
Healthcare Example
Predict hospital readmission.
Dataset:
patients = pd.DataFrame({
    "Age": [40, 55, 70, 80, 90],
    "Readmitted": [0, 0, 1, 1, 1],
})
Fit model:
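# Note: Age separates the two classes perfectly in this toy data,
# so the fit may warn, fail, or return very large coefficients.
X = sm.add_constant(patients["Age"])
y = patients["Readmitted"]
model = sm.Logit(y, X).fit()
print(model.summary())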
Question:
Does age increase readmission risk?
Supply Chain Example
Predict whether inventory sells.
inventory = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 300, 400],
    "Sold": [1, 1, 1, 0, 0, 0],
})
Fit model:
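# Note: DaysOut separates the two classes perfectly in this toy data,
# so the fit may warn, fail, or return very large coefficients.
X = sm.add_constant(inventory["DaysOut"])
y = inventory["Sold"]
model = sm.Logit(y, X).fit()
print(model.summary())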
Question:
How does inventory age affect the probability of sale?
Multiple Logistic Regression
Real-world models usually use multiple predictors.
Example:
- Inventory Age
- Price
- Carat Weight
- Color Grade
- Clarity Grade
The model becomes:
\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k
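Read from right to left, the equation says: add up the weighted predictors to get the log-odds, then convert to a probability with the sigmoid. The sketch below does this for two predictors. All coefficient values are hypothetical and chosen only to illustrate the formula.

```python
import math

# Hypothetical coefficients (not the result of a real fit).
intercept = 4.0
coefficients = {"DaysOut": -0.02, "Price": -0.001}

def predict_probability(features):
    """Sum the linear predictor (log-odds), then apply the sigmoid."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

p = predict_probability({"DaysOut": 100, "Price": 1500})
print(round(p, 2))  # 0.62
```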
Example
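data = pd.DataFrame({
    "DaysOut": [50, 100, 150, 200, 300],
    "Price": [1000, 1500, 1200, 2000, 2500],
    "Sold": [1, 1, 1, 0, 0],
})

# Note: DaysOut separates the two classes perfectly in this toy data,
# so the fit may warn, fail, or return very large coefficients.
X = data[["DaysOut", "Price"]]
X = sm.add_constant(X)
y = data["Sold"]
model = sm.Logit(y, X).fit()
print(model.summary())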
Measuring Classification Accuracy
Suppose predictions are:
| Actual | Predicted |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 0 | 0 |
| 1 | 0 |
Accuracy:
3 correct / 4 total
75%
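The same calculation in code, using the four rows from the table above:

```python
actual = [1, 1, 0, 1]
predicted = [1, 1, 0, 0]

# Count the rows where the prediction matches the actual label.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

print(accuracy)  # 0.75
```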
Confusion Matrix
Very important for classification.
from sklearn.metrics import confusion_matrix

# y_true: actual labels, y_pred: predicted labels (both coded 0/1)
confusion_matrix(y_true, y_pred)
Output:
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN | FP |
| Actual 1 | FN | TP |
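The four cells can also be counted by hand, which makes their meaning concrete. This sketch reuses the four-row example from the accuracy section; the helper name `confusion_counts` is an illustrative assumption, not a library function.

```python
actual = [1, 1, 0, 1]
predicted = [1, 1, 0, 0]

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels coded 0/1."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 0 and guess == 0:
            tn += 1   # correctly predicted negative
        elif truth == 0 and guess == 1:
            fp += 1   # predicted positive, actually negative
        elif truth == 1 and guess == 0:
            fn += 1   # predicted negative, actually positive
        else:
            tp += 1   # correctly predicted positive
    return tn, fp, fn, tp

print(confusion_counts(actual, predicted))  # (1, 0, 1, 2)
```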
Precision
Measures:
Of predicted positives, how many were correct?
Precision = TP / (TP + FP)
Recall
Measures:
Of actual positives, how many were found?
Recall = TP / (TP + FN)
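Both measures follow directly from the confusion-matrix counts. The counts below (TP = 2, FP = 0, FN = 1) come from the four-row accuracy example earlier in the lesson. Returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def precision(tp, fp):
    """Share of predicted positives that were truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

print(precision(2, 0))         # 1.0
print(round(recall(2, 1), 2))  # 0.67
```

In that example every predicted sale was correct (precision 1.0), but one actual sale was missed (recall about 0.67).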
ROC Curve
ROC curves evaluate classification models across different probability thresholds.
from sklearn.metrics import roc_curve
We will study this more deeply in future machine learning lessons.
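As a preview, each point on a ROC curve is a (false positive rate, true positive rate) pair computed at one threshold. The sketch below computes a few such points by hand. The labels and scores are hypothetical, the helper name `roc_point` is an assumption, and a score equal to the threshold is counted as a positive prediction.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

for threshold in (0.8, 0.5, 0.2):
    print(threshold, roc_point(y_true, scores, threshold))
```

Lowering the threshold finds more true positives but also admits more false positives, which is the trade-off the curve displays.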
Analyst Workflow
When building a logistic regression:
Visualize:
plt.scatter(X, y)
Fit model:
model = sm.Logit(y, X).fit()
Inspect:
print(model.summary())
Predict:
model.predict()
Evaluate:
- Accuracy
- Precision
- Recall
- ROC-AUC
Healthcare Exercise
Predict:
Readmitted
using:
- Age
- Length of Stay
- Blood Pressure
Questions:
- Which factors matter most?
- Which factors increase risk?
Supply Chain Exercise
Predict:
Sold
using:
- Days Out
- Price
- Inventory Level
- Customer Turn
Questions:
- Which variables increase probability of sale?
- Which variables reduce probability of sale?
Lesson Summary
In this lesson we learned:
- Why Logistic Regression exists
- The sigmoid function
- Probability prediction
- Odds
- Odds ratios
- Logistic coefficients
- Classification
- Confusion matrices
- Accuracy
- Multiple logistic regression
Logistic Regression is one of the most important models in analytics because many real-world outcomes are binary. It is widely used in healthcare, finance, operations research, customer analytics, and supply chain optimization.
In the next lesson we will study Poisson Regression, the standard model for count data such as hospital visits, claims, transactions, and SKU sales counts.
