Introduction

Linear regression is arguably the most important model in statistics.

Many modern machine learning methods are extensions of ideas that first appeared in regression analysis.

Linear regression helps answer questions such as:

How does age affect hospital stay?
How does advertising affect sales?
How does inventory level affect revenue?
How does diamond size affect price?
How does customer deployment affect annual sales?

Unlike the statistical tests from the previous lesson, regression allows us to quantify relationships and make predictions.

This lesson introduces:

Simple Linear Regression
Multiple Linear Regression
Model interpretation
Coefficients
R²
Residuals
Model assumptions

By the end of this lesson, you will be able to build and interpret regression models in Python.

The Goal of Regression

Suppose we have:

Inventory	Sales
50	80
75	110
100	140
125	170
150	210

We observe:

			
As inventory increases,
sales appear to increase.

Can we quantify this relationship?

Regression helps us answer that question.

The Regression Equation

The simplest regression model is:

$$y=\beta_0+\beta_1 x+\varepsilon$$

where:

y = outcome variable
x = predictor variable
β₀ = intercept
β₁ = slope
ε = random error
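A minimal sketch of how β₀ and β₁ can be estimated from data by ordinary least squares, written in plain Python with the inventory and sales values from the table above. The helper name `fit_simple_ols` is hypothetical and exists only for this illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: chosen so the line passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


inventory = [50, 75, 100, 125, 150]
sales = [80, 110, 140, 170, 210]

b0, b1 = fit_simple_ols(inventory, sales)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

For these five points the sketch gives a slope of 1.28 and an intercept of 14. The error term ε is whatever the line leaves unexplained for each observation.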

Understanding the Intercept

The intercept represents:

			
Expected value of y
when x = 0

Example:

Sales = 50 + 2 × Inventory

If Inventory = 0:

Sales = 50

The intercept is 50.

Understanding the Slope

The slope measures how much the expected value of y changes when x increases by one unit.

Example:

Sales = 50 + 2 × Inventory

Interpretation:

			
Each additional unit of inventory
increases expected sales by 2 units.

The slope is often the most important part of a regression model.
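To make the two interpretations concrete, here is a small sketch that evaluates the example equation Sales = 50 + 2 × Inventory at a few inventory levels. The function name `expected_sales` is a hypothetical helper, not part of any library.

```python
def expected_sales(inventory, intercept=50, slope=2):
    """Evaluate the example equation Sales = 50 + 2 × Inventory."""
    return intercept + slope * inventory


print(expected_sales(0))    # 50: the intercept, expected sales when Inventory = 0
print(expected_sales(10))   # 70
print(expected_sales(11))   # 72: one more unit of inventory adds the slope
print(expected_sales(11) - expected_sales(10))  # 2
```

The difference between any two neighbouring inventory levels is always the slope, which is what "linear" means here.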

Creating a Sample Dataset

			
import pandas as pd
data = pd.DataFrame({
    "Inventory":[
        50,
        75,
        100,
        125,
        150
    ],
    "Sales":[
        80,
        110,
        140,
        170,
        210
    ]
})

		

Visualizing the Relationship

Always visualize before modeling.

			
import matplotlib.pyplot as plt
plt.scatter(
    data["Inventory"],
    data["Sales"]
)
plt.xlabel("Inventory")
plt.ylabel("Sales")
plt.title(
    "Inventory vs Sales"
)
plt.show()

		

Question:

Does the relationship appear linear?

If yes, regression may be appropriate.
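A numeric companion to the scatter plot, sketched under the assumption that NumPy is available: the Pearson correlation summarizes how close the points are to a straight line. It complements the plot rather than replacing it, because a gently curved relationship can still produce a high correlation.

```python
import numpy as np

inventory = np.array([50, 75, 100, 125, 150])
sales = np.array([80, 110, 140, 170, 210])

# Pearson correlation: values near +1 or -1 indicate a tight linear trend,
# values near 0 indicate little or no linear trend
r = np.corrcoef(inventory, sales)[0, 1]
print(f"correlation = {r:.3f}")
```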

Fitting a Linear Regression Model

We will use Statsmodels.

import statsmodels.api as sm

Define predictors:

X = data["Inventory"]

Add intercept:

X = sm.add_constant(X)

Define outcome:

y = data["Sales"]

Fit model:

			
model = sm.OLS(
    y,
    X
).fit()

View results:

print(model.summary())

Understanding the Output

A regression summary contains many statistics.

Focus on:

coef

P>|t|

R-squared

These are the most important initially.

Interpreting Coefficients

Suppose output shows:

Variable	Coefficient
Intercept	14
Inventory	1.28

Model:

Sales = 14 + 1.28 × Inventory

Interpretation:

			
Every additional inventory unit
is associated with
1.28 additional sales units.

Making Predictions

Suppose inventory equals 200.

Prediction:

			
pred = model.predict(
    [[1,200]]
)
print(pred)

The first value, 1, corresponds to the intercept (constant) column. The second value, 200, is the inventory level.

Understanding R-Squared

R² measures:

			
The proportion of variability in the outcome
that the model explains.

Range:

0 to 1

0% to 100%

Example:

R² = 0.80

Interpretation:

			
80% of variation in sales
is explained by inventory.
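R² can be computed by hand, which helps show what "variation explained" means. The sketch below uses the sample inventory data and assumes NumPy. `np.polyfit` is used only as a convenient way to get the least-squares line.

```python
import numpy as np

inventory = np.array([50, 75, 100, 125, 150], dtype=float)
sales = np.array([80, 110, 140, 170, 210], dtype=float)

# np.polyfit returns the highest power first: [slope, intercept]
slope, intercept = np.polyfit(inventory, sales, 1)
predicted = intercept + slope * inventory

ss_res = np.sum((sales - predicted) ** 2)     # variation left unexplained
ss_tot = np.sum((sales - sales.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(f"R² = {r_squared:.4f}")
```

If the model predicted no better than the mean of sales, `ss_res` would equal `ss_tot` and R² would be 0.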

What is a Good R²?

There is no universal answer.

Healthcare:

0.20 to 0.40

can be useful.

Engineering:

0.90+

may be expected.

Business:

0.50+

is often considered strong.

Context matters.

Understanding Residuals

Residual:

Observed − Predicted

Example (illustrative values):

Observed sales: 80

Predicted sales: 78

Residual: 80 − 78 = 2

Residuals represent prediction errors.
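A short sketch that computes the residual for every row of the sample inventory data, assuming NumPy. The line is fitted with `np.polyfit` purely for brevity.

```python
import numpy as np

inventory = np.array([50, 75, 100, 125, 150], dtype=float)
sales = np.array([80, 110, 140, 170, 210], dtype=float)

# Least-squares line: np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(inventory, sales, 1)
predicted = intercept + slope * inventory
residuals = sales - predicted  # observed minus predicted

for obs, pred, res in zip(sales, predicted, residuals):
    print(f"observed={obs:6.1f}  predicted={pred:6.1f}  residual={res:+5.1f}")

# When the model includes an intercept, least-squares residuals sum to
# zero up to floating-point rounding
print(f"sum of residuals = {residuals.sum():.6f}")
```

Positive residuals mean the model predicted too low. Negative residuals mean it predicted too high.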

Plotting Residuals

			
residuals = model.resid
plt.scatter(
    model.fittedvalues,
    residuals
)
plt.axhline(
    0,
    linestyle="--"
)
plt.xlabel(
    "Predicted Values"
)
plt.ylabel(
    "Residuals"
)
plt.show()

		

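A random cloud of points scattered around zero is desirable.

Patterns such as curves or funnel shapes suggest model problems, for example nonlinearity or non-constant variance.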

Healthcare Example

Suppose we want to predict hospital stay.

			
patients = pd.DataFrame({
    "Age":[
        25,
        40,
        55,
        70,
        85
    ],
    "LengthOfStay":[
        2,
        4,
        6,
        9,
        11
    ]
})

		

Fit model:

			
X = sm.add_constant(
    patients["Age"]
)
y = patients["LengthOfStay"]
model = sm.OLS(
    y,
    X
).fit()
print(model.summary())

		

Question:

			
Does age predict
length of stay?

Supply Chain Example

Suppose we want to predict sales.

			
inventory = pd.DataFrame({
    "Inventory":[
        100,
        150,
        200,
        250,
        300
    ],
    "Sales":[
        120,
        180,
        210,
        260,
        330
    ]
})

		

Fit regression:

			
X = sm.add_constant(
    inventory["Inventory"]
)
y = inventory["Sales"]
model = sm.OLS(
    y,
    X
).fit()
print(model.summary())

		

Question:

			
How strongly does
inventory drive sales?

Multiple Linear Regression

Real-world problems usually involve multiple predictors.

The model becomes:

$$y=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p+\varepsilon$$

Example:

			
Sales predicted from:
Inventory
Price
Marketing Spend
Customer Count

		

Example

			
data = pd.DataFrame({
    "Inventory":[100,150,200,250,300],
    "Price":[10,12,11,14,13],
    "Sales":[120,180,210,260,330]
})

		

Fit model:

			
X = data[
    ["Inventory","Price"]
]
X = sm.add_constant(X)
y = data["Sales"]
model = sm.OLS(
    y,
    X
).fit()
print(model.summary())
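To show what happens underneath, here is a sketch of the same two-predictor fit using plain NumPy least squares. The design matrix has a column of ones for the intercept plus one column per predictor. This is an illustration under the assumption that NumPy is available, not a replacement for the statsmodels summary.

```python
import numpy as np

inventory = np.array([100, 150, 200, 250, 300], dtype=float)
price = np.array([10, 12, 11, 14, 13], dtype=float)
sales = np.array([120, 180, 210, 260, 330], dtype=float)

# Design matrix: intercept column, then one column per predictor
X = np.column_stack([np.ones_like(inventory), inventory, price])

# Least-squares solution of X @ beta ≈ sales
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b_inventory, b_price = beta
print(f"intercept={b0:.3f}, inventory={b_inventory:.3f}, price={b_price:.3f}")
```

Each coefficient is read as the change in expected sales per unit of that predictor while the other predictor is held fixed. With five rows and three estimated parameters, only two residual degrees of freedom remain, so the estimates from this toy dataset are very imprecise.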

		

Assumptions of Linear Regression

Regression assumes:

1. Linearity

Relationship should be approximately linear.

2. Independence

Observations should be independent.

3. Constant Variance

Residual variance should be roughly constant across the range of predicted values.

4. Normal Residuals

Residuals should be approximately normal.

5. No Extreme Multicollinearity

Predictors should not be excessively correlated.

Analyst Workflow

When building a regression model:

Explore:

df.describe()

Visualize:

			
plt.scatter(
    df["Inventory"],
    df["Sales"]
)

Fit model:

			
model = sm.OLS(
    y,
    X
).fit()

Inspect:

			
print(
    model.summary()
)

Evaluate:

Coefficients
p-values
R²
Residual plots

Practical Healthcare Exercise

Predict:

Length of Stay

using:

			
Age
Blood Pressure
Heart Rate

Questions:

Which variables matter most?
How much variation is explained?

Practical Supply Chain Exercise

Predict:

Annual Sales

using:

			
Inventory
Deployment
Price
Marketing Spend

Questions:

Which factors drive sales?
Which factors are insignificant?

Lesson Summary

In this lesson we learned:

What linear regression is
Intercepts and slopes
Model fitting
Predictions
R²
Residuals
Multiple regression
Model assumptions

Linear regression is the foundation of predictive analytics because it transforms relationships into quantitative models that can be interpreted, tested, and used for prediction.

In the next lesson we will study logistic regression, which is used when the outcome is binary, such as:

Sold vs Not Sold
Readmitted vs Not Readmitted
Churned vs Not Churned
Disease vs No Disease


Lesson 7: Linear Regression: The Foundation of Predictive Analytics
