Lesson 7: Linear Regression: The Foundation of Predictive Analytics

Introduction

Linear Regression is arguably the most important model in statistics.

Many modern machine learning methods are extensions of ideas that first appeared in regression analysis.

Linear regression helps answer questions such as:

  • How does age affect hospital stay?
  • How does advertising affect sales?
  • How does inventory level affect revenue?
  • How does diamond size affect price?
  • How does customer deployment affect annual sales?

Unlike the statistical tests from the previous lesson, regression allows us to quantify relationships and make predictions.

This lesson introduces:

  • Simple Linear Regression
  • Multiple Linear Regression
  • Model interpretation
  • Coefficients
  • Residuals
  • Model assumptions

By the end of this lesson, you will be able to build and interpret regression models in Python.


The Goal of Regression

Suppose we have:

InventorySales
5080
75110
100140
125170
150210

We observe:

As inventory increases,
sales appear to increase.

Can we quantify this relationship?

Regression helps us answer that question.


The Regression Equation

The simplest regression model is:

$$y=\beta_0+\beta_1 x+\varepsilon$$

where:

  • y = outcome variable
  • x = predictor variable
  • β₀ = intercept
  • β₁ = slope
  • ε = random error

Understanding the Intercept

The intercept represents:

Expected value of y
when x = 0

Example:

Sales = 50 + 2 × Inventory

If Inventory = 0:

Sales = 50

The intercept is 50.


Understanding the Slope

The slope measures change.

Example:

Sales = 50 + 2 × Inventory

Interpretation:

Each additional unit of inventory
increases expected sales by 2 units.

The slope is often the most important part of a regression model.


Creating a Sample Dataset

import pandas as pd
data = pd.DataFrame({
"Inventory":[
50,
75,
100,
125,
150
],
"Sales":[
80,
110,
140,
170,
210
]
})

Visualizing the Relationship

Always visualize before modeling.

import matplotlib.pyplot as plt
plt.scatter(
data["Inventory"],
data["Sales"]
)
plt.xlabel("Inventory")
plt.ylabel("Sales")
plt.title(
"Inventory vs Sales"
)
plt.show()

Question:

Does the relationship appear linear?

If yes, regression may be appropriate.


Fitting a Linear Regression Model

We will use Statsmodels.

import statsmodels.api as sm

Define predictors:

X = data["Inventory"]

Add intercept:

X = sm.add_constant(X)

Define outcome:

y = data["Sales"]

Fit model:

model = sm.OLS(
y,
X
).fit()

View results:

print(model.summary())

Understanding the Output

A regression summary contains many statistics.

Focus on:

coef
P>|t|
R-squared

These are the most important initially.


Interpreting Coefficients

Suppose output shows:

VariableCoefficient
Intercept18
Inventory1.28

Model:

Sales = 18 + 1.28 × Inventory

Interpretation:

Every additional inventory unit
is associated with
1.28 additional sales units.

Making Predictions

Suppose inventory equals 200.

Prediction:

pred = model.predict(
[[1,200]]
)
print(pred)

The first value:

1

represents the intercept column.


Understanding R-Squared

R² measures:

How much variability
the model explains.

Range:

0 to 1

or

0% to 100%

Example:

R² = 0.80

Interpretation:

80% of variation in sales
is explained by inventory.

What is a Good R²?

There is no universal answer.

Healthcare:

0.20 to 0.40

can be useful.

Engineering:

0.90+

may be expected.

Business:

0.50+

is often considered strong.

Context matters.


Understanding Residuals

Residual:

Observed − Predicted

Example:

Observed sales:

150

Predicted sales:

140

Residual:

10

Residuals represent prediction errors.


Plotting Residuals

residuals = model.resid
plt.scatter(
model.fittedvalues,
residuals
)
plt.axhline(
0,
linestyle="--"
)
plt.xlabel(
"Predicted Values"
)
plt.ylabel(
"Residuals"
)
plt.show()

A random cloud is desirable.

Patterns suggest model problems.


Healthcare Example

Suppose we want to predict hospital stay.

patients = pd.DataFrame({
"Age":[
25,
40,
55,
70,
85
],
"LengthOfStay":[
2,
4,
6,
9,
11
]
})

Fit model:

X = sm.add_constant(
patients["Age"]
)
y = patients["LengthOfStay"]
model = sm.OLS(
y,
X
).fit()
print(model.summary())

Question:

Does age predict
length of stay?

Supply Chain Example

Suppose we want to predict sales.

inventory = pd.DataFrame({
"Inventory":[
100,
150,
200,
250,
300
],
"Sales":[
120,
180,
210,
260,
330
]
})

Fit regression:

X = sm.add_constant(
inventory["Inventory"]
)
y = inventory["Sales"]
model = sm.OLS(
y,
X
).fit()
print(model.summary())

Question:

How strongly does
inventory drive sales?

Multiple Linear Regression

Real-world problems usually involve multiple predictors.

The model becomes:

$$y=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p+\varepsilon$$

Example:

Sales predicted from:
Inventory
Price
Marketing Spend
Customer Count

Example

data = pd.DataFrame({
"Inventory":[100,150,200,250,300],
"Price":[10,12,11,14,13],
"Sales":[120,180,210,260,330]
})

Fit model:

X = data[
["Inventory","Price"]
]
X = sm.add_constant(X)
y = data["Sales"]
model = sm.OLS(
y,
X
).fit()
print(model.summary())

Assumptions of Linear Regression

Regression assumes:

1. Linearity

Relationship should be approximately linear.

2. Independence

Observations should be independent.

3. Constant Variance

Residual variance should be stable.

4. Normal Residuals

Residuals should be approximately normal.

5. No Extreme Multicollinearity

Predictors should not be excessively correlated.


Analyst Workflow

When building a regression model:

df.describe()

Visualize:

plt.scatter(
df["Inventory"],
df["Sales"]
)

Fit model:

model = sm.OLS(
y,
X
).fit()

Inspect:

print(
model.summary()
)

Evaluate:

  • Coefficients
  • p-values
  • Residual plots

Practical Healthcare Exercise

Predict:

Length of Stay

using:

Age
Blood Pressure
Heart Rate

Questions:

  • Which variables matter most?
  • How much variation is explained?

Practical Supply Chain Exercise

Predict:

Annual Sales

using:

Inventory
Deployment
Price
Marketing Spend

Questions:

  • Which factors drive sales?
  • Which factors are insignificant?

Lesson Summary

In this lesson we learned:

  • What linear regression is
  • Intercepts and slopes
  • Model fitting
  • Predictions
  • Residuals
  • Multiple regression
  • Model assumptions

Linear regression is the foundation of predictive analytics because it transforms relationships into quantitative models that can be interpreted, tested, and used for prediction.

In the next lesson we will study Logistic Regression, which is used when the outcome is binary such as:

  • Sold vs Not Sold
  • Readmitted vs Not Readmitted
  • Churned vs Not Churned
  • Disease vs No Disease

Leave a Reply

Discover more from nerd-ish

Subscribe now to keep reading and get access to the full archive.

Continue reading