Introduction
Linear Regression is arguably the most important model in statistics.
Many modern machine learning methods are extensions of ideas that first appeared in regression analysis.
Linear regression helps answer questions such as:
- How does age affect hospital stay?
- How does advertising affect sales?
- How does inventory level affect revenue?
- How does diamond size affect price?
- How does customer deployment affect annual sales?
Unlike the statistical tests from the previous lesson, regression allows us to quantify relationships and make predictions.
This lesson introduces:
- Simple Linear Regression
- Multiple Linear Regression
- Model interpretation
- Coefficients
- R²
- Residuals
- Model assumptions
By the end of this lesson, you will be able to build and interpret regression models in Python.
The Goal of Regression
Suppose we have:
| Inventory | Sales |
|---|---|
| 50 | 80 |
| 75 | 110 |
| 100 | 140 |
| 125 | 170 |
| 150 | 210 |
We observe:
As inventory increases, sales appear to increase.
Can we quantify this relationship?
Regression helps us answer that question.
The Regression Equation
The simplest regression model is:
$$y=\beta_0+\beta_1 x+\varepsilon$$
where:
- y = outcome variable
- x = predictor variable
- β₀ = intercept
- β₁ = slope
- ε = random error
Understanding the Intercept
The intercept represents:
Expected value of y when x = 0
Example:
Sales = 50 + 2 × Inventory
If Inventory = 0:
Sales = 50
The intercept is 50.
Understanding the Slope
The slope measures change.
Example:
Sales = 50 + 2 × Inventory
Interpretation:
Each additional unit of inventory increases expected sales by 2 units.
The slope is often the most important part of a regression model.
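As a small illustration of the intercept and slope, the example equation Sales = 50 + 2 × Inventory can be written as a Python function. The function name `expected_sales` is made up for this sketch and is not part of any library.

```python
def expected_sales(inventory, intercept=50, slope=2):
    """Evaluate the example line: Sales = intercept + slope * Inventory."""
    return intercept + slope * inventory


# At Inventory = 0 the prediction equals the intercept.
print(expected_sales(0))

# Raising inventory by one unit changes the prediction by the slope.
print(expected_sales(101) - expected_sales(100))
```

The second value is the same whichever starting inventory is chosen, which is what makes the relationship linear.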
Creating a Sample Dataset
import pandas as pd

data = pd.DataFrame({
    "Inventory": [50, 75, 100, 125, 150],
    "Sales": [80, 110, 140, 170, 210],
})
Visualizing the Relationship
Always visualize before modeling.
import matplotlib.pyplot as plt

plt.scatter(data["Inventory"], data["Sales"])
plt.xlabel("Inventory")
plt.ylabel("Sales")
plt.title("Inventory vs Sales")
plt.show()
Question:
Does the relationship appear linear?
If yes, regression may be appropriate.
Fitting a Linear Regression Model
We will use Statsmodels.
import statsmodels.api as sm
Define predictors:
X = data["Inventory"]
Add intercept:
X = sm.add_constant(X)
Define outcome:
y = data["Sales"]
Fit model:
model = sm.OLS(y, X).fit()
View results:
print(model.summary())
Understanding the Output
A regression summary contains many statistics.
Focus on:
coef
P>|t|
R-squared
These are the most important initially.
Interpreting Coefficients
For the sample dataset, the output shows:
| Variable | Coefficient |
|---|---|
| Intercept | 14 |
| Inventory | 1.28 |
Model:
Sales = 14 + 1.28 × Inventory
Interpretation:
Every additional inventory unit is associated with 1.28 additional sales units.
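For a single predictor, the coefficients can be reproduced by hand with the standard least-squares formulas: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. This is a minimal check using NumPy on the sample dataset, not a replacement for the statsmodels fit.

```python
import numpy as np

inventory = np.array([50, 75, 100, 125, 150], dtype=float)
sales = np.array([80, 110, 140, 170, 210], dtype=float)

# Deviations of each observation from its column mean
x_dev = inventory - inventory.mean()
y_dev = sales - sales.mean()

# Least-squares estimates for a single predictor
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept = sales.mean() - slope * inventory.mean()

print(f"Sales = {intercept:.2f} + {slope:.2f} × Inventory")
```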
Making Predictions
Suppose inventory equals 200.
Prediction:
pred = model.predict([[1, 200]])
print(pred)
The first value:
1
represents the intercept column.
Understanding R-Squared
R² measures:
How much variability the model explains.
Range:
0 to 1
or
0% to 100%
Example:
R² = 0.80
Interpretation:
80% of variation in sales is explained by inventory.
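To make the definition concrete, R² can be computed by hand as one minus the ratio of unexplained variation (residual sum of squares) to total variation (total sum of squares). The sketch below uses `np.polyfit` for the line fit on the sample dataset; it is a cross-check under that assumption, and statsmodels reports the same quantity as `model.rsquared`.

```python
import numpy as np

inventory = np.array([50, 75, 100, 125, 150], dtype=float)
sales = np.array([80, 110, 140, 170, 210], dtype=float)

# np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(inventory, sales, deg=1)
predicted = intercept + slope * inventory

ss_res = ((sales - predicted) ** 2).sum()     # variation the line misses
ss_tot = ((sales - sales.mean()) ** 2).sum()  # total variation in sales

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```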
What is a Good R²?
There is no universal answer.
- Healthcare: 0.20 to 0.40 can be useful.
- Engineering: 0.90+ may be expected.
- Business: 0.50+ is often considered strong.
Context matters.
Understanding Residuals
Residual:
Observed − Predicted
Example:
Observed sales:
150
Predicted sales:
140
Residual:
10
Residuals represent prediction errors.
Plotting Residuals
residuals = model.resid

plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()
A random cloud is desirable.
Patterns suggest model problems.
Healthcare Example
Suppose we want to predict hospital stay.
patients = pd.DataFrame({
    "Age": [25, 40, 55, 70, 85],
    "LengthOfStay": [2, 4, 6, 9, 11],
})
Fit model:
X = sm.add_constant(patients["Age"])
y = patients["LengthOfStay"]

model = sm.OLS(y, X).fit()
print(model.summary())
Question:
Does age predict length of stay?
Supply Chain Example
Suppose we want to predict sales.
inventory = pd.DataFrame({
    "Inventory": [100, 150, 200, 250, 300],
    "Sales": [120, 180, 210, 260, 330],
})
Fit regression:
X = sm.add_constant(inventory["Inventory"])
y = inventory["Sales"]

model = sm.OLS(y, X).fit()
print(model.summary())
Question:
How strongly does inventory drive sales?
Multiple Linear Regression
Real-world problems usually involve multiple predictors.
The model becomes:
$$y=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p+\varepsilon$$
Example:
Sales predicted from:
- Inventory
- Price
- Marketing Spend
- Customer Count
Example
data = pd.DataFrame({
    "Inventory": [100, 150, 200, 250, 300],
    "Price": [10, 12, 11, 14, 13],
    "Sales": [120, 180, 210, 260, 330],
})
Fit model:
X = data[["Inventory", "Price"]]
X = sm.add_constant(X)
y = data["Sales"]

model = sm.OLS(y, X).fit()
print(model.summary())
Assumptions of Linear Regression
Regression assumes:
1. Linearity
Relationship should be approximately linear.
2. Independence
Observations should be independent.
3. Constant Variance
Residual variance should be stable.
4. Normal Residuals
Residuals should be approximately normal.
5. No Extreme Multicollinearity
Predictors should not be excessively correlated.
Analyst Workflow
When building a regression model:
Explore:
df.describe()
Visualize:
plt.scatter(df["Inventory"], df["Sales"])
Fit model:
model = sm.OLS(y, X).fit()
Inspect:
print(model.summary())
Evaluate:
- Coefficients
- p-values
- R²
- Residual plots
Practical Healthcare Exercise
Predict:
Length of Stay
using:
- Age
- Blood Pressure
- Heart Rate
Questions:
- Which variables matter most?
- How much variation is explained?
Practical Supply Chain Exercise
Predict:
Annual Sales
using:
- Inventory
- Deployment
- Price
- Marketing Spend
Questions:
- Which factors drive sales?
- Which factors are insignificant?
Lesson Summary
In this lesson we learned:
- What linear regression is
- Intercepts and slopes
- Model fitting
- Predictions
- R²
- Residuals
- Multiple regression
- Model assumptions
Linear regression is the foundation of predictive analytics because it transforms relationships into quantitative models that can be interpreted, tested, and used for prediction.
In the next lesson we will study Logistic Regression, which is used when the outcome is binary, such as:
- Sold vs Not Sold
- Readmitted vs Not Readmitted
- Churned vs Not Churned
- Disease vs No Disease
