Introduction
In Lesson 3, we learned how to summarize a single dataset using GroupBy.
However, in real-world analytics, the information you need is rarely contained in a single table.
For example:
Healthcare
You may have:
- Patient table
- Admissions table
- Diagnosis table
- Medication table
Supply Chain
You may have:
- Sales table
- Inventory table
- Customer table
- Product table
To perform meaningful analysis, these datasets must be combined.
This process is called a join.
If you know SQL, this is equivalent to:
JOIN
In Pandas, the primary function is:
pd.merge()
Understanding joins is one of the most important skills for a data analyst.
Why Do We Need Joins?
Suppose we have sales information.
import pandas as pdsales = pd.DataFrame({ "CustomerID": [1, 2, 3], "Sales": [1000, 2000, 3000]})print(sales)
Output:
| CustomerID | Sales |
|---|---|
| 1 | 1000 |
| 2 | 2000 |
| 3 | 3000 |
Suppose customer names are stored separately.
customers = pd.DataFrame({ "CustomerID": [1, 2, 3], "CustomerName": [ "Alpha", "Beta", "Gamma" ]})print(customers)
Output:
| CustomerID | CustomerName |
|---|---|
| 1 | Alpha |
| 2 | Beta |
| 3 | Gamma |
Neither table alone tells the full story.
We need to combine them.
Understanding Keys
A key is a column used to connect tables.
In our example:
CustomerID
exists in both tables.
This is called the join key.
Inner Join
The most common type of join.
Keep only records that exist in both tables.
result = pd.merge( sales, customers, on="CustomerID", how="inner")print(result)
Output:
| CustomerID | Sales | CustomerName |
|---|---|---|
| 1 | 1000 | Alpha |
| 2 | 2000 | Beta |
| 3 | 3000 | Gamma |
What Does Inner Join Do?
Think of it as:
SalesINTERSECTIONCustomers
Only matching records survive.
Left Join
A left join keeps everything from the left table.
Example:
sales = pd.DataFrame({ "CustomerID": [1, 2, 3, 4], "Sales": [1000, 2000, 3000, 1500]})
Notice CustomerID 4 does not exist in the customer table.
Perform a left join:
result = pd.merge( sales, customers, on="CustomerID", how="left")print(result)
Output:
| CustomerID | Sales | CustomerName |
|---|---|---|
| 1 | 1000 | Alpha |
| 2 | 2000 | Beta |
| 3 | 3000 | Gamma |
| 4 | 1500 | NaN |
Customer 4 remains because we kept all records from the left table.
This is one of the most common joins in analytics.
Right Join
Keep everything from the right table.
result = pd.merge( sales, customers, on="CustomerID", how="right")print(result)
All customer records remain.
Less commonly used than left joins.
Outer Join
Keep everything from both tables.
result = pd.merge( sales, customers, on="CustomerID", how="outer")print(result)
Output contains every record from both datasets.
Missing matches become NaN.
Visualizing Join Types
Suppose:
Sales IDs:1 2 3 4Customer IDs:1 2 3 5
Inner Join:
1 2 3
Left Join:
1 2 3 4
Right Join:
1 2 3 5
Outer Join:
1 2 3 4 5
Joining on Different Column Names
Sometimes keys have different names.
Sales table:
sales = pd.DataFrame({ "CustID": [1, 2, 3], "Sales": [1000, 2000, 3000]})
Customer table:
customers = pd.DataFrame({ "CustomerID": [1, 2, 3], "CustomerName": [ "Alpha", "Beta", "Gamma" ]})
Join:
result = pd.merge( sales, customers, left_on="CustID", right_on="CustomerID", how="inner")
Joining Multiple Tables
This is extremely common.
Suppose:
Sales table:
sales = pd.DataFrame({ "ProductID": [101, 102, 103], "Sales": [500, 300, 700]})
Product table:
products = pd.DataFrame({ "ProductID": [101, 102, 103], "ProductName": [ "Diamond A", "Diamond B", "Diamond C" ]})
Category table:
categories = pd.DataFrame({ "ProductName": [ "Diamond A", "Diamond B", "Diamond C" ], "Category": [ "Round", "Oval", "Princess" ]})
First join:
sales_products = pd.merge( sales, products, on="ProductID")
Second join:
final = pd.merge( sales_products, categories, on="ProductName")
Healthcare Example
Patient information:
patients = pd.DataFrame({ "PatientID": [1, 2, 3], "Age": [45, 67, 30]})
Admissions:
admissions = pd.DataFrame({ "PatientID": [1, 2, 3], "LengthOfStay": [5, 8, 2]})
Combine:
patient_data = pd.merge( patients, admissions, on="PatientID")print(patient_data)
Output:
| PatientID | Age | LengthOfStay |
|---|---|---|
| 1 | 45 | 5 |
| 2 | 67 | 8 |
| 3 | 30 | 2 |
Supply Chain Example
Inventory:
inventory = pd.DataFrame({ "SKU": [ "A100", "B200", "C300" ], "Inventory": [ 100, 150, 80 ]})
Sales:
sales = pd.DataFrame({ "SKU": [ "A100", "B200", "C300" ], "Sales": [ 250, 125, 500 ]})
Join:
sku_data = pd.merge( inventory, sales, on="SKU")print(sku_data)
Output:
| SKU | Inventory | Sales |
|---|---|---|
| A100 | 100 | 250 |
| B200 | 150 | 125 |
| C300 | 80 | 500 |
Creating New Metrics After Joining
Once datasets are combined, new metrics can be created.
Inventory Turn:
sku_data["Turn"] = ( sku_data["Sales"] / sku_data["Inventory"])
View results:
print(sku_data)
Output:
| SKU | Inventory | Sales | Turn |
|---|---|---|---|
| A100 | 100 | 250 | 2.50 |
| B200 | 150 | 125 | 0.83 |
| C300 | 80 | 500 | 6.25 |
Identifying Unmatched Records
A useful trick:
result = pd.merge( sales, customers, on="CustomerID", how="left", indicator=True)print(result)
Output:
| CustomerID | Sales | CustomerName | _merge |
|---|---|---|---|
| 1 | 1000 | Alpha | both |
| 2 | 2000 | Beta | both |
| 4 | 1500 | NaN | left_only |
This helps identify missing matches.
Analyst Workflow
A common workflow looks like:
sales = pd.read_csv("sales.csv")customers = pd.read_csv("customers.csv")products = pd.read_csv("products.csv")data = pd.merge( sales, customers, on="CustomerID", how="left")data = pd.merge( data, products, on="ProductID", how="left")
The resulting dataset becomes the foundation for:
- Dashboards
- Statistical analysis
- Forecasting
- Machine learning
Common Join Patterns
Inner Join:
pd.merge( table1, table2, on="ID", how="inner")
Left Join:
pd.merge( table1, table2, on="ID", how="left")
Right Join:
pd.merge( table1, table2, on="ID", how="right")
Outer Join:
pd.merge( table1, table2, on="ID", how="outer")
Different Key Names:
pd.merge( table1, table2, left_on="CustID", right_on="CustomerID")
Lesson Summary
In this lesson we learned:
- Why joins are necessary
- What keys are
- How inner joins work
- How left joins work
- How right joins work
- How outer joins work
- How to join multiple datasets
- How to identify unmatched records
- How joins support analytics workflows
Joins are one of the most important skills in data analysis because almost every real-world project requires combining information from multiple sources.
In the next lesson we will learn Data Visualization, where we begin transforming data into charts, graphs, and business insights.

Leave a Reply