Lesson 4: Data Integration and Joins: Combining Multiple Datasets

Introduction

In Lesson 3, we learned how to summarize a single dataset using GroupBy.

However, in real-world analytics, the information you need is rarely contained in a single table.

For example:

Healthcare

You may have:

  • Patient table
  • Admissions table
  • Diagnosis table
  • Medication table

Supply Chain

You may have:

  • Sales table
  • Inventory table
  • Customer table
  • Product table

To perform meaningful analysis, these datasets must be combined.

This process is called a join.

If you know SQL, this is equivalent to:

JOIN

In Pandas, the primary function is:

pd.merge()

Understanding joins is one of the most important skills for a data analyst.


Why Do We Need Joins?

Suppose we have sales information.

import pandas as pd
sales = pd.DataFrame({
"CustomerID": [1, 2, 3],
"Sales": [1000, 2000, 3000]
})
print(sales)

Output:

CustomerIDSales
11000
22000
33000

Suppose customer names are stored separately.

customers = pd.DataFrame({
"CustomerID": [1, 2, 3],
"CustomerName": [
"Alpha",
"Beta",
"Gamma"
]
})
print(customers)

Output:

CustomerIDCustomerName
1Alpha
2Beta
3Gamma

Neither table alone tells the full story.

We need to combine them.


Understanding Keys

A key is a column used to connect tables.

In our example:

CustomerID

exists in both tables.

This is called the join key.


Inner Join

The most common type of join.

Keep only records that exist in both tables.

result = pd.merge(
sales,
customers,
on="CustomerID",
how="inner"
)
print(result)

Output:

CustomerIDSalesCustomerName
11000Alpha
22000Beta
33000Gamma

What Does Inner Join Do?

Think of it as:

Sales
INTERSECTION
Customers

Only matching records survive.


Left Join

A left join keeps everything from the left table.

Example:

sales = pd.DataFrame({
"CustomerID": [1, 2, 3, 4],
"Sales": [1000, 2000, 3000, 1500]
})

Notice CustomerID 4 does not exist in the customer table.

Perform a left join:

result = pd.merge(
sales,
customers,
on="CustomerID",
how="left"
)
print(result)

Output:

CustomerIDSalesCustomerName
11000Alpha
22000Beta
33000Gamma
41500NaN

Customer 4 remains because we kept all records from the left table.

This is one of the most common joins in analytics.


Right Join

Keep everything from the right table.

result = pd.merge(
sales,
customers,
on="CustomerID",
how="right"
)
print(result)

All customer records remain.

Less commonly used than left joins.


Outer Join

Keep everything from both tables.

result = pd.merge(
sales,
customers,
on="CustomerID",
how="outer"
)
print(result)

Output contains every record from both datasets.

Missing matches become NaN.


Visualizing Join Types

Suppose:

Sales IDs:
1 2 3 4
Customer IDs:
1 2 3 5

Inner Join:

1 2 3

Left Join:

1 2 3 4

Right Join:

1 2 3 5

Outer Join:

1 2 3 4 5

Joining on Different Column Names

Sometimes keys have different names.

Sales table:

sales = pd.DataFrame({
"CustID": [1, 2, 3],
"Sales": [1000, 2000, 3000]
})

Customer table:

customers = pd.DataFrame({
"CustomerID": [1, 2, 3],
"CustomerName": [
"Alpha",
"Beta",
"Gamma"
]
})

Join:

result = pd.merge(
sales,
customers,
left_on="CustID",
right_on="CustomerID",
how="inner"
)

Joining Multiple Tables

This is extremely common.

Suppose:

Sales table:

sales = pd.DataFrame({
"ProductID": [101, 102, 103],
"Sales": [500, 300, 700]
})

Product table:

products = pd.DataFrame({
"ProductID": [101, 102, 103],
"ProductName": [
"Diamond A",
"Diamond B",
"Diamond C"
]
})

Category table:

categories = pd.DataFrame({
"ProductName": [
"Diamond A",
"Diamond B",
"Diamond C"
],
"Category": [
"Round",
"Oval",
"Princess"
]
})

First join:

sales_products = pd.merge(
sales,
products,
on="ProductID"
)

Second join:

final = pd.merge(
sales_products,
categories,
on="ProductName"
)

Healthcare Example

Patient information:

patients = pd.DataFrame({
"PatientID": [1, 2, 3],
"Age": [45, 67, 30]
})

Admissions:

admissions = pd.DataFrame({
"PatientID": [1, 2, 3],
"LengthOfStay": [5, 8, 2]
})

Combine:

patient_data = pd.merge(
patients,
admissions,
on="PatientID"
)
print(patient_data)

Output:

PatientIDAgeLengthOfStay
1455
2678
3302

Supply Chain Example

Inventory:

inventory = pd.DataFrame({
"SKU": [
"A100",
"B200",
"C300"
],
"Inventory": [
100,
150,
80
]
})

Sales:

sales = pd.DataFrame({
"SKU": [
"A100",
"B200",
"C300"
],
"Sales": [
250,
125,
500
]
})

Join:

sku_data = pd.merge(
inventory,
sales,
on="SKU"
)
print(sku_data)

Output:

SKUInventorySales
A100100250
B200150125
C30080500

Creating New Metrics After Joining

Once datasets are combined, new metrics can be created.

Inventory Turn:

sku_data["Turn"] = (
sku_data["Sales"]
/
sku_data["Inventory"]
)

View results:

print(sku_data)

Output:

SKUInventorySalesTurn
A1001002502.50
B2001501250.83
C300805006.25

Identifying Unmatched Records

A useful trick:

result = pd.merge(
sales,
customers,
on="CustomerID",
how="left",
indicator=True
)
print(result)

Output:

CustomerIDSalesCustomerName_merge
11000Alphaboth
22000Betaboth
41500NaNleft_only

This helps identify missing matches.


Analyst Workflow

A common workflow looks like:

sales = pd.read_csv("sales.csv")
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
data = pd.merge(
sales,
customers,
on="CustomerID",
how="left"
)
data = pd.merge(
data,
products,
on="ProductID",
how="left"
)

The resulting dataset becomes the foundation for:

  • Dashboards
  • Statistical analysis
  • Forecasting
  • Machine learning

Common Join Patterns

Inner Join:

pd.merge(
table1,
table2,
on="ID",
how="inner"
)

Left Join:

pd.merge(
table1,
table2,
on="ID",
how="left"
)

Right Join:

pd.merge(
table1,
table2,
on="ID",
how="right"
)

Outer Join:

pd.merge(
table1,
table2,
on="ID",
how="outer"
)

Different Key Names:

pd.merge(
table1,
table2,
left_on="CustID",
right_on="CustomerID"
)

Lesson Summary

In this lesson we learned:

  • Why joins are necessary
  • What keys are
  • How inner joins work
  • How left joins work
  • How right joins work
  • How outer joins work
  • How to join multiple datasets
  • How to identify unmatched records
  • How joins support analytics workflows

Joins are one of the most important skills in data analysis because almost every real-world project requires combining information from multiple sources.

In the next lesson we will learn Data Visualization, where we begin transforming data into charts, graphs, and business insights.

Leave a Reply

Discover more from nerd-ish

Subscribe now to keep reading and get access to the full archive.

Continue reading