Lesson 4: Data Integration and Joins: Combining Multiple Datasets

Infinity

3–5 minutes

course: python for data analysis, courses, data-science, python

Introduction

In Lesson 3, we learned how to summarize a single dataset using GroupBy.

However, in real-world analytics, the information you need is rarely contained in a single table.

For example:

Healthcare

You may have:

Patient table
Admissions table
Diagnosis table
Medication table

Supply Chain

You may have:

Sales table
Inventory table
Customer table
Product table

To perform meaningful analysis, these datasets must be combined.

This process is called a join.

If you know SQL, this is equivalent to:

JOIN

In Pandas, the primary function is:

pd.merge()

Understanding joins is one of the most important skills for a data analyst.

Why Do We Need Joins?

Suppose we have sales information.

			
import pandas as pd
sales = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Sales": [1000, 2000, 3000]
})
print(sales)

		

Output:

CustomerID	Sales
1	1000
2	2000
3	3000

Suppose customer names are stored separately.

			
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "CustomerName": [
        "Alpha",
        "Beta",
        "Gamma"
    ]
})
print(customers)

		

Output:

CustomerID	CustomerName
1	Alpha
2	Beta
3	Gamma

Neither table alone tells the full story.

We need to combine them.

Understanding Keys

A key is a column used to connect tables.

In our example:

CustomerID

exists in both tables.

This is called the join key.

Inner Join

The most common type of join.

Keep only records that exist in both tables.

			
result = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="inner"
)
print(result)

		

Output:

CustomerID	Sales	CustomerName
1	1000	Alpha
2	2000	Beta
3	3000	Gamma

What Does Inner Join Do?

Think of it as:

			
Sales
INTERSECTION
Customers

Only matching records survive.

Left Join

A left join keeps everything from the left table.

Example:

			
sales = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Sales": [1000, 2000, 3000, 1500]
})

Notice CustomerID 4 does not exist in the customer table.

Perform a left join:

			
result = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="left"
)
print(result)

		

Output:

CustomerID	Sales	CustomerName
1	1000	Alpha
2	2000	Beta
3	3000	Gamma
4	1500	NaN

Customer 4 remains because we kept all records from the left table.

This is one of the most common joins in analytics.

Right Join

Keep everything from the right table.

			
result = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="right"
)
print(result)

		

All customer records remain.

Less commonly used than left joins.

Outer Join

Keep everything from both tables.

			
result = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="outer"
)
print(result)

		

Output contains every record from both datasets.

Missing matches become NaN.

Visualizing Join Types

Suppose:

			
Sales IDs:
1 2 3 4
Customer IDs:
1 2 3 5

Inner Join:

1 2 3

Left Join:

1 2 3 4

Right Join:

1 2 3 5

Outer Join:

1 2 3 4 5

Joining on Different Column Names

Sometimes keys have different names.

Sales table:

			
sales = pd.DataFrame({
    "CustID": [1, 2, 3],
    "Sales": [1000, 2000, 3000]
})

Customer table:

			
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "CustomerName": [
        "Alpha",
        "Beta",
        "Gamma"
    ]
})

		

Join:

			
result = pd.merge(
    sales,
    customers,
    left_on="CustID",
    right_on="CustomerID",
    how="inner"
)

		

Joining Multiple Tables

This is extremely common.

Suppose:

Sales table:

			
sales = pd.DataFrame({
    "ProductID": [101, 102, 103],
    "Sales": [500, 300, 700]
})

Product table:

			
products = pd.DataFrame({
    "ProductID": [101, 102, 103],
    "ProductName": [
        "Diamond A",
        "Diamond B",
        "Diamond C"
    ]
})

		

Category table:

			
categories = pd.DataFrame({
    "ProductName": [
        "Diamond A",
        "Diamond B",
        "Diamond C"
    ],
    "Category": [
        "Round",
        "Oval",
        "Princess"
    ]
})

		

First join:

			
sales_products = pd.merge(
    sales,
    products,
    on="ProductID"
)

		

Second join:

			
final = pd.merge(
    sales_products,
    categories,
    on="ProductName"
)

		

Healthcare Example

Patient information:

			
patients = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Age": [45, 67, 30]
})

Admissions:

			
admissions = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "LengthOfStay": [5, 8, 2]
})

Combine:

			
patient_data = pd.merge(
    patients,
    admissions,
    on="PatientID"
)
print(patient_data)

		

Output:

PatientID	Age	LengthOfStay
1	45	5
2	67	8
3	30	2

Supply Chain Example

Inventory:

			
inventory = pd.DataFrame({
    "SKU": [
        "A100",
        "B200",
        "C300"
    ],
    "Inventory": [
        100,
        150,
        80
    ]
})

		

Sales:

			
sales = pd.DataFrame({
    "SKU": [
        "A100",
        "B200",
        "C300"
    ],
    "Sales": [
        250,
        125,
        500
    ]
})

		

Join:

			
sku_data = pd.merge(
    inventory,
    sales,
    on="SKU"
)
print(sku_data)

		

Output:

SKU	Inventory	Sales
A100	100	250
B200	150	125
C300	80	500

Creating New Metrics After Joining

Once datasets are combined, new metrics can be created.

Inventory Turn:

			
sku_data["Turn"] = (
    sku_data["Sales"]
    /
    sku_data["Inventory"]
)

		

View results:

print(sku_data)

Output:

SKU	Inventory	Sales	Turn
A100	100	250	2.50
B200	150	125	0.83
C300	80	500	6.25

Identifying Unmatched Records

A useful trick:

			
result = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="left",
    indicator=True
)
print(result)

		

Output:

CustomerID	Sales	CustomerName	_merge
1	1000	Alpha	both
2	2000	Beta	both
4	1500	NaN	left_only

This helps identify missing matches.

Analyst Workflow

A common workflow looks like:

			
sales = pd.read_csv("sales.csv")
customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
data = pd.merge(
    sales,
    customers,
    on="CustomerID",
    how="left"
)
data = pd.merge(
    data,
    products,
    on="ProductID",
    how="left"
)

		

The resulting dataset becomes the foundation for:

Dashboards
Statistical analysis
Forecasting
Machine learning

Common Join Patterns

Inner Join:

			
pd.merge(
    table1,
    table2,
    on="ID",
    how="inner"
)

		

Left Join:

			
pd.merge(
    table1,
    table2,
    on="ID",
    how="left"
)

		

Right Join:

			
pd.merge(
    table1,
    table2,
    on="ID",
    how="right"
)

		

Outer Join:

			
pd.merge(
    table1,
    table2,
    on="ID",
    how="outer"
)

		

Different Key Names:

			
pd.merge(
    table1,
    table2,
    left_on="CustID",
    right_on="CustomerID"
)

		

Lesson Summary

In this lesson we learned:

Why joins are necessary
What keys are
How inner joins work
How left joins work
How right joins work
How outer joins work
How to join multiple datasets
How to identify unmatched records
How joins support analytics workflows

Joins are one of the most important skills in data analysis because almost every real-world project requires combining information from multiple sources.

In the next lesson we will learn Data Visualization, where we begin transforming data into charts, graphs, and business insights.

nerd-ish

Leave a ReplyCancel reply

Measure Theory Lesson 30: Radon Measures

Measure Theory Lesson 52: Bott Periodicity — The Miracle Behind K-Theory

Measure Theory Lesson 51: K-Theory for Operator Algebras — Measuring the Shape of a Space Without Points

Lesson 4: Data Integration and Joins: Combining Multiple Datasets

Introduction

Healthcare

Supply Chain

Why Do We Need Joins?

Understanding Keys

Inner Join

What Does Inner Join Do?

Left Join

Right Join

Outer Join

Visualizing Join Types

Joining on Different Column Names

Joining Multiple Tables

Healthcare Example

Supply Chain Example

Creating New Metrics After Joining

Identifying Unmatched Records

Analyst Workflow

Common Join Patterns

Lesson Summary

Share this:

Like this:

Related posts:

Leave a ReplyCancel reply

Discover more from nerd-ish