Lesson 1: Pandas Foundations: The Most Important Tool for Data Analysts

Infinity


Introduction

If Excel is the calculator of modern business, then Pandas is the spreadsheet on steroids.

Most data analysts who use Python spend much of their time working with Pandas. Whether you are analyzing healthcare data, sales transactions, inventory movements, customer behavior, or operational performance, the first step is almost always loading data into a Pandas DataFrame.

This lesson introduces the fundamental object used throughout Python data analysis: the DataFrame.

By the end of this lesson, you will be able to:

Import data into Python
Understand DataFrames and Series
Inspect datasets
Select rows and columns
Understand data types
Perform basic exploratory analysis

Throughout this course we will use examples from healthcare and supply chain analytics.

What is Pandas?

Pandas is a Python library designed for working with structured data.

Think of a DataFrame as a spreadsheet inside Python.

For example:

PatientID	Age	Diagnosis	LengthOfStay
101	45	Pneumonia	5
102	67	COPD	8
103	31	Asthma	2

SKU	Sales	Inventory
A100	250	100
B200	125	150
C300	500	80

These tables become DataFrames in Pandas.

Installing Pandas

If Pandas is not already installed:

pip install pandas

Import it using:

import pandas as pd

The abbreviation pd is the standard convention used almost everywhere.

The DataFrame

A DataFrame is a two-dimensional table consisting of rows and columns.

Example:

			
import pandas as pd
df = pd.DataFrame({
    "PatientID":[101,102,103],
    "Age":[45,67,31],
    "Diagnosis":["Pneumonia","COPD","Asthma"]
})
print(df)

		

Output:

   PatientID  Age  Diagnosis
0        101   45  Pneumonia
1        102   67       COPD
2        103   31     Asthma

Notice that Pandas automatically adds row labels (0, 1, 2, ...) on the left. Together these labels are called the index.
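To see the index directly, here is a small sketch that rebuilds the same three-row DataFrame and inspects its row labels. The variable name index_labels is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "PatientID": [101, 102, 103],
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma"],
})

# When no index is supplied, Pandas uses a RangeIndex counting up from 0.
print(df.index)

# The same labels as a plain Python list (illustrative variable name).
index_labels = list(df.index)
print(index_labels)  # [0, 1, 2]
```

The index is separate from the data columns, which is why PatientID still appears as an ordinary column.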

The Series

A Series is a single column of data.

Example:

df["Age"]

Output:

0    45
1    67
2    31
Name: Age, dtype: int64

A DataFrame is essentially a collection of Series objects.

Think:

Series = one column
DataFrame = entire table
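A quick way to convince yourself of this relationship is to pull one column out and inspect it. The sketch below reuses the small patient table from earlier; the variable name ages is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "PatientID": [101, 102, 103],
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma"],
})

# Selecting one column returns a Series.
ages = df["Age"]
print(type(ages).__name__)  # Series

# A Series remembers the column name it came from.
print(ages.name)  # Age

# Its values can be converted to a plain Python list.
print(ages.tolist())  # [45, 67, 31]

# The Series shares the DataFrame's row index.
print(ages.index.equals(df.index))  # True
```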

Loading Data

Most real-world analysis starts by importing a file.

CSV files:

df = pd.read_csv("patients.csv")

Excel files (reading .xlsx files requires the openpyxl package to be installed):

df = pd.read_excel("patients.xlsx")

Supply chain example:

sales = pd.read_csv("sales_history.csv")

Healthcare example:

patients = pd.read_csv("hospital_admissions.csv")
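If you do not have one of these files on hand, you can still try read_csv. The sketch below feeds it CSV text held in memory through io.StringIO, which behaves like an open file. The csv_text contents are made up to match the patient table shown earlier, not taken from a real hospital_admissions.csv.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """PatientID,Age,Diagnosis,LengthOfStay
101,45,Pneumonia,5
102,67,COPD,8
103,31,Asthma,2
"""

# read_csv accepts a file path or any file-like object.
patients = pd.read_csv(io.StringIO(csv_text))
print(patients)
```

With a real file you would pass its path instead, exactly as in the examples above.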

Viewing the First Rows

When receiving a new dataset, the first thing analysts do is inspect it.

View first five rows:

df.head()

Example:

patients.head()

Output:

			
   PatientID  Age  Diagnosis
0        101   45  Pneumonia
1        102   67       COPD
2        103   31     Asthma
...

		

View first ten rows:

df.head(10)

Viewing the Last Rows

Sometimes the end of the file is important.

df.tail()

df.tail(10)

This is useful for checking that a file loaded completely, for example to spot truncated rows or a totals line at the bottom of an export.
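Here is a small sketch of head and tail side by side. It uses a made-up 20-row DataFrame with a single column called Row so it is easy to see which rows come back.

```python
import pandas as pd

# Made-up data: 20 rows numbered 1 to 20.
df = pd.DataFrame({"Row": range(1, 21)})

first_five = df.head()    # default is 5 rows
first_three = df.head(3)  # ask for a specific number of rows
last_five = df.tail()     # last 5 rows

print(first_three["Row"].tolist())  # [1, 2, 3]
print(last_five["Row"].tolist())    # [16, 17, 18, 19, 20]
```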

Understanding Dataset Structure

One of the most useful commands:

df.info()

Example output (abbreviated):

			
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries
Data columns:
PatientID       int64
Age             int64
Diagnosis       object
LengthOfStay    float64

		

This tells us:

Number of rows
Number of columns
Data type of each column
Missing values (the full output lists a non-null count for each column)

Always run this immediately after loading data.
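To see how info() reveals missing values, here is a sketch with a made-up three-row table in which one Diagnosis is missing. info() normally prints straight to the screen; passing buf is only done here so the report can be captured as text.

```python
import io

import pandas as pd

# Made-up data with one missing diagnosis.
patients = pd.DataFrame({
    "PatientID": [101, 102, 103],
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", None, "Asthma"],
})

# Capture the report in a text buffer instead of printing it directly.
buffer = io.StringIO()
patients.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

In the report, Diagnosis shows 2 non-null values against 3 entries, which tells you one value is missing.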

Understanding Data Types

Common data types include:

Type	Meaning
int64	Integers
float64	Decimal numbers
object	Text (strings) or mixed values
bool	True/False
datetime64	Dates

Example:

df.dtypes

Output:

			
PatientID        int64
Age              int64
Diagnosis       object
LengthOfStay   float64

Understanding data types prevents many analysis errors.
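One common error of this kind is a numeric column that was read in as text. The sketch below uses a made-up sales table where Sales is stored as strings, then converts it with pd.to_numeric so that arithmetic works as expected.

```python
import pandas as pd

# Made-up data: Sales arrived as text rather than numbers.
sales = pd.DataFrame({
    "SKU": ["A100", "B200", "C300"],
    "Sales": ["250", "125", "500"],
})

# Check whether the column is numeric before doing any math on it.
is_numeric_before = pd.api.types.is_numeric_dtype(sales["Sales"])
print(is_numeric_before)  # False

# Convert the text values to numbers.
sales["Sales"] = pd.to_numeric(sales["Sales"])
is_numeric_after = pd.api.types.is_numeric_dtype(sales["Sales"])
print(is_numeric_after)  # True

# Now numeric operations such as the mean behave as expected.
average_sales = sales["Sales"].mean()
print(average_sales)
```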

Dataset Dimensions

How many rows and columns?

df.shape

Output:

(1000, 12)

Meaning:

1000 rows
12 columns

Note that shape is an attribute, not a method, so it is written without parentheses.
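Because shape is a plain tuple, you can unpack it into two variables. A minimal sketch using the small patient table from earlier (the names n_rows and n_cols are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "PatientID": [101, 102, 103],
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 3 columns

# len(df) gives the row count on its own.
print(len(df))  # 3
```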

Column Names

View all columns:

df.columns

Example:

			
PatientID
Age
Diagnosis
LengthOfStay

This is often one of the first commands analysts run.

Selecting Columns

Single column:

df["Age"]

Multiple columns:

df[["Age","Diagnosis"]]

Healthcare example:

patients[["Age","LengthOfStay"]]

Supply chain example:

sales[["SKU","Sales"]]
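The difference between single and double brackets trips up many beginners: single brackets return a Series, double brackets return a DataFrame, even when only one column is requested. A small sketch with made-up patient data:

```python
import pandas as pd

patients = pd.DataFrame({
    "PatientID": [101, 102, 103],
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma"],
    "LengthOfStay": [5, 8, 2],
})

age_series = patients["Age"]    # single brackets: a Series
age_frame = patients[["Age"]]   # double brackets: a one-column DataFrame
subset = patients[["Age", "LengthOfStay"]]

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
print(list(subset.columns))       # ['Age', 'LengthOfStay']
```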

Selecting Rows

Select by row position:

df.iloc[0]

Returns the first row as a Series.

Select multiple rows:

df.iloc[0:5]

Returns the first five rows (positions 0 through 4; the end position is not included).

Selecting Rows and Columns Together

Example:

df.iloc[0:5,0:3]

Meaning:

First five rows
First three columns

Very useful during exploration.
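The iloc patterns above can be tried on a small table. The six rows below are made up for illustration, extending the patient example from earlier.

```python
import pandas as pd

# Made-up data: six patients, four columns.
df = pd.DataFrame({
    "PatientID": [101, 102, 103, 104, 105, 106],
    "Age": [45, 67, 31, 58, 72, 39],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma", "COPD", "Pneumonia", "Asthma"],
    "LengthOfStay": [5, 8, 2, 4, 9, 3],
})

first_row = df.iloc[0]       # one row, returned as a Series
first_five = df.iloc[0:5]    # rows at positions 0 to 4
block = df.iloc[0:5, 0:3]    # first five rows, first three columns

print(first_row["PatientID"])  # 101
print(block.shape)             # (5, 3)
print(list(block.columns))     # ['PatientID', 'Age', 'Diagnosis']
```

As with Python list slicing, the end position of each slice is excluded.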

Summary Statistics

Quick summary:

df.describe()

Example output (abbreviated; quartile rows omitted):

			
Age
count   1000
mean      47.2
std       15.8
min       18
max       89

		

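This provides:

Count
Mean
Standard deviation
Minimum
Maximum
Quartiles

By default, describe() summarizes only the numeric columns, which makes it a quick first look at the range and spread of the data.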
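The result of describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch on the three-row patient table (the values are the illustrative ones used earlier):

```python
import pandas as pd

patients = pd.DataFrame({
    "Age": [45, 67, 31],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma"],
    "LengthOfStay": [5, 8, 2],
})

summary = patients.describe()
print(summary)

# Only the numeric columns appear in the summary.
print(list(summary.columns))  # ['Age', 'LengthOfStay']

# Look up single statistics with .loc[row_label, column_label].
mean_age = summary.loc["mean", "Age"]
median_stay = summary.loc["50%", "LengthOfStay"]  # the 50% row is the median
print(mean_age, median_stay)
```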

Healthcare Example

Suppose we have hospital admissions:

patients = pd.read_csv("hospital_admissions.csv")

Investigate:

			
patients.head()
patients.info()
patients.describe()

Questions:

Average age?
Average length of stay?
Missing diagnoses?
Number of admissions?

This is the beginning of exploratory analysis.
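As a rough sketch, each of these questions maps onto a line or two of Pandas. The four rows below are made up, and the code assumes one row per admission and the column names from the earlier patient table; a real hospital_admissions.csv may be laid out differently.

```python
import pandas as pd

# Made-up stand-in for hospital_admissions.csv (one row per admission assumed).
patients = pd.DataFrame({
    "PatientID": [101, 102, 103, 104],
    "Age": [45, 67, 31, 58],
    "Diagnosis": ["Pneumonia", "COPD", "Asthma", None],
    "LengthOfStay": [5, 8, 2, 4],
})

average_age = patients["Age"].mean()                     # Average age?
average_stay = patients["LengthOfStay"].mean()           # Average length of stay?
missing_diagnoses = patients["Diagnosis"].isna().sum()   # Missing diagnoses?
n_admissions = len(patients)                             # Number of admissions?

print(average_age, average_stay, missing_diagnoses, n_admissions)
```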

Supply Chain Example

Suppose we have sales history:

sales = pd.read_csv("sales_history.csv")

Investigate:

			
sales.head()
sales.info()
sales.describe()

Questions:

Average sales?
Top-selling SKU?
Inventory distribution?
Number of active products?

Again, the first stage of every analysis project.
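Here is one possible way to answer these questions, using the three-row SKU table from the start of the lesson in place of a real file. Treating every distinct SKU in the table as an active product is an assumption made for this sketch; real data would need a proper definition of "active".

```python
import pandas as pd

# Stand-in for the sales file, using the example table from earlier.
sales = pd.DataFrame({
    "SKU": ["A100", "B200", "C300"],
    "Sales": [250, 125, 500],
    "Inventory": [100, 150, 80],
})

# Average sales across SKUs.
average_sales = sales["Sales"].mean()

# SKU with the highest sales: idxmax gives the row label of the maximum.
top_row = sales.loc[sales["Sales"].idxmax()]
top_sku = top_row["SKU"]
print(top_sku)  # C300

# Inventory distribution: summary statistics for one column.
inventory_summary = sales["Inventory"].describe()
print(inventory_summary)

# Number of products, assuming each distinct SKU counts as active.
active_products = sales["SKU"].nunique()
print(active_products)  # 3
```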

Common Beginner Workflow

Whenever you receive a new dataset:

			
import pandas as pd
df = pd.read_csv("file.csv")
df.head()
df.info()
df.shape
df.columns
df.describe()

		

This sequence alone will quickly reveal the most important characteristics of the dataset.
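One practical detail: in a notebook, only the last expression in a cell is displayed automatically, and in a plain Python script a bare expression such as df.head() displays nothing at all. The sketch below wraps the workflow in a small helper that prints each step. The function name first_look and the in-memory CSV text are made up for this example.

```python
import io

import pandas as pd


def first_look(df):
    """Print the standard first-look checks for a DataFrame."""
    print(df.head())
    df.info()                 # info() prints its own report
    print(df.shape)
    print(list(df.columns))
    print(df.describe())


# Made-up CSV content standing in for file.csv.
csv_text = """PatientID,Age,Diagnosis,LengthOfStay
101,45,Pneumonia,5
102,67,COPD,8
103,31,Asthma,2
"""

df = pd.read_csv(io.StringIO(csv_text))
first_look(df)
```

With a real dataset, replace the io.StringIO call with the path to your file.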

Lesson Summary

In this lesson we learned:

What Pandas is
What a DataFrame is
What a Series is
How to load CSV files
How to load Excel files
How to inspect data
How to view rows and columns
How to understand data types
How to generate summary statistics

These skills form the foundation of every Python-based data analysis project.

In the next lesson we will learn Data Cleaning, where we begin dealing with missing values, duplicate records, inconsistent data, and the messy realities of real-world datasets.
