🎯 Goal:
By the end of this chapter, you’ll:
- Understand the importance of data cleaning in data science.
- Learn how to handle missing data, outliers, and duplicate data.
- Learn techniques to engineer features that improve model performance.
🧹 1. Data Cleaning: Why Is It Important?
Data cleaning involves preparing your data for analysis by handling issues such as:
- Missing values
- Outliers
- Inconsistent formatting
- Duplicates
📊 Clean data = accurate insights. If your data isn’t clean, even the best models will perform poorly.
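Inconsistent formatting is the one issue in the list above that the later sections don't revisit, so here is a minimal sketch of one way to tidy text values. The `raw` dataframe and its messy values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data with stray whitespace and mixed capitalization
raw = pd.DataFrame({"product": [" ring", "Ring ", "NECKLACE", "necklace"]})

# Strip surrounding whitespace, then apply one consistent capitalization
raw["product_clean"] = raw["product"].str.strip().str.title()

print(raw["product_clean"].unique())
```

After cleaning, the four raw spellings collapse into two distinct product names, so grouping and counting behave as expected.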
🧰 2. Handling Missing Data
Missing data can occur for various reasons, like incomplete data collection or system errors.
You can handle missing data in different ways:
- Remove missing values
- Fill missing values with mean, median, mode, or a placeholder value
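Before choosing between removing and filling, it usually helps to see how much data is actually missing. The sketch below counts missing values per column; the name `df_check` is just an illustrative choice.

```python
import pandas as pd

df_check = pd.DataFrame({
    "product": ["Ring", "Necklace", "Bracelet", None, "Earring"],
    "price": [2500, 1800, 1500, 1700, None],
})

# Number of missing values in each column
missing_counts = df_check.isna().sum()
print(missing_counts)

# Fraction of rows that have at least one missing value
share_incomplete = df_check.isna().any(axis=1).mean()
print(share_incomplete)
```

If only a small share of rows is incomplete, dropping them is often acceptable; if the share is large, filling tends to preserve more information.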
Example: Remove Missing Values
import pandas as pd
# Creating a sample dataframe
data = {
    "product": ["Ring", "Necklace", "Bracelet", None, "Earring"],
    "price": [2500, 1800, 1500, 1700, None],
}
df = pd.DataFrame(data)
# Drop rows with missing data
df_cleaned = df.dropna()
print(df_cleaned)
Example: Fill Missing Values
# Fill missing values with the median of the column
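# (assigning the result back avoids the chained inplace=True pattern, which newer pandas versions warn about)
df["price"] = df["price"].fillna(df["price"].median())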
print(df)
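The median only makes sense for numeric columns. For a text column such as product, the mode or a placeholder value is the usual choice. Here is a rough sketch of both options; the dataframe `df_fill` uses slightly altered sample data (with "Ring" repeated) so that a mode exists, and the placeholder "Unknown" is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sample: "Ring" appears twice so the column has a clear mode
df_fill = pd.DataFrame({
    "product": ["Ring", "Necklace", "Ring", None, "Earring"],
    "price": [2500, 1800, 1500, 1700, None],
})

# Option A: fill missing text with an explicit placeholder
df_fill["product_placeholder"] = df_fill["product"].fillna("Unknown")

# Option B: fill with the most frequent value
# mode() returns a Series (there can be ties), so take the first entry
most_common = df_fill["product"].mode()[0]
df_fill["product_mode"] = df_fill["product"].fillna(most_common)

print(df_fill)
```

A placeholder keeps the fact that the value was missing visible, while the mode hides it, so the right option depends on whether "missing" is itself meaningful in your data.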
🔢 3. Handling Outliers
Outliers are data points that differ significantly from the rest of the data.
You can detect outliers using:
- Visualizations: Boxplots, scatter plots
- Statistical methods: Z-scores, IQR (Interquartile Range)
Example: Removing Outliers Using IQR
# Compute IQR for 'price'
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter the dataset to remove outliers
df_cleaned = df[(df["price"] >= lower_bound) & (df["price"] <= upper_bound)]
print(df_cleaned)
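Z-scores, the other statistical method listed above, can be sketched in a similar way. The prices below are invented, and the cutoff of 2.5 is an assumption: 3 is a common choice for large datasets, but in a very small sample a single extreme value can barely reach a z-score of 3.

```python
import pandas as pd

# Hypothetical prices; 25000 is the value we expect to flag
prices = pd.DataFrame({
    "price": [2500, 1800, 1500, 1700, 2200, 2100, 1900, 2300, 1600, 2000, 25000]
})

# Z-score: how many standard deviations each value lies from the mean
z_scores = (prices["price"] - prices["price"].mean()) / prices["price"].std()

# Assumed cutoff for this small sample
threshold = 2.5

# Keep only rows whose absolute z-score is within the cutoff
prices_no_outliers = prices[z_scores.abs() <= threshold]
print(prices_no_outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, the IQR approach is often the more robust option when the data is heavily skewed.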
🔁 4. Removing Duplicates
Duplicate data can cause bias and skew analysis. Removing duplicates ensures that you have unique entries.
Example: Removing Duplicates
# Remove duplicate rows
df = df.drop_duplicates()
print(df)
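Sometimes rows count as duplicates even when only some columns match. The sketch below, using a made-up `orders` dataframe, counts duplicates first and then shows the `subset` and `keep` parameters of `drop_duplicates`.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical
orders = pd.DataFrame({
    "product": ["Ring", "Ring", "Necklace", "Ring"],
    "price": [2500, 2500, 1800, 2700],
})

# Count rows that exactly repeat an earlier row
n_dupes = orders.duplicated().sum()
print(n_dupes)

# Drop exact duplicates, keeping the first occurrence (the default)
unique_rows = orders.drop_duplicates()

# Treat rows as duplicates when 'product' alone matches, keeping the last one
unique_products = orders.drop_duplicates(subset=["product"], keep="last")
print(unique_products)
```

Checking the count before dropping is a cheap way to confirm that you are removing what you expect.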
⚙️ 5. Feature Engineering: What Is It?
Feature engineering is the process of creating new features (variables) from existing data to improve model performance.
Key techniques include:
- Binning: Grouping continuous variables into discrete categories
- Normalization/Scaling: Scaling data to a specific range (e.g., 0 to 1)
- One-Hot Encoding: Converting categorical variables into binary variables (0 or 1)
- Polynomial Features: Adding powers of features to capture more complexity
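As a quick illustration of binning, here is a minimal sketch using `pd.cut`. The bin edges and the labels "low", "mid", and "high" are arbitrary assumptions chosen for this sample.

```python
import pandas as pd

items = pd.DataFrame({"price": [2500, 1800, 1500, 1700, 2200]})

# Assumed bin edges: up to 1600, 1600 to 2000, and above 2000
bins = [0, 1600, 2000, float("inf")]
labels = ["low", "mid", "high"]

# By default each bin includes its right edge, e.g. (1600, 2000]
items["price_band"] = pd.cut(items["price"], bins=bins, labels=labels)
print(items)
```

In practice the edges usually come from domain knowledge or from quantiles of the data (see `pd.qcut`).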
Example: One-Hot Encoding
df = pd.DataFrame({
    'product': ['Ring', 'Necklace', 'Bracelet', 'Earring'],
    'category': ['Jewelry', 'Jewelry', 'Jewelry', 'Jewelry']
})
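# Convert the categorical column into indicator columns, one per distinct category
# (this sample has a single category, so only one new column is created)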
df_encoded = pd.get_dummies(df, columns=["category"])
print(df_encoded)
📊 6. Feature Scaling
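Feature scaling puts features on a comparable range so that a variable with large values (such as price) doesn't dominate one with small values just because of its units. This matters most for distance-based and gradient-based models.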
There are two common methods:
- Normalization: Rescaling data to a range of [0, 1]
- Standardization: Rescaling data to have a mean of 0 and standard deviation of 1
Example: Standardizing Features Using StandardScaler
from sklearn.preprocessing import StandardScaler
# Sample data
data = {"price": [2500, 1800, 1500, 1700, 2200]}
df = pd.DataFrame(data)
# Standardizing the 'price' column
scaler = StandardScaler()
df["price_scaled"] = scaler.fit_transform(df[["price"]])
print(df)
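Normalization, the other method mentioned above, follows the same pattern. This is a minimal sketch using scikit-learn's `MinMaxScaler`, which applies (x - min) / (max - min) to each column; the name `df_norm` is just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_norm = pd.DataFrame({"price": [2500, 1800, 1500, 1700, 2200]})

# Rescale 'price' so the smallest value maps to 0 and the largest to 1
min_max = MinMaxScaler()
df_norm["price_normalized"] = min_max.fit_transform(df_norm[["price"]]).ravel()
print(df_norm)
```

When you train a model, fit the scaler on the training data only and reuse it with `transform` on new data, so that information from the test set doesn't leak into training.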
🧑‍💻 7. Practice Time
Try these exercises:
- Create a dataframe with columns name, age, salary, and department.
- Remove rows with missing salary values and fill missing age values with the median.
- Identify and remove outliers in the salary column.
- Perform one-hot encoding on the department column.
- Normalize the salary column using Min-Max scaling.
✅ Summary
- Data cleaning is essential for ensuring that your analysis is accurate and trustworthy.
- Handle missing data, remove outliers, and eliminate duplicates to prepare your dataset.
- Feature engineering helps create new features or scale data to enhance your models.

