Python 12: Data Cleaning and Feature Engineering 


🎯 Goal:

By the end of this chapter, you’ll:

  • Understand the importance of data cleaning in data science.
  • Learn how to handle missing data, outliers, and duplicate data.
  • Learn techniques to engineer features that improve model performance.

🧹 1. Data Cleaning: Why Is It Important?

Data cleaning involves preparing your data for analysis by handling issues such as:

  • Missing values
  • Outliers
  • Inconsistent formatting
  • Duplicates

📊 Clean data is a prerequisite for accurate insights. A model trained on flawed data learns the flaws, so even the best models will perform poorly if your data isn’t clean.
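
Example: A Quick Data Quality Check (sketch)

Before fixing anything, it helps to measure how much of each problem a dataset has. The snippet below is a minimal sketch on a small made-up dataframe (the values are hypothetical, chosen only to show each issue): it counts missing values, normalizes inconsistent text formatting, and then counts duplicate rows.

```python
import pandas as pd

# Hypothetical messy data: stray whitespace, mixed capitalization, a missing name
df = pd.DataFrame({
    "product": [" ring", "Necklace", "RING ", None],
    "price": [2500, 1800, 2500, 1700],
})

# Missing values per column
print(df.isna().sum())

# Inconsistent formatting: trim whitespace and unify capitalization
df["product"] = df["product"].str.strip().str.title()

# Exact duplicate rows (only visible once the text has been normalized)
print(df.duplicated().sum())
```

Note the order: the two "Ring" rows only count as duplicates after their formatting has been made consistent.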


🧰 2. Handling Missing Data

Missing data can occur for various reasons, like incomplete data collection or system errors.

You can handle missing data in different ways:

  • Remove missing values
  • Fill missing values with mean, median, mode, or a placeholder value

Example: Remove Missing Values

import pandas as pd

# Creating a sample dataframe
data = {"product": ["Ring", "Necklace", "Bracelet", None, "Earring"],
        "price": [2500, 1800, 1500, 1700, None]}

df = pd.DataFrame(data)

# Drop rows with missing data
df_cleaned = df.dropna()
print(df_cleaned)

Example: Fill Missing Values

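# Fill missing values with the median of the column.
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df["price"] = df["price"].fillna(df["price"].median())
print(df)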
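
Example: Fill With the Mean, the Mode, or a Placeholder (sketch)

The other fill strategies listed above follow the same pattern. This is a minimal sketch, not the only way to do it: the "metal" column and the "Unknown" label are assumptions added purely for illustration.

```python
import pandas as pd

# Same sample data as above, plus a hypothetical "metal" column for the mode example
df = pd.DataFrame({
    "product": ["Ring", "Necklace", "Bracelet", None, "Earring"],
    "metal": ["Gold", "Silver", "Gold", None, "Gold"],
    "price": [2500, 1800, 1500, 1700, None],
})

# Placeholder: mark missing names explicitly ("Unknown" is an arbitrary label)
df["product"] = df["product"].fillna("Unknown")

# Mode: use the most frequent value, a common choice for categorical columns
df["metal"] = df["metal"].fillna(df["metal"].mode()[0])

# Mean: use the column average (missing values are skipped when computing it)
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

As a rule of thumb, the median is less sensitive to extreme values than the mean, so it is often the safer choice for skewed numeric columns such as prices.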

🔢 3. Handling Outliers

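Outliers are data points that differ significantly from the rest of the data. They may be genuine extreme values or errors such as typos and unit mix-ups, so inspect them before removing them.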

You can detect outliers using:

  • Visualizations: Boxplots, scatter plots
  • Statistical methods: Z-scores, IQR (Interquartile Range)

Example: Removing Outliers Using IQR

# Compute IQR for 'price'
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter the dataset to remove outliers
df_cleaned = df[(df["price"] >= lower_bound) & (df["price"] <= upper_bound)]
print(df_cleaned)
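
Example: Detecting Outliers Using Z-Scores (sketch)

The Z-score method mentioned above measures how many standard deviations each value lies from the mean. The sketch below uses made-up prices with one obvious error, and a threshold of 2. Both are assumptions for illustration: a threshold of 3 is common on larger datasets, but in a sample of only ten values no single point can reach a Z-score of 3.

```python
import pandas as pd

# Hypothetical prices; the last value looks like a data-entry error
df = pd.DataFrame({
    "price": [1500, 1700, 1800, 2200, 2500, 1600, 1900, 2100, 2000, 25000]
})

# Z-score: distance from the mean in units of standard deviation
z_scores = (df["price"] - df["price"].mean()) / df["price"].std()

threshold = 2  # assumed cutoff for this small sample
outliers = df[z_scores.abs() > threshold]
df_cleaned = df[z_scores.abs() <= threshold]

print(outliers)
print(df_cleaned)
```

Z-scores rely on the mean and standard deviation, which are themselves pulled around by extreme values. The IQR method above is usually more robust on small or skewed datasets.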

🔁 4. Removing Duplicates

Duplicate rows give some observations extra weight, which can bias counts, averages, and model training. Removing them ensures that each entry appears only once.

Example: Removing Duplicates

# Remove duplicate rows
df = df.drop_duplicates()
print(df)
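
Example: Inspecting Duplicates and Deduplicating by Column (sketch)

It is often worth checking which rows are duplicates before dropping them, and sometimes two rows should count as duplicates even if only a key column matches. The sketch below uses a small hypothetical dataframe to show duplicated() together with the subset and keep parameters of drop_duplicates().

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 shares only the product name
df = pd.DataFrame({
    "product": ["Ring", "Necklace", "Ring", "Ring"],
    "price": [2500, 1800, 2500, 2600],
})

# True for every row that repeats an earlier row exactly
print(df.duplicated())

# Drop exact duplicates only (the first occurrence is kept by default)
exact = df.drop_duplicates()

# Treat rows with the same product as duplicates and keep the last one
by_product = df.drop_duplicates(subset=["product"], keep="last")

print(exact)
print(by_product)
```

Whether to keep the first or the last occurrence depends on the data; for example, keep="last" suits a log where later rows are corrections.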

⚙️ 5. Feature Engineering: What Is It?

Feature engineering is the process of creating new features (variables) from existing data to improve model performance.

Key techniques include:

  • Binning: Grouping continuous variables into discrete categories
  • Normalization/Scaling: Scaling data to a specific range (e.g., 0 to 1)
  • One-Hot Encoding: Converting categorical variables into binary variables (0 or 1)
  • Polynomial Features: Adding powers of features to capture more complexity
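
Example: Binning (sketch)

Binning can be sketched with pd.cut, which assigns each value to an interval. The bin edges and the labels "low", "mid", and "high" below are arbitrary choices for illustration, not recommended thresholds.

```python
import pandas as pd

df = pd.DataFrame({"price": [2500, 1800, 1500, 1700, 2200]})

# Assumed bin edges and labels, chosen only for this example
bins = [0, 1600, 2000, 3000]
labels = ["low", "mid", "high"]

# Intervals include their right edge by default: (0, 1600], (1600, 2000], (2000, 3000]
df["price_band"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df)
```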

Example: One-Hot Encoding

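df = pd.DataFrame({
    'product': ['Ring', 'Necklace', 'Bracelet', 'Earring'],
    'category': ['Jewelry', 'Jewelry', 'Jewelry', 'Jewelry']
})

# Convert the categorical column into indicator columns.
# dtype=int yields 0/1 instead of True/False.
# Note: 'category' holds a single value here, so only one column
# (category_Jewelry) is produced; a column with k distinct values yields k columns.
df_encoded = pd.get_dummies(df, columns=["category"], dtype=int)
print(df_encoded)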
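
Example: Polynomial Features (sketch)

Polynomial features can be created by hand with ordinary column arithmetic. In the sketch below the "weight_g" column and its values are hypothetical, added only so there is a second feature to combine with price.

```python
import pandas as pd

# "weight_g" is a hypothetical second feature for illustration
df = pd.DataFrame({
    "price": [2500, 1800, 1500, 1700, 2200],
    "weight_g": [5, 12, 8, 3, 6],
})

# Degree-2 term: lets a linear model fit a curved relationship with price
df["price_squared"] = df["price"] ** 2

# Interaction term: captures the combined effect of two features
df["price_x_weight"] = df["price"] * df["weight_g"]

print(df)
```

For many columns, scikit-learn's PolynomialFeatures transformer generates these terms automatically. Polynomial terms grow quickly in magnitude, so they are usually combined with feature scaling.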

📊 6. Feature Scaling

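Feature scaling puts features on a comparable range so that a variable with large numeric values (such as price) does not dominate one with small values. This matters most for models that rely on distances or gradient descent.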

There are two common methods:

  1. Normalization: Rescaling data to a range of [0, 1]
  2. Standardization: Rescaling data to have a mean of 0 and standard deviation of 1

Example: Standardizing Features Using StandardScaler

from sklearn.preprocessing import StandardScaler

# Sample data
data = {"price": [2500, 1800, 1500, 1700, 2200]}
df = pd.DataFrame(data)

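# Standardizing the 'price' column.
# fit_transform expects 2-D input (hence the double brackets) and returns
# a 2-D array, so flatten it before storing it as a single column.
scaler = StandardScaler()
df["price_scaled"] = scaler.fit_transform(df[["price"]]).ravel()
print(df)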
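
Example: Normalizing Features Using Min-Max Scaling (sketch)

Normalization, the first method listed above, maps the smallest value to 0 and the largest to 1 using (x - min) / (max - min). The sketch below applies that formula directly with pandas on the same sample prices; the column name "price_normalized" is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"price": [2500, 1800, 1500, 1700, 2200]})

price_min = df["price"].min()
price_max = df["price"].max()

# Min-Max formula: (x - min) / (max - min), giving values in [0, 1]
df["price_normalized"] = (df["price"] - price_min) / (price_max - price_min)

print(df)
```

scikit-learn's MinMaxScaler performs the same rescaling and follows the same fit_transform pattern as StandardScaler. Because the minimum and maximum define the range, a single outlier can squeeze all other values together, so it is usually best to handle outliers first.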

🧑‍💻 7. Practice Time

Try these exercises:

  1. Create a dataframe with columns name, age, salary, and department.
  2. Remove rows with missing salary values and fill missing age values with the median.
  3. Identify and remove outliers in the salary column.
  4. Perform one-hot encoding on the department column.
  5. Normalize the salary column using Min-Max scaling.

✅ Summary

  • Data cleaning is essential for ensuring that your analysis is accurate and trustworthy.
  • Use various techniques like handling missing data, removing outliers, and eliminating duplicates to prepare your dataset.
  • Feature engineering helps create new features or scale data to enhance your models.
