Data Cleaning in Python Programming Language

Data cleaning code series in Python

Are you a business owner?
Are you engaged with a lot of datasets?
To achieve long-lasting business data integrity, not only for your current operations but also for future competitive reliance on that data, you need to ensure it is clean and optimized.
Data cleaning is a crucial step in the business data analysis process to ensure that the data you work with is accurate, reliable, and ready for analysis. Also known as data cleansing or data scrubbing, it involves detecting and correcting (or removing) errors and inconsistencies in data to improve its quality.
Here’s a series of Python code algorithms using libraries like Pandas and NumPy that you can use to optimize your business data.

import numpy as np
import pandas as pd
from scipy import stats
# Load data (replace 'data.csv' with your actual dataset)
df = pd.read_csv("data.csv")

# 1. Remove duplicates
df.drop_duplicates(inplace=True)
print("1. Duplicates removed")

# 2. Handle missing values
df.dropna(inplace=True) # Drop rows with missing values
# Alternatively, fill missing values with mean
# df.fillna(df.mean(), inplace=True)
print("2. Missing values handled")

# 3. Correct inconsistencies
# Example: Standardize date format
df["date"] = pd.to_datetime(df["date"], errors="coerce")
print("3. Inconsistencies in date format corrected")

# 4. Remove outliers
# Example: Remove outliers using z-score
df = df[(np.abs(stats.zscore(df.select_dtypes(include=[np.number]))) &amp;amp;lt; 3).all(axis=1)]
print("4. Outliers removed")

# 5. Fix structural errors
# Example: Correct typos in 'category' column
df["category"] = (
df["category"].str.lower().str.replace("misspeling", "correct_spelling")
)
print("5. Structural errors in 'category' column fixed")

# 6. Normalize data
# Example: Normalize numeric data to range [0, 1]
df["normalized_col"] = (df["numeric_col"] - df["numeric_col"].min()) / (
df["numeric_col"].max() - df["numeric_col"].min()
)
print("6. Numeric data normalized")

# 7. Handle inconsistent casing
# Example: Convert text data to lowercase
df["text_col"] = df["text_col"].str.lower()
print("7. Text data converted to lowercase")

# 8. Remove special characters
# Example: Remove special characters from 'text_col'
df["text_col"] = df["text_col"].str.replace(r"[^\w\s]", "")
print("8. Special characters removed from 'text_col'")

# 9. Remove leading/trailing whitespaces
# Example: Strip leading and trailing whitespaces from strings
df["text_col"] = df["text_col"].str.strip()
print("9. Leading and trailing whitespaces removed from 'text_col'")

# 10. Handle categorical data
# Example: Convert categorical data to numerical using label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["category_encoded"] = le.fit_transform(df["category"])
print("10. Categorical data encoded")

# 11. Handle datetime data
# Example: Extract year, month, day from datetime column
df["year"] = df["datetime_col"].dt.year
df["month"] = df["datetime_col"].dt.month
df["day"] = df["datetime_col"].dt.day
print("11. Date components extracted")

# 12. Convert data types
# Example: Convert 'numeric_col' to float
df["numeric_col"] = df["numeric_col"].astype(float)
print("12. Data type converted to float")

# 13. Handle inconsistent units
# Example: Convert units in 'unit_col' from pounds to kilograms
df["kg_col"] = df["unit_col"] * 0.453592
print("13. Units converted")

# 14. Binning data
# Example: Bin 'numeric_col' into categories based on predefined bins
bins = [0, 25, 50, 75, 100]
labels = ["Low", "Medium", "High", "Very High"]
df["binned_col"] = pd.cut(df["numeric_col"], bins=bins, labels=labels)
print("14. Data binned into categories")

# 15. Handle outliers
# Example: Replace outliers in 'numeric_col' with mean value
mean_value = df["numeric_col"].mean()
std_dev = df["numeric_col"].std()
df["numeric_col"] = np.where(
(df["numeric_col"] - mean_value) / std_dev &gt; 3, mean_value, df["numeric_col"]
)
print("15. Outliers in 'numeric_col' handled")

# 16. Validate data integrity
# Example: Assert that all 'numeric_col' values are positive
assert (df["numeric_col"] &gt; 0).all(), "All values in 'numeric_col' must be positive"
print("16. Data integrity validated")

# 17. Save cleaned data to a new CSV file
df.to_csv("cleaned_data.csv", index=False)
print("17. Cleaned data saved to 'cleaned_data.csv'")
print("Data cleaning complete. Data is ready for analysis.")

5/5 - (2 votes)