Data Cleaning Techniques
Handling Missing Data
Before training a machine learning model, it is important to ensure that the dataset is accurate, complete, and free from inconsistencies. Poor-quality data can lead to biased models and unreliable predictions. Data cleaning addresses these challenges by identifying and correcting problems such as missing values, duplicates, inconsistencies, and outliers.
Machine learning models rely on patterns in the data to make predictions. If the data contains errors or inconsistencies, a model may learn misleading patterns and produce incorrect results: biased estimates, reduced accuracy, and predictions that cannot be trusted. Applying appropriate data cleaning techniques minimizes these risks and helps models produce more meaningful and trustworthy insights.
Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:
- Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
- Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
- Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.
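Before choosing among these methods, it can help to quantify how much data is missing and where. A minimal sketch in pandas (using the small example dataset that appears below; column names are from that example):

```python
import pandas as pd

# Example dataset with missing values in Age and Salary
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Count missing values per column to see which columns are affected
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Count how many rows are affected overall
rows_with_missing = df.isnull().any(axis=1).sum()
print(f"{rows_with_missing} of {len(df)} rows contain missing values")
```

If only a small share of rows is affected, removal may be acceptable; if a large share is affected, imputation or flagging is usually safer.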
Let's use this dataset as an example:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | | 48000 |
| 103 | Charlie | 29 | |
| 104 | David | 33 | 62000 |
| 105 | Eve | | |
Method 1: Removal
Google Sheets and Microsoft Excel
In both applications, a straightforward approach is to apply a filter to each column, filter for blank cells, and delete the matching rows. In Excel, you can also select the data range, choose Find & Select > Go To Special > Blanks, and then delete the rows containing the highlighted cells.
Python
```python
import pandas as pd

# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
```
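`dropna()` with no arguments drops every row that has any missing value, which can be too aggressive. A hedged sketch of two of its standard parameters, `subset` and `thresh`, using the same example dataset:

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Drop a row only when a specific column (here Salary) is missing
df_salary_complete = df.dropna(subset=['Salary'])

# Keep rows that have at least 3 non-missing values
df_mostly_complete = df.dropna(thresh=3)

print(df_salary_complete['ID'].tolist())
print(df_mostly_complete['ID'].tolist())
```

This keeps rows 101, 102, and 104 in the first case (Bob's missing Age is tolerated), and drops only Eve's nearly empty row in the second.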
R
```r
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 105),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(25, NA, 29, 33, NA),
  Salary = c(55000, 48000, NA, 62000, NA)
)

# Remove rows with missing values
cleaned_data <- na.omit(data)
print(cleaned_data)
```
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 104 | David | 33 | 62000 |
Method 2: Imputation
Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula computes an Age_Imputed value in column E and works in both spreadsheet applications (AVERAGE ignores blank cells, so it averages only the observed ages).
=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)
Python
```python
# Impute missing values with the mean of each numeric column
# (df is the dataset created in Method 1)
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        df[column] = df[column].fillna(df[column].mean())
print(df)
```
R
```r
# Impute missing values with the mean (data is the dataset created in Method 1)
data <- data %>%
  mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
print(data)
```
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | 29 | 48000 |
| 103 | Charlie | 29 | 55000 |
| 104 | David | 33 | 62000 |
| 105 | Eve | 29 | 55000 |
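Mean imputation is sensitive to outliers: a single very large salary would pull every imputed value upward. A common alternative, sketched here as a variation on the Python code above, is to impute with the median instead (for this small example the median and mean happen to coincide):

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Impute numeric columns with the median, which is robust to outliers
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].median())

print(df)
```

Which statistic is appropriate depends on the distribution of the column: median for skewed data, mean for roughly symmetric data.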
Method 3: Flagging
Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.
Google Sheets and Microsoft Excel
Assuming the same spreadsheet structure, you can apply column-based flagging to the data in Cell A2 using the following formula, entered in Cell E2. You can then drag it across to cover the remaining columns (e.g., B2 flagged in F2, C2 in G2) and down to cover all rows (e.g., A3 flagged in E3, continuing through An and En).
=IF(ISBLANK(A2), "Missing", "Not Missing")
Then, you can use this formula to create a row-based flag across all original columns:
=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")
Python
```python
import pandas as pd

# df is the original dataset from Method 1, before any imputation
# Create a column-based flag for each column that has missing values
for column in df.columns:
    if df[column].isnull().any():
        df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')

# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')
print(df)
```
R
```r
library(tidyverse)

# data is the original dataset from Method 1, before any imputation
# Create a column-based flag for every column
data <- data %>%
  mutate_at(vars(everything()), list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))

# Create a row-based flag
data <- data %>%
  rowwise() %>%
  mutate(Missing_Flag = ifelse(any(is.na(across(everything()))), "Missing", "Not Missing"))
print(data)
```
Result:
| ID | Name | Age | Salary | ID_Flag | Name_Flag | Age_Flag | Salary_Flag | Missing_Flag |
|---|---|---|---|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 102 | Bob | | 48000 | Not Missing | Not Missing | Missing | Not Missing | Missing |
| 103 | Charlie | 29 | | Not Missing | Not Missing | Not Missing | Missing | Missing |
| 104 | David | 33 | 62000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 105 | Eve | | | Not Missing | Not Missing | Missing | Missing | Missing |
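Once flags exist, they can feed directly into analysis. One common check, sketched here as an assumption-laden example rather than a prescribed workflow, is whether rows with missing data differ systematically from complete rows (for instance, in average salary), which hints at whether the missingness is random:

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Flag rows with any missing value, then compare the groups
df['Missing_Flag'] = df.isnull().any(axis=1).map({True: 'Missing', False: 'Not Missing'})

# Mean salary by flag; NaN salaries are skipped automatically by mean()
salary_by_flag = df.groupby('Missing_Flag')['Salary'].mean()
print(salary_by_flag)
```

A large gap between the groups suggests the data is not missing completely at random, in which case simple removal or mean imputation could bias the analysis.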
Source: Stony Brook University Libraries, https://guides.library.stonybrook.edu/data-cleaning-and-wrangling/cleaning
This work is licensed under a Creative Commons Attribution 4.0 License.