Data Cleaning Techniques

Handling Missing Data

Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:

  1. Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
  2. Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
  3. Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.

Let's use this dataset as an example:

Sample Employee Data Including ID, Name, Age, and Salary
ID Name Age Salary
101 Alice 25 55000
102 Bob   48000
103 Charlie 29  
104 David 33 62000
105 Eve    


Method 1: Removal

Google Sheets and Microsoft Excel

 Removal
Google Sheets and Microsoft Excel

Python
import pandas as pd

# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}

df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()

print(df_cleaned)
R
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 105),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(25, NA, 29, 33, NA),
  Salary = c(55000, 48000, NA, 62000, NA)
)

# Remove rows with missing values
cleaned_data <- na.omit(data)

print(cleaned_data)
Result:
Sample Employee Data Subset: ID, Name, Age, and Salary
ID Name Age Salary
101 Alice 25 55000
104 David 33 62000


Method 2: Imputation

Google Sheets and Microsoft Excel

 Imputation
Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula for Age_Imputed in column E should work in both spreadsheet applications.

=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)

Python
# Impute missing values with the mean of the column
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64': 
        df[column].fillna(df[column].mean(), inplace=True) 

print(df)
R
# Impute missing values with the mean
data <- data %>%
  mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))

print(data)
Result:
Sample Employee Data: ID, Name, Age, and Salary
ID Name Age Salary
101 Alice 25 55000
102 Bob 29 48000
103 Charlie 29 55000
104 David 33 62000
105 Eve 29 55000


Method 3: Flagging

Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.

Google Sheets and Microsoft Excel

Assuming the same spreadsheet structure is used, you can apply column-based flagging based on the data in Cell A2 using the following formula, starting at Cell E2. You can then drag it down to apply it to multiple columns (e.g., from B2 to F2, C2 to G2) and across all rows (e.g., from A3 to E3, continuing through An to En).

=if(ISBLANK(A2),"Missing","Not Missing")

Then, you can use this formula to create a row-based flag across all original columns:

=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")

Python
import pandas as pd

# Create a column-based flag
for column in df.columns:
    if df[column].isnull().any():
        df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')

# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')

print(df)
R
library(tidyverse)

# Create a column-based flag
data <- data %>%
  mutate_at(vars(everything()), 
            list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))

# Create a row-based flag
data <- data %>%
  rowwise() %>%
  mutate(Missing_Flag = ifelse(any(is.na(across(everything()))), "Missing", "Not Missing"))

print(data)
Result:
Sample Employee Data with Missing Value Flags
ID Name Age Salary ID_Flagged Name_Flagged Age_Flagged Salary_Flagged Missing_Flagged
101 Alice 25 55000 Not Missing Not Missing Not Missing Not Missing Not Missing
102 Bob 48000 Not Missing Not Missing Missing Not Missing Missing
103 Charlie 29 Not Missing Not Missing Not Missing Missing Missing
104 David 33 62000 Not Missing Not Missing Not Missing Not Missing Not Missing
105 Eve Not Missing Not Missing Missing Missing Missing

Source: Stony Brook University Libraries, https://guides.library.stonybrook.edu/data-cleaning-and-wrangling/cleaning
Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License.