Data Cleaning Techniques
Handling Missing Data
Before training a machine learning model, it is important to ensure that the dataset is accurate, complete, and free from inconsistencies. Poor-quality data can lead to biased models and unreliable predictions. Data cleaning addresses these challenges by identifying and correcting problems such as missing values, duplicates, inconsistencies, and outliers.
Machine learning models rely on patterns in the data to make predictions. If the data contains errors or inconsistencies, a model may learn misleading patterns and produce incorrect results: biased estimates, reduced accuracy, and predictions that cannot be trusted. Applying appropriate data cleaning techniques minimizes these risks and helps models produce more meaningful and trustworthy insights.
Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:
- Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
- Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
- Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.
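Before choosing among these methods, it can help to quantify how much data is missing and where. A minimal sketch in pandas (using the small example dataset that appears below; column names are from that example):

```python
import pandas as pd

# Example dataset with missing values in Age and Salary
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Count missing values per column to see which columns are affected
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Count how many rows are affected overall
rows_with_missing = df.isnull().any(axis=1).sum()
print(f"{rows_with_missing} of {len(df)} rows contain missing values")
```

If only a small share of rows is affected, removal may be acceptable; if a large share is affected, imputation or flagging is usually safer.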
Let's use this dataset as an example:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | | 48000 |
| 103 | Charlie | 29 | |
| 104 | David | 33 | 62000 |
| 105 | Eve | | |
Method 1: Removal
Google Sheets and Microsoft Excel
In both applications, a straightforward approach is to apply a filter to each column, filter for blank cells, and delete the matching rows. In Excel, you can also select the data range, choose Find & Select > Go To Special > Blanks, and then delete the rows containing the highlighted cells.
Python
```python
import pandas as pd

# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
```
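`dropna()` with no arguments drops every row that has any missing value, which can be too aggressive. A hedged sketch of two of its standard parameters, `subset` and `thresh`, using the same example dataset:

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Drop a row only when a specific column (here Salary) is missing
df_salary_complete = df.dropna(subset=['Salary'])

# Keep rows that have at least 3 non-missing values
df_mostly_complete = df.dropna(thresh=3)

print(df_salary_complete['ID'].tolist())
print(df_mostly_complete['ID'].tolist())
```

This keeps rows 101, 102, and 104 in the first case (Bob's missing Age is tolerated), and drops only Eve's nearly empty row in the second.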
R
```r
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 105),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(25, NA, 29, 33, NA),
  Salary = c(55000, 48000, NA, 62000, NA)
)

# Remove rows with missing values
cleaned_data <- na.omit(data)
print(cleaned_data)
```
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 104 | David | 33 | 62000 |
Method 2: Imputation
Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula computes an Age_Imputed value in column E and works in both spreadsheet applications (AVERAGE ignores blank cells, so it averages only the observed ages).
=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)
Python
```python
# Impute missing values with the mean of each numeric column
# (df is the dataset created in Method 1)
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        df[column] = df[column].fillna(df[column].mean())
print(df)
```
R
```r
# Impute missing values with the mean (data is the dataset created in Method 1)
data <- data %>%
  mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
print(data)
```
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | 29 | 48000 |
| 103 | Charlie | 29 | 55000 |
| 104 | David | 33 | 62000 |
| 105 | Eve | 29 | 55000 |
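Mean imputation is sensitive to outliers: a single very large salary would pull every imputed value upward. A common alternative, sketched here as a variation on the Python code above, is to impute with the median instead (for this small example the median and mean happen to coincide):

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Impute numeric columns with the median, which is robust to outliers
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].median())

print(df)
```

Which statistic is appropriate depends on the distribution of the column: median for skewed data, mean for roughly symmetric data.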
Method 3: Flagging
Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.
Google Sheets and Microsoft Excel
Assuming the same spreadsheet structure, you can apply column-based flagging to the data in Cell A2 using the following formula, entered in Cell E2. You can then drag it across to cover the remaining columns (e.g., B2 flagged in F2, C2 in G2) and down to cover all rows (e.g., A3 flagged in E3, continuing through An and En).
=IF(ISBLANK(A2), "Missing", "Not Missing")
Then, you can use this formula to create a row-based flag across all original columns:
=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")
Python
```python
import pandas as pd

# df is the original dataset from Method 1, before any imputation
# Create a column-based flag for each column that has missing values
for column in df.columns:
    if df[column].isnull().any():
        df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')

# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')
print(df)
```
R
```r
library(tidyverse)

# data is the original dataset from Method 1, before any imputation
# Create a column-based flag for every column
data <- data %>%
  mutate_at(vars(everything()), list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))

# Create a row-based flag
data <- data %>%
  rowwise() %>%
  mutate(Missing_Flag = ifelse(any(is.na(across(everything()))), "Missing", "Not Missing"))
print(data)
```
Result:
| ID | Name | Age | Salary | ID_Flag | Name_Flag | Age_Flag | Salary_Flag | Missing_Flag |
|---|---|---|---|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 102 | Bob | | 48000 | Not Missing | Not Missing | Missing | Not Missing | Missing |
| 103 | Charlie | 29 | | Not Missing | Not Missing | Not Missing | Missing | Missing |
| 104 | David | 33 | 62000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 105 | Eve | | | Not Missing | Not Missing | Missing | Missing | Missing |
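Once flags exist, they can feed directly into analysis. One common check, sketched here as an assumption-laden example rather than a prescribed workflow, is whether rows with missing data differ systematically from complete rows (for instance, in average salary), which hints at whether the missingness is random:

```python
import pandas as pd

data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)

# Flag rows with any missing value, then compare the groups
df['Missing_Flag'] = df.isnull().any(axis=1).map({True: 'Missing', False: 'Not Missing'})

# Mean salary by flag; NaN salaries are skipped automatically by mean()
salary_by_flag = df.groupby('Missing_Flag')['Salary'].mean()
print(salary_by_flag)
```

A large gap between the groups suggests the data is not missing completely at random, in which case simple removal or mean imputation could bias the analysis.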
Source: Stony Brook University Libraries, https://guides.library.stonybrook.edu/data-cleaning-and-wrangling/cleaning
This work is licensed under a Creative Commons Attribution 4.0 License.