Data Cleaning Techniques
| Site: | Saylor University |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Data Cleaning Techniques |
| Printed by: | Guest user |
| Date: | Wednesday, April 15, 2026, 7:12 PM |
Handling Missing Data
Before training a machine learning model, it is important to ensure that the dataset is accurate, complete, and free from inconsistencies. Poor-quality data can lead to biased models, unreliable predictions, and reduced accuracy. Data cleaning addresses these challenges by identifying and correcting errors like missing values, duplicates, inconsistencies, and outliers.
Machine learning models rely on patterns in data to make predictions. If the data contains errors or inconsistencies, the model may learn misleading patterns, leading to incorrect outcomes. Common risks of using unclean data include bias in models, incorrect patterns, reduced accuracy, weakened model performance, and unreliable predictions. Applying appropriate data cleaning techniques can minimize these risks, helping models produce more meaningful and trustworthy insights.
Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:
- Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
- Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
- Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.
Let's use this dataset as an example:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | | 48000 |
| 103 | Charlie | 29 | |
| 104 | David | 33 | 62000 |
| 105 | Eve | | |
Method 1: Removal
Google Sheets and Microsoft Excel

Python
import pandas as pd
# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, None, 29, 33, None],
'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)
# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
R
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 105),
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, NA, 29, 33, NA),
Salary = c(55000, 48000, NA, 62000, NA)
)
# Remove rows with missing values
cleaned_data <- na.omit(data)
print(cleaned_data)
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 104 | David | 33 | 62000 |
Method 2: Imputation
Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula for Age_Imputed in column E should work in both spreadsheet applications.
=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)
Python
# Impute missing values in numeric columns with the column mean
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        df[column] = df[column].fillna(df[column].mean())
print(df)
R
# Impute missing values with the column mean
data <- data %>%
  mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
print(data)
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | 29 | 48000 |
| 103 | Charlie | 29 | 55000 |
| 104 | David | 33 | 62000 |
| 105 | Eve | 29 | 55000 |
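Mean imputation is sensitive to extreme values; when a numeric column is skewed, imputing the median is often a safer substitute. Below is a minimal sketch of that variation, re-creating the same example dataset (in this small example the median and mean happen to coincide, so the result matches the table above):

```python
import pandas as pd

# Same example dataset as above
df = pd.DataFrame({'ID': [101, 102, 103, 104, 105],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, None, 29, 33, None],
                   'Salary': [55000, 48000, None, 62000, None]})

# Replace missing values in numeric columns with the column median
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].median())

print(df)
```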
Method 3: Flagging
Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.
Google Sheets and Microsoft Excel
Assuming the same spreadsheet structure is used, you can apply column-based flagging based on the data in Cell A2 using the following formula, starting at Cell E2. You can then drag it across to apply it to the other columns (e.g., from B2 to F2, C2 to G2) and down across all rows (e.g., from A3 to E3, continuing through An to En).
=if(ISBLANK(A2),"Missing","Not Missing")
Then, you can use this formula to create a row-based flag across all original columns:
=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")
Python
import pandas as pd
# Create a column-based flag
# Create a column-based flag for every column
for column in df.columns:
    df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')
# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')
print(df)
R
library(tidyverse)
# Create a column-based flag
data <- data %>%
  mutate_at(vars(everything()), list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))
# Create a row-based flag
data <- data %>%
  rowwise() %>%
  mutate(Missing_Flag = ifelse(any(is.na(c_across(everything()))), "Missing", "Not Missing")) %>%
  ungroup()
print(data)
Result:
| ID | Name | Age | Salary | ID_Flag | Name_Flag | Age_Flag | Salary_Flag | Missing_Flag |
|---|---|---|---|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 102 | Bob | | 48000 | Not Missing | Not Missing | Missing | Not Missing | Missing |
| 103 | Charlie | 29 | | Not Missing | Not Missing | Not Missing | Missing | Missing |
| 104 | David | 33 | 62000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 105 | Eve | | | Not Missing | Not Missing | Missing | Missing | Missing |
Source: Stony Brook University Libraries, https://guides.library.stonybrook.edu/data-cleaning-and-wrangling/cleaning
This work is licensed under a Creative Commons Attribution 4.0 License.
Removing Duplicates
Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:
- Exact Match: Finding rows that are completely identical across all columns.
- Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).
Let's use this dataset as an example:
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 102 | Bob | 29 | 48000 | LA |
| 106 | Alice | 27 | 59000 | LA |
| 101 | Alice | 25 | 55000 | NY |
In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.
Google Sheets

Microsoft Excel

Python
import pandas as pd
# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
'Age': [25, 29, 29, 33, 29, 27, 25],
'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}
# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')
# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')
# Partial Match - based on specific columns only (e.g., 'Name')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')
R
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 102, 106, 101),
Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
Age = c(25, 29, 29, 33, 29, 27, 25),
Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)
# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")
# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'Name')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)
Result:
Exact Match
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (ID)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
Partial Match (ID and Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
As shown above, in this particular case, using a partial match approach based solely on "Name" yields a different result compared to other methods. This is because we had two distinct observations that share only one common attribute (i.e., "Name"). In other cases, different outcomes may occur. Therefore, it is crucial to carefully select the appropriate key identifiers when using a partial match approach.
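Before removing anything, it can help to count and inspect the rows each candidate key would flag. Below is a short illustrative check using pandas' `duplicated()` on the same dataset (an inspection step added here for illustration, not part of the workflow above):

```python
import pandas as pd

# Same example dataset as above
df = pd.DataFrame({'ID': [101, 102, 103, 104, 102, 106, 101],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
                   'Age': [25, 29, 29, 33, 29, 27, 25],
                   'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
                   'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']})

# Count how many rows each strategy would drop (keeping the first occurrence)
exact = df.duplicated().sum()                    # fully identical rows
by_id = df.duplicated(subset=['ID']).sum()       # same ID
by_name = df.duplicated(subset=['Name']).sum()   # same Name

print(exact, by_id, by_name)

# Inspect every row involved in a Name-only collision before deciding
print(df[df.duplicated(subset=['Name'], keep=False)])
```

Here the Name-only key would drop one more row than the other keys, which is exactly the distinct "Alice" (ID 106) discussed above.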
Normalizing Data Formats
Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:
- Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
- Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
- Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).
Case 1: Standardizing Dates
Let's use this dataset as an example:
| Event ID | Event Date |
|---|---|
| 1 | 12/01/2023 |
| 2 | December 2, 2023 |
| 3 | 2023-12-03 |
| 4 | 12.04.2023 |
| 5 | 5th Dec 2023 |
Google Sheets and Microsoft Excel


Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a hyphen - or slash /, or when they include ordinal suffixes like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters, or by using the "Find and Replace" feature.
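The same find-and-replace step can be scripted before parsing. A minimal sketch follows; the `preclean_date` helper is a hypothetical name introduced here for illustration:

```python
import re

def preclean_date(s):
    # Hypothetical helper: normalize separators and ordinal suffixes before parsing
    s = s.replace('.', '/')                        # '12.04.2023' -> '12/04/2023'
    s = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', s)  # '5th Dec 2023' -> '5 Dec 2023'
    return s

print(preclean_date('12.04.2023'))
print(preclean_date('5th Dec 2023'))
```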
Python
import pandas as pd
from dateutil.parser import parse
# Sample data
data = {
'Event ID': [1, 2, 3, 4, 5],
'Event Date': ['12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Custom function to parse dates
def parse_date(date_str):
    try:
        return parse(date_str, dayfirst=False)
    except ValueError:
        return pd.NaT
# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)
# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')
print("Result")
print(df)
R
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
`Event ID` = 1:5,
`Event Date` = c('12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")
# Convert 'Event Date' to Date format
data <- data %>%
  mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("mdy", "dmy", "ymd")))
# Standardize format to YYYY-MM-DD
data <- data %>%
mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))
cat("Result")
print(data)
Result:
| Event ID | Event Date |
|---|---|
| 1 | 2023-12-01 |
| 2 | 2023-12-02 |
| 3 | 2023-12-03 |
| 4 | 2023-12-04 |
| 5 | 2023-12-05 |
Case 2: Text Case Standardization
Let's use this dataset as an example:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | Banana |
| 103 | ORANGE |
| 104 | baNana |
| 105 | Apple |
Google Sheets and Microsoft Excel
Assuming the Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications starting in Cell C2:
=LOWER(B2)
You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn).
Python
import pandas as pd
# Sample data
data = {
'Product ID': [101, 102, 103, 104, 105],
'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}
df = pd.DataFrame(data)
# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()
print(df)
R
library(tidyverse)
# Sample data
data <- tibble(
`Product ID` = c(101, 102, 103, 104, 105),
`Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)
# Convert 'Product Name' to lowercase
data <- data %>%
mutate(`Product Name` = tolower(`Product Name`))
print(data)
Result:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | banana |
| 103 | orange |
| 104 | banana |
| 105 | apple |
Case 3: Unit Conversion
Let's use this dataset as an example:
| Sample ID | Weight |
|---|---|
| A1 | 70 kg |
| A2 | 154 lbs |
| A3 | 68 KG |
| A4 | 160 pound |
| A5 | 72 kilograms |
Google Sheets and Microsoft Excel

Formulas:
1. Extracting numeric value from Weight (Column B) and placing it in Column C:
Google Sheets (using REGEXEXTRACT)
=VALUE(REGEXEXTRACT(B2, "\d+\.?\d*"))
For both Google Sheets and Microsoft Excel:
=VALUE(LEFT(B2, FIND(" ", B2) - 1))
2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:
Google Sheets (using REGEXEXTRACT)
=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))
For both Google Sheets and Microsoft Excel:
=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))
3. Converting the weight to kg based on the original unit and placing it in Column E:
For both Google Sheets and Microsoft Excel:
=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))
Python
import pandas as pd
# Sample data
data = {
'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}
df = pd.DataFrame(data)
# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)
# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series
# Conversion function
def convert_to_kg(row):
    if row['Unit'] in ['kg', 'kilogram', 'kilograms']:
        return row['Numeric Value']
    elif row['Unit'] in ['lb', 'lbs', 'pound', 'pounds']:
        return row['Numeric Value'] * 0.45359237
    else:
        return None  # or handle unknown units
df['Weight in kg'] = df.apply(convert_to_kg, axis=1)
print(df[['Sample ID', 'Weight', 'Weight in kg']])
R
library(tidyverse)
# Sample data
data <- tibble(
`Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)
# Extract numeric value and unit
data <- data %>%
mutate(
`Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
)
# Conversion function
convert_to_kg <- function(value, unit) {
  if (unit %in% c('kg', 'kilogram', 'kilograms')) {
    return(value)
  } else if (unit %in% c('lb', 'lbs', 'pound', 'pounds')) {
    return(value * 0.45359237)
  } else {
    return(NA)
  }
}
# Apply conversion
data <- data %>%
rowwise() %>%
mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
ungroup()
print(data %>% select(`Sample ID`, Weight, `Weight in kg`))
Result:
| Sample ID | Weight | Numeric Value | Unit | Weight in kg |
|---|---|---|---|---|
| A1 | 70 kg | 70 | kg | 70 |
| A2 | 154 lbs | 154 | lbs | 69.85322498 |
| A3 | 68 KG | 68 | kg | 68 |
| A4 | 160 pound | 160 | pound | 72.5747792 |
| A5 | 72 kilograms | 72 | kilograms | 72 |
Outlier Detection and Handling
Outliers are data points that deviate significantly from other observations in the dataset. They can be the result of errors or true anomalies, and handling them appropriately is essential for accurate analysis. Below are some common methods for handling outliers:
- Visual Inspection: Using plots like box plots or scatter plots to identify outliers visually.
- Statistical Methods: Using standard deviations, interquartile range (IQR), or z-scores to detect outliers.
- Decision on Handling: Decide whether to keep, remove, or transform outliers based on their impact on the analysis.
Let's use this annual income dataset as an example:
| Person ID | Income ($) |
|---|---|
| 1 | 80,000 |
| 2 | 95,000 |
| 3 | 115,000 |
| 4 | 75,000 |
| 5 | 110,000 |
| 6 | 85,000 |
| 7 | 100,000 |
| 8 | 199,000 |
| 9 | 105,000 |
| 10 | 90,000 |
Visual Inspection
Python
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Box Plot
plt.figure(figsize=(8, 4))
plt.boxplot(df['Income'], vert=False, showfliers=True)
plt.title('Box Plot of Incomes')
plt.xlabel('Income ($)')
plt.show()
# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(df['Person ID'], df['Income'])
plt.title('Scatter Plot of Incomes')
plt.xlabel('Person ID')
plt.ylabel('Income ($)')
plt.show()
R
library(tidyverse)
# Sample data
data <- tibble(
`Person ID` = 1:10,
`Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Box Plot
ggplot(data, aes(y = Income)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
ggtitle("Box Plot of Incomes") +
ylab("Income ($)")
# Scatter Plot
ggplot(data, aes(x = `Person ID`, y = Income)) +
geom_point() +
ggtitle("Scatter Plot of Incomes") +
xlab("Person ID") +
ylab("Income ($)")


Analysis:
Box Plot:
- The box plot shows the distribution of incomes, where the income of $199,000 appears as an outlier above the upper whisker.
Scatter Plot:
- The scatter plot displays Person 8's income distinctly higher than others.
Statistical Methods
Python
import pandas as pd
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
# Calculate bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
df['Outlier'] = df['Income'].apply(lambda x: 'Outlier' if x < lower_bound or x > upper_bound else 'Normal')
print(df)
R
library(tidyverse)
# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Calculate Q1 and Q3
Q1 <- quantile(data$Income, 0.25)
Q3 <- quantile(data$Income, 0.75)
IQR <- Q3 - Q1
# Calculate bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
data <- data %>%
  mutate(Outlier = ifelse(Income < lower_bound | Income > upper_bound, "Outlier", "Normal"))
print(data)
Results:
| Person ID | Income ($) | Outlier |
|---|---|---|
| 1 | 80,000 | Normal |
| 2 | 95,000 | Normal |
| 3 | 115,000 | Normal |
| 4 | 75,000 | Normal |
| 5 | 110,000 | Normal |
| 6 | 85,000 | Normal |
| 7 | 100,000 | Normal |
| 8 | 199,000 | Outlier |
| 9 | 105,000 | Normal |
| 10 | 90,000 | Normal |
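The z-score method mentioned earlier can be sketched in the same way. Note that the threshold of 2 below is an assumption chosen for this small sample; the common default of 3 would not flag the $199,000 income here, whose z-score is only about 2.65:

```python
import pandas as pd

# Same income dataset as above
df = pd.DataFrame({
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000,
               85000, 100000, 199000, 105000, 90000]
})

# z-score: number of standard deviations each income lies from the mean
z = (df['Income'] - df['Income'].mean()) / df['Income'].std()

# Flag values beyond the threshold in either direction
df['Outlier'] = (z.abs() > 2).map(lambda flag: 'Outlier' if flag else 'Normal')
print(df)
```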
Decision on Handling Outliers
Let's assess the impact of the identified outlier on the overall dataset.
Python
# Mean and Standard Deviation with Outlier
mean_with_outlier = df['Income'].mean()
std_with_outlier = df['Income'].std()
# Mean and Standard Deviation without Outlier
df_clean = df[df['Outlier'] == 'Normal']
mean_without_outlier = df_clean['Income'].mean()
std_without_outlier = df_clean['Income'].std()
print("Income with Outlier")
print(f"Mean: ${mean_with_outlier:,.2f}")
print(f"SD: ${std_with_outlier:,.2f}")
print("Income without Outlier")
print(f"Mean: ${mean_without_outlier:,.2f}")
print(f"SD: ${std_without_outlier:,.2f}")
R
# Mean and Standard Deviation with Outlier
mean_with_outlier <- mean(data$Income)
std_with_outlier <- sd(data$Income)
# Filter out the outliers
data_clean <- data %>% filter(Outlier == "Normal")
# Mean and Standard Deviation without Outlier
mean_without_outlier <- mean(data_clean$Income)
std_without_outlier <- sd(data_clean$Income)
# Print results
cat("Income with Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_with_outlier))
cat(sprintf("SD: $%.2f\n", std_with_outlier))
cat("\nIncome without Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_without_outlier))
cat(sprintf("SD: $%.2f\n", std_without_outlier))
Results:
Income with Outlier
Mean: $105,400.00
SD: $35,330.82
Income without Outlier
Mean: $95,000.00
SD: $13,693.06
As you can see, the outlier increases the mean by $10,400 and the standard deviation by $21,637.76. Based on this finding, you have three options:
- Keep the Outlier
- This is an option if you can verify the income is accurate and the person does not represent a different income group (e.g., executive level rather than regular staff).
- This is also a good option if this data point is important for your purpose (e.g., showing income distribution and inequality).
- Remove the Outlier
- This is an option if the outlier skews the data and is not representative.
- This is also appropriate for analyses focusing on the typical income range.
- Transform the Data
- Applying a logarithmic transformation can help reduce skewness in your data distribution.
- This is another option that allows inclusion of the outlier while minimizing its impact.
- Note, however, that transformed data can be more difficult to interpret.
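As a rough sketch of the logarithmic transformation option, using the same incomes (assuming `numpy` and `pandas` are available):

```python
import numpy as np
import pandas as pd

# Same incomes as above, including the $199,000 outlier
incomes = pd.Series([80000, 95000, 115000, 75000, 110000,
                     85000, 100000, 199000, 105000, 90000])

# The log transform compresses the right tail, pulling the outlier closer to the rest
log_incomes = np.log(incomes)

# Skewness drops substantially after the transformation
print(incomes.skew(), log_incomes.skew())
```

The transformed values are in log-dollars, so summary statistics must be interpreted (or back-transformed with `np.exp`) accordingly.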
In the end, the decision on how to handle outliers is yours, as long as it is based on a thoughtful and informed rationale. Whether you choose to keep, remove, or transform outliers should depend on the context of your analysis and the impact these values have on your results. Outliers can provide valuable insights or skew your data, so it is important to consider the nature of your data, the objectives of your analysis, and how outliers affect the accuracy and interpretation of your findings. Ultimately, your approach should align with your goals, ensuring that the conclusions you draw are both robust and credible.
Correcting Data Entry Errors
Data entry errors are a common challenge, particularly in manual data collection processes. Left unaddressed, these errors can skew analyses, lead to incorrect conclusions, and compromise the integrity of your dataset. Therefore, it is essential to identify and correct them to ensure data accuracy and reliability. Below are some common methods for detecting and fixing data entry errors:
1. Validation Rules
Validation rules help prevent errors before they even occur by setting up conditions that data must meet. This method ensures that values entered into the system are consistent, logical, and fall within expected ranges. Examples of validation rules include:
- Date of Birth cannot be a future date.
- Age should be within a reasonable range (e.g., 0 to 120 years).
- Email Address should follow the standard format (e.g., name@domain.com).
Some examples of validation rules are:
Google Sheets

Microsoft Excel

Python
import pandas as pd
from datetime import datetime
# Sample data
data = {
'Name': ['John Doe', 'Jane Smith', 'Alice Johnson'],
'DOB': ['1990-01-15', '2094-05-10', '1985-07-23']
}
df = pd.DataFrame(data)
# Convert 'DOB' to datetime and flag future dates
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
df['Valid DOB'] = df['DOB'] <= pd.Timestamp(datetime.today())
print(df)
R
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
  Name = c("John Doe", "Jane Smith", "Alice Johnson"),
  DOB = c("1990-01-15", "2094-05-10", "1985-07-23")
)
# Convert 'DOB' to Date and flag future dates
data <- data %>%
  mutate(DOB = ymd(DOB),
         `Valid DOB` = DOB <= today())
print(data)
Results:
| Name | DOB | Valid DOB |
|---|---|---|
| John Doe | 1990-01-15 | TRUE |
| Jane Smith | 2094-05-10 | FALSE |
| Alice Johnson | 1985-07-23 | TRUE |
2. Manual Review
For key variables, especially in small datasets, a manual review of entries is often necessary. This method involves going through the dataset by hand to identify inconsistencies, missing values, or obvious mistakes. While time-consuming, manual review is highly effective in catching subtle errors that automated tools might miss, such as transposed digits (e.g., entering "1987" instead of "1897" for a birth year).
3. Automated Checks
Automated checks leverage software tools to automatically identify and flag common data entry errors. These checks can be programmed to detect outliers, incorrect formats, missing values, or other anomalies. Examples of automated checks include:
- Range Checks: Ensuring all values fall within predefined limits.
- Consistency Checks: Verifying that related fields make sense together (e.g., the “end date” of an event cannot be earlier than the “start date”).
- Format Validation: Checking that data entries adhere to a specific format (e.g., email addresses must include "@" and a domain).
- Spelling and Variations Detection: Automated tools can identify spelling errors and variations in text data (e.g., “USA,” “United States,” “U.S.”). This ensures consistency in categorical data, which can otherwise lead to inaccurate groupings or analysis. Some tools can flag these inconsistencies or automatically correct them to a predefined standard.
Automated tools like Apps Script in Google Sheets, Macros in Microsoft Excel, Python, and R allow you to perform these checks efficiently, especially for larger datasets. Python and R, in particular, offer robust libraries for data validation and error correction.
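As a minimal illustration of a range check and a format validation in Python (the sample entries are hypothetical, and the email regex below is a simplified stand-in, not a complete validator):

```python
import re
import pandas as pd

# Hypothetical sample entries for illustration
df = pd.DataFrame({
    'Age': [34, 150, 27],
    'Email': ['name@domain.com', 'not-an-email', 'user@site.org']
})

# Range check: ages must fall within 0-120
df['Valid Age'] = df['Age'].between(0, 120)

# Format validation: entry must look like name@domain.tld
email_pattern = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
df['Valid Email'] = df['Email'].map(lambda s: bool(email_pattern.match(s)))

print(df)
```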

