Data Cleaning Techniques
| Site: | Saylor University |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Data Cleaning Techniques |
| Printed by: | Guest user |
| Date: | Wednesday, April 15, 2026, 7:12 PM |
Handling Missing Data
Before training a machine learning model, it is important to ensure that the dataset is accurate, complete, and free from inconsistencies. Poor-quality data can lead to biased models, unreliable predictions, and reduced accuracy. Data cleaning addresses these challenges by identifying and correcting errors like missing values, duplicates, inconsistencies, and outliers.
Machine learning models rely on patterns in data to make predictions. If the data contains errors or inconsistencies, the model may learn misleading patterns, leading to incorrect outcomes. Common risks of using unclean data include bias in models, incorrect patterns, reduced accuracy, weakened model performance, and unreliable predictions. Applying appropriate data cleaning techniques can minimize these risks, helping models produce more meaningful and trustworthy insights.
Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:
- Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
- Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
- Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.
Let's use this dataset as an example:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | | 48000 |
| 103 | Charlie | 29 | |
| 104 | David | 33 | 62000 |
| 105 | Eve | | |
Method 1: Removal
Google Sheets and Microsoft Excel

Python
import pandas as pd
# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, None, 29, 33, None],
'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)
# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
R
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 105),
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, NA, 29, 33, NA),
Salary = c(55000, 48000, NA, 62000, NA)
)
# Remove rows with missing values
cleaned_data <- na.omit(data)
print(cleaned_data)
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 104 | David | 33 | 62000 |
Method 2: Imputation
Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula for Age_Imputed in column E should work in both spreadsheet applications.
=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)
Python
# Impute missing values in numeric columns with the column mean
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        df[column] = df[column].fillna(df[column].mean())
print(df)
R
# Impute missing values with the column mean
data <- data %>%
  mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
print(data)
Result:
| ID | Name | Age | Salary |
|---|---|---|---|
| 101 | Alice | 25 | 55000 |
| 102 | Bob | 29 | 48000 |
| 103 | Charlie | 29 | 55000 |
| 104 | David | 33 | 62000 |
| 105 | Eve | 29 | 55000 |
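Mean imputation is sensitive to extreme values; when a numeric column is skewed, imputing the median is often a safer substitute. Below is a minimal sketch of that variation, re-creating the same example dataset (in this small example the median and mean happen to coincide, so the result matches the table above):

```python
import pandas as pd

# Same example dataset as above
df = pd.DataFrame({'ID': [101, 102, 103, 104, 105],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, None, 29, 33, None],
                   'Salary': [55000, 48000, None, 62000, None]})

# Replace missing values in numeric columns with the column median
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].median())

print(df)
```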
Method 3: Flagging
Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.
Google Sheets and Microsoft Excel
Assuming the same spreadsheet structure is used, you can apply column-based flagging based on the data in Cell A2 using the following formula, starting at Cell E2. You can then drag it across to apply it to the other columns (e.g., from B2 to F2, C2 to G2) and down across all rows (e.g., from A3 to E3, continuing through An to En).
=if(ISBLANK(A2),"Missing","Not Missing")
Then, you can use this formula to create a row-based flag across all original columns:
=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")
Python
import pandas as pd
# Create a column-based flag
# Create a column-based flag for every column
for column in df.columns:
    df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')
# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')
print(df)
R
library(tidyverse)
# Create a column-based flag
data <- data %>%
  mutate_at(vars(everything()), list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))
# Create a row-based flag
data <- data %>%
  rowwise() %>%
  mutate(Missing_Flag = ifelse(any(is.na(c_across(everything()))), "Missing", "Not Missing")) %>%
  ungroup()
print(data)
Result:
| ID | Name | Age | Salary | ID_Flag | Name_Flag | Age_Flag | Salary_Flag | Missing_Flag |
|---|---|---|---|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 102 | Bob | | 48000 | Not Missing | Not Missing | Missing | Not Missing | Missing |
| 103 | Charlie | 29 | | Not Missing | Not Missing | Not Missing | Missing | Missing |
| 104 | David | 33 | 62000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
| 105 | Eve | | | Not Missing | Not Missing | Missing | Missing | Missing |
Source: Stony Brook University Libraries, https://guides.library.stonybrook.edu/data-cleaning-and-wrangling/cleaning
This work is licensed under a Creative Commons Attribution 4.0 License.
Removing Duplicates
Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:
- Exact Match: Finding rows that are completely identical across all columns.
- Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).
Let's use this dataset as an example:
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 102 | Bob | 29 | 48000 | LA |
| 106 | Alice | 27 | 59000 | LA |
| 101 | Alice | 25 | 55000 | NY |
In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.
Google Sheets

Microsoft Excel

Python
import pandas as pd
# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
'Age': [25, 29, 29, 33, 29, 27, 25],
'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}
# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')
# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')
# Partial Match - based on specific columns only (e.g., 'Name')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')
R
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 102, 106, 101),
Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
Age = c(25, 29, 29, 33, 29, 27, 25),
Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)
# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")
# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'Name')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)
Result:
Exact Match
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (ID)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
Partial Match (ID and Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
As shown above, in this particular case, using a partial match approach based solely on "Name" yields a different result compared to other methods. This is because we had two distinct observations that share only one common attribute (i.e., "Name"). In other cases, different outcomes may occur. Therefore, it is crucial to carefully select the appropriate key identifiers when using a partial match approach.
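Before removing anything, it can help to count and inspect the rows each candidate key would flag. Below is a short illustrative check using pandas' `duplicated()` on the same dataset (an inspection step added here for illustration, not part of the workflow above):

```python
import pandas as pd

# Same example dataset as above
df = pd.DataFrame({'ID': [101, 102, 103, 104, 102, 106, 101],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
                   'Age': [25, 29, 29, 33, 29, 27, 25],
                   'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
                   'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']})

# Count how many rows each strategy would drop (keeping the first occurrence)
exact = df.duplicated().sum()                    # fully identical rows
by_id = df.duplicated(subset=['ID']).sum()       # same ID
by_name = df.duplicated(subset=['Name']).sum()   # same Name

print(exact, by_id, by_name)

# Inspect every row involved in a Name-only collision before deciding
print(df[df.duplicated(subset=['Name'], keep=False)])
```

Here the Name-only key would drop one more row than the other keys, which is exactly the distinct "Alice" (ID 106) discussed above.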
Normalizing Data Formats
Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:
- Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
- Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
- Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).
Case 1: Standardizing Dates
Let's use this dataset as an example:
| Event ID | Event Date |
|---|---|
| 1 | 12/01/2023 |
| 2 | December 2, 2023 |
| 3 | 2023-12-03 |
| 4 | 12.04.2023 |
| 5 | 5th Dec 2023 |
Google Sheets and Microsoft Excel


Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a hyphen - or slash /, or when they include ordinal suffixes like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters, or by using the "Find and Replace" feature.
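The same find-and-replace step can be scripted before parsing. A minimal sketch follows; the `preclean_date` helper is a hypothetical name introduced here for illustration:

```python
import re

def preclean_date(s):
    # Hypothetical helper: normalize separators and ordinal suffixes before parsing
    s = s.replace('.', '/')                        # '12.04.2023' -> '12/04/2023'
    s = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', s)  # '5th Dec 2023' -> '5 Dec 2023'
    return s

print(preclean_date('12.04.2023'))
print(preclean_date('5th Dec 2023'))
```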
Python
import pandas as pd
from dateutil.parser import parse
# Sample data
data = {
'Event ID': [1, 2, 3, 4, 5],
'Event Date': ['12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Custom function to parse dates
def parse_date(date_str):
    try:
        return parse(date_str, dayfirst=False)
    except ValueError:
        return pd.NaT
# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)
# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')
print("Result")
print(df)
R
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
`Event ID` = 1:5,
`Event Date` = c('12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")
# Convert 'Event Date' to Date format
data <- data %>%
  mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("mdy", "dmy", "ymd")))
# Standardize format to YYYY-MM-DD
data <- data %>%
mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))
cat("Result")
print(data)
Result:
| Event ID | Event Date |
|---|---|
| 1 | 2023-12-01 |
| 2 | 2023-12-02 |
| 3 | 2023-12-03 |
| 4 | 2023-12-04 |
| 5 | 2023-12-05 |
Case 2: Text Case Standardization
Let's use this dataset as an example:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | Banana |
| 103 | ORANGE |
| 104 | baNana |
| 105 | Apple |
Google Sheets and Microsoft Excel
Assuming the Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications starting in Cell C2:
=LOWER(B2)
You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn).
Python
import pandas as pd
# Sample data
data = {
'Product ID': [101, 102, 103, 104, 105],
'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}
df = pd.DataFrame(data)
# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()
print(df)
R
library(tidyverse)
# Sample data
data <- tibble(
`Product ID` = c(101, 102, 103, 104, 105),
`Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)
# Convert 'Product Name' to lowercase
data <- data %>%
mutate(`Product Name` = tolower(`Product Name`))
print(data)
Result:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | banana |
| 103 | orange |
| 104 | banana |
| 105 | apple |
Case 3: Unit Conversion
Let's use this dataset as an example:
| Sample ID | Weight |
|---|---|
| A1 | 70 kg |
| A2 | 154 lbs |
| A3 | 68 KG |
| A4 | 160 pound |
| A5 | 72 kilograms |
Google Sheets and Microsoft Excel

Formulas:
1. Extracting numeric value from Weight (Column B) and placing it in Column C:
Google Sheets (using REGEXEXTRACT)
=VALUE(REGEXEXTRACT(B2, "\d+\.?\d*"))
For both Google Sheets and Microsoft Excel:
=VALUE(LEFT(B2, FIND(" ", B2) - 1))
2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:
Google Sheets (using REGEXEXTRACT)
=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))
For both Google Sheets and Microsoft Excel:
=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))
3. Converting the weight to kg based on the original unit and placing it in Column E:
For both Google Sheets and Microsoft Excel:
=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))
Python
import pandas as pd
# Sample data
data = {
'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}
df = pd.DataFrame(data)
# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)
# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series
# Conversion function
def convert_to_kg(row):
    if row['Unit'] in ['kg', 'kilogram', 'kilograms']:
        return row['Numeric Value']
    elif row['Unit'] in ['lb', 'lbs', 'pound', 'pounds']:
        return row['Numeric Value'] * 0.45359237
    else:
        return None  # or handle unknown units
df['Weight in kg'] = df.apply(convert_to_kg, axis=1)
print(df[['Sample ID', 'Weight', 'Weight in kg']])
R
library(tidyverse)
# Sample data
data <- tibble(
`Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)
# Extract numeric value and unit
data <- data %>%
mutate(
`Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
)
# Conversion function
convert_to_kg <- function(value, unit) {
  if (unit %in% c('kg', 'kilogram', 'kilograms')) {
    return(value)
  } else if (unit %in% c('lb', 'lbs', 'pound', 'pounds')) {
    return(value * 0.45359237)
  } else {
    return(NA)
  }
}
# Apply conversion
data <- data %>%
rowwise() %>%
mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
ungroup()
print(data %>% select(`Sample ID`, Weight, `Weight in kg`))
Result:
| Sample ID | Weight | Numeric Value | Unit | Weight in kg |
|---|---|---|---|---|
| A1 | 70 kg | 70 | kg | 70 |
| A2 | 154 lbs | 154 | lbs | 69.85322498 |
| A3 | 68 KG | 68 | kg | 68 |
| A4 | 160 pound | 160 | pound | 72.5747792 |
| A5 | 72 kilograms | 72 | kilograms | 72 |
Outlier Detection and Handling
Outliers are data points that deviate significantly from other observations in the dataset. They can be the result of errors or true anomalies, and handling them appropriately is essential for accurate analysis. Below are some common methods for handling outliers:
- Visual Inspection: Using plots like box plots or scatter plots to identify outliers visually.
- Statistical Methods: Using standard deviations, interquartile range (IQR), or z-scores to detect outliers.
- Decision on Handling: Decide whether to keep, remove, or transform outliers based on their impact on the analysis.
Let's use this annual income dataset as an example:
| Person ID | Income ($) |
|---|---|
| 1 | 80,000 |
| 2 | 95,000 |
| 3 | 115,000 |
| 4 | 75,000 |
| 5 | 110,000 |
| 6 | 85,000 |
| 7 | 100,000 |
| 8 | 199,000 |
| 9 | 105,000 |
| 10 | 90,000 |
Visual Inspection
Python
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Box Plot
plt.figure(figsize=(8, 4))
plt.boxplot(df['Income'], vert=False, showfliers=True)
plt.title('Box Plot of Incomes')
plt.xlabel('Income ($)')
plt.show()
# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(df['Person ID'], df['Income'])
plt.title('Scatter Plot of Incomes')
plt.xlabel('Person ID')
plt.ylabel('Income ($)')
plt.show()
R
library(tidyverse)
# Sample data
data <- tibble(
`Person ID` = 1:10,
`Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Box Plot
ggplot(data, aes(y = Income)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
ggtitle("Box Plot of Incomes") +
ylab("Income ($)")
# Scatter Plot
ggplot(data, aes(x = `Person ID`, y = Income)) +
geom_point() +
ggtitle("Scatter Plot of Incomes") +
xlab("Person ID") +
ylab("Income ($)")


Analysis:
Box Plot:
- The box plot shows the distribution of incomes, where the income of $199,000 appears as an outlier above the upper whisker.
Scatter Plot:
- The scatter plot displays Person 8's income distinctly higher than others.
Statistical Methods
Python
import pandas as pd
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
# Calculate bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
df['Outlier'] = df['Income'].apply(lambda x: 'Outlier' if x < lower_bound or x > upper_bound else 'Normal')
print(df)
R
library(tidyverse)
# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Calculate Q1 and Q3
Q1 <- quantile(data$Income, 0.25)
Q3 <- quantile(data$Income, 0.75)
IQR <- Q3 - Q1
# Calculate bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
data <- data %>%
  mutate(Outlier = ifelse(Income < lower_bound | Income > upper_bound, "Outlier", "Normal"))
print(data)
Results:
| Person ID | Income ($) | Outlier |
|---|---|---|
| 1 | 80,000 | Normal |
| 2 | 95,000 | Normal |
| 3 | 115,000 | Normal |
| 4 | 75,000 | Normal |
| 5 | 110,000 | Normal |
| 6 | 85,000 | Normal |
| 7 | 100,000 | Normal |
| 8 | 199,000 | Outlier |
| 9 | 105,000 | Normal |
| 10 | 90,000 | Normal |
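The z-score method mentioned earlier can be sketched in the same way. Note that the threshold of 2 below is an assumption chosen for this small sample; the common default of 3 would not flag the $199,000 income here, whose z-score is only about 2.65:

```python
import pandas as pd

# Same income dataset as above
df = pd.DataFrame({
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000,
               85000, 100000, 199000, 105000, 90000]
})

# z-score: number of standard deviations each income lies from the mean
z = (df['Income'] - df['Income'].mean()) / df['Income'].std()

# Flag values beyond the threshold in either direction
df['Outlier'] = (z.abs() > 2).map(lambda flag: 'Outlier' if flag else 'Normal')
print(df)
```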
Decision on Handling Outliers
Let's assess the impact of the identified outlier on the overall dataset.
Python
# Mean and Standard Deviation with Outlier
mean_with_outlier = df['Income'].mean()
std_with_outlier = df['Income'].std()
# Mean and Standard Deviation without Outlier
df_clean = df[df['Outlier'] == 'Normal']
mean_without_outlier = df_clean['Income'].mean()
std_without_outlier = df_clean['Income'].std()
print("Income with Outlier")
print(f"Mean: ${mean_with_outlier:,.2f}")
print(f"SD: ${std_with_outlier:,.2f}")
print("Income without Outlier")
print(f"Mean: ${mean_without_outlier:,.2f}")
print(f"SD: ${std_without_outlier:,.2f}")
R
# Mean and Standard Deviation with Outlier
mean_with_outlier <- mean(data$Income)
std_with_outlier <- sd(data$Income)
# Filter out the outliers
data_clean <- data %>% filter(Outlier == "Normal")
# Mean and Standard Deviation without Outlier
mean_without_outlier <- mean(data_clean$Income)
std_without_outlier <- sd(data_clean$Income)
# Print results
cat("Income with Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_with_outlier))
cat(sprintf("SD: $%.2f\n", std_with_outlier))
cat("\nIncome without Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_without_outlier))
cat(sprintf("SD: $%.2f\n", std_without_outlier))
Results:
Income with Outlier
Mean: $105,400.00
SD: $35,330.82
Income without Outlier
Mean: $95,000.00
SD: $13,693.06
As you can see, the outlier increases the mean by $10,400 and the standard deviation by $21,637.76. Based on this finding, you have three options:
- Keep the Outlier
- This is an option if you can verify the income is accurate and the person does not represent a different income group (e.g., executive level rather than regular staff).
- This is also a good option if this data point is important for your purpose (e.g., showing income distribution and inequality).
- Remove the Outlier
- This is an option if the outlier skews the data and is not representative.
- This is also appropriate for analyses focusing on the typical income range.
- Transform the Data
- Applying a logarithmic transformation can help reduce skewness in your data distribution.
- This is another option that allows inclusion of the outlier while minimizing its impact.
- Note, however, that transformed data can be more difficult to interpret.
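As a rough sketch of the logarithmic transformation option, using the same incomes (assuming `numpy` and `pandas` are available):

```python
import numpy as np
import pandas as pd

# Same incomes as above, including the $199,000 outlier
incomes = pd.Series([80000, 95000, 115000, 75000, 110000,
                     85000, 100000, 199000, 105000, 90000])

# The log transform compresses the right tail, pulling the outlier closer to the rest
log_incomes = np.log(incomes)

# Skewness drops substantially after the transformation
print(incomes.skew(), log_incomes.skew())
```

The transformed values are in log-dollars, so summary statistics must be interpreted (or back-transformed with `np.exp`) accordingly.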
In the end, the decision on how to handle outliers is yours, as long as it is based on a thoughtful and informed rationale. Whether you choose to keep, remove, or transform outliers should depend on the context of your analysis and the impact these values have on your results. Outliers can provide valuable insights or skew your data, so it is important to consider the nature of your data, the objectives of your analysis, and how outliers affect the accuracy and interpretation of your findings. Ultimately, your approach should align with your goals, ensuring that the conclusions you draw are both robust and credible.
Correcting Data Entry Errors
Data entry errors are a common challenge, particularly in manual data collection processes. Left unaddressed, these errors can skew analyses, lead to incorrect conclusions, and compromise the integrity of your dataset. Therefore, it is essential to identify and correct them to ensure data accuracy and reliability. Below are some common methods for detecting and fixing data entry errors:
1. Validation Rules
Validation rules help prevent errors before they even occur by setting up conditions that data must meet. This method ensures that values entered into the system are consistent, logical, and fall within expected ranges. Examples of validation rules include:
- Date of Birth cannot be a future date.
- Age should be within a reasonable range (e.g., 0 to 120 years).
- Email Address should follow the standard format (e.g., name@domain.com).
Some examples of validation rules are:
Google Sheets

Microsoft Excel

Python
import pandas as pd
from datetime import datetime
# Sample data
data = {
'Name': ['John Doe', 'Jane Smith', 'Alice Johnson'],
'DOB': ['1990-01-15', '2094-05-10', '1985-07-23']
}
df = pd.DataFrame(data)
# Convert 'DOB' to datetime and flag future dates
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
df['Valid DOB'] = df['DOB'] <= pd.Timestamp(datetime.today())
print(df)
R
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
  Name = c("John Doe", "Jane Smith", "Alice Johnson"),
  DOB = c("1990-01-15", "2094-05-10", "1985-07-23")
)
# Convert 'DOB' to Date and flag future dates
data <- data %>%
  mutate(DOB = ymd(DOB),
         `Valid DOB` = DOB <= today())
print(data)
Results:
| Name | DOB | Valid DOB |
|---|---|---|
| John Doe | 1990-01-15 | TRUE |
| Jane Smith | 2094-05-10 | FALSE |
| Alice Johnson | 1985-07-23 | TRUE |
2. Manual Review
For key variables, especially in small datasets, a manual review of entries is often necessary. This method involves going through the dataset by hand to identify inconsistencies, missing values, or obvious mistakes. While time-consuming, manual review is highly effective in catching subtle errors that automated tools might miss, such as transposed digits (e.g., entering "1987" instead of "1897" for a birth year).
3. Automated Checks
Automated checks leverage software tools to automatically identify and flag common data entry errors. These checks can be programmed to detect outliers, incorrect formats, missing values, or other anomalies. Examples of automated checks include:
- Range Checks: Ensuring all values fall within predefined limits.
- Consistency Checks: Verifying that related fields make sense together (e.g., the “end date” of an event cannot be earlier than the “start date”).
- Format Validation: Checking that data entries adhere to a specific format (e.g., email addresses must include "@" and a domain).
- Spelling and Variations Detection: Automated tools can identify spelling errors and variations in text data (e.g., “USA,” “United States,” “U.S.”). This ensures consistency in categorical data, which can otherwise lead to inaccurate groupings or analysis. Some tools can flag these inconsistencies or automatically correct them to a predefined standard.
Automated tools like Apps Script in Google Sheets, Macros in Microsoft Excel, Python, and R allow you to perform these checks efficiently, especially for larger datasets. Python and R, in particular, offer robust libraries for data validation and error correction.
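As a minimal illustration of a range check and a format validation in Python (the sample entries are hypothetical, and the email regex below is a simplified stand-in, not a complete validator):

```python
import re
import pandas as pd

# Hypothetical sample entries for illustration
df = pd.DataFrame({
    'Age': [34, 150, 27],
    'Email': ['name@domain.com', 'not-an-email', 'user@site.org']
})

# Range check: ages must fall within 0-120
df['Valid Age'] = df['Age'].between(0, 120)

# Format validation: entry must look like name@domain.tld
email_pattern = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
df['Valid Email'] = df['Email'].map(lambda s: bool(email_pattern.match(s)))

print(df)
```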

