Outlier Detection and Handling

Outliers are data points that deviate significantly from the other observations in a dataset. They can result from errors or reflect true anomalies, and handling them appropriately is essential for accurate analysis. A typical workflow for detecting and handling outliers has three steps:

Visual Inspection: Using plots like box plots or scatter plots to identify outliers visually.
Statistical Methods: Using standard deviations, interquartile range (IQR), or z-scores to detect outliers.
Decision on Handling: Decide whether to keep, remove, or transform outliers based on their impact on the analysis.

Let's use this annual income dataset as an example:

Sample Income Data by Person ID
Person ID	Income ($)
1	80,000
2	95,000
3	115,000
4	75,000
5	110,000
6	85,000
7	100,000
8	199,000
9	105,000
10	90,000

Visual Inspection

Python

import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}

df = pd.DataFrame(data)

# Box Plot
plt.figure(figsize=(8, 4))
plt.boxplot(df['Income'], vert=False, showfliers=True)
plt.title('Box Plot of Incomes')
plt.xlabel('Income ($)')
plt.show()

# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(df['Person ID'], df['Income'])
plt.title('Scatter Plot of Incomes')
plt.xlabel('Person ID')
plt.ylabel('Income ($)')
plt.show()

R

library(tidyverse)
# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)

# Box Plot
ggplot(data, aes(y = Income)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
  ggtitle("Box Plot of Incomes") +
  ylab("Income ($)")

# Scatter Plot
ggplot(data, aes(x = `Person ID`, y = Income)) +
  geom_point() +
  ggtitle("Scatter Plot of Incomes") +
  xlab("Person ID") +
  ylab("Income ($)")

A box plot of incomes: the box spans from about $85k to $110k, with whiskers on either side and a single outlier point at about $199k. A scatter plot of income versus person ID: most incomes cluster between $75k and $115k, with one person earning $199k.

Analysis:

Box Plot:

The box plot shows the distribution of incomes, where the income of $199,000 appears as an outlier above the upper whisker.

Scatter Plot:

The scatter plot shows Person 8's income sitting well above all the others.
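
To see why the box plot draws that one income as a separate point, the sketch below reproduces the arithmetic behind the whiskers. It assumes the default matplotlib settings used above, where whiskers reach to the most extreme data points within 1.5 times the interquartile range of the box; the variable names are illustrative only.

```python
import numpy as np

incomes = np.array([80000, 95000, 115000, 75000, 110000,
                    85000, 100000, 199000, 105000, 90000])

# Box edges: first and third quartiles (linear interpolation, NumPy's default)
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = incomes[(incomes >= lower_fence) & (incomes <= upper_fence)]
whisker_low, whisker_high = int(inside.min()), int(inside.max())

# Points outside the fences are drawn individually as outliers
fliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print(f"Box: {q1:,.0f} to {q3:,.0f}")
print(f"Fences: {lower_fence:,.0f} to {upper_fence:,.0f}")
print(f"Whiskers: {whisker_low:,} to {whisker_high:,}")
print(f"Outlier points: {fliers.tolist()}")
```

The upper whisker ends at the largest income inside the upper fence, and Person 8's income is the only value beyond it.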

Statistical Methods

Python

import pandas as pd

# Sample data
data = {
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}

df = pd.DataFrame(data)

# Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1                   

# Calculate bounds
lower_bound = Q1 - 1.5 * IQR    
upper_bound = Q3 + 1.5 * IQR    

# Identify outliers (values outside either bound)
df['Outlier'] = df['Income'].apply(
    lambda x: 'Outlier' if x < lower_bound or x > upper_bound else 'Normal'
)

print(df)

R

library(tidyverse)

# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)

# Calculate Q1 and Q3
Q1 <- quantile(data$Income, 0.25)
Q3 <- quantile(data$Income, 0.75)
iqr <- Q3 - Q1  # lower-case name avoids masking R's built-in IQR() function

# Calculate bounds
lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

# Identify outliers (values outside either bound)
data <- data %>%
  mutate(Outlier = ifelse(Income < lower_bound | Income > upper_bound, "Outlier", "Normal"))

print(data)

Results:

Income Data with Outlier Detection
Person ID	Income ($)	Outlier
1	80,000	Normal
2	95,000	Normal
3	115,000	Normal
4	75,000	Normal
5	110,000	Normal
6	85,000	Normal
7	100,000	Normal
8	199,000	Outlier
9	105,000	Normal
10	90,000	Normal
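
The list of statistical methods also mentions z-scores. As a minimal sketch of that alternative, the code below standardizes each income and flags values whose absolute z-score exceeds a cutoff. The cutoff of 2.5 and the column names `Z-Score` and `Z Outlier` are assumptions made for this illustration, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
})

# z-score: distance from the mean in units of the sample standard deviation
mean_income = df['Income'].mean()
std_income = df['Income'].std()  # pandas uses the sample SD (ddof=1) by default
df['Z-Score'] = (df['Income'] - mean_income) / std_income

# Assumed cutoff for this sketch; 3 is also widely used
threshold = 2.5
df['Z Outlier'] = df['Z-Score'].abs() > threshold

print(df.round(2))
```

With only ten observations, Person 8's z-score is about 2.65, so the frequently quoted cutoff of 3 would not flag it. The outlier itself inflates the standard deviation, which is one reason the IQR method is often preferred for small samples.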

Decision on Handling Outliers

Let's assess the impact of the identified outlier on the overall dataset.

Python

# Mean and Standard Deviation with Outlier
mean_with_outlier = df['Income'].mean()
std_with_outlier = df['Income'].std()

# Mean and Standard Deviation without Outlier
df_clean = df[df['Outlier'] == 'Normal']
mean_without_outlier = df_clean['Income'].mean()
std_without_outlier = df_clean['Income'].std()

print("Income with Outlier")
print(f"Mean: ${mean_with_outlier:,.2f}")
print(f"SD: ${std_with_outlier:,.2f}")
print("Income without Outlier")
print(f"Mean: ${mean_without_outlier:,.2f}")
print(f"SD: ${std_without_outlier:,.2f}")

R

# Mean and standard deviation with the outlier
# (uses the `data` tibble and its Outlier column from the previous step)
mean_with_outlier <- mean(data$Income)
std_with_outlier <- sd(data$Income)

# Filter out the outliers
data_clean <- data %>% filter(Outlier == "Normal")

# Mean and standard deviation without the outlier
mean_without_outlier <- mean(data_clean$Income)
std_without_outlier <- sd(data_clean$Income)

# Format a number as dollars with thousands separators
dollars <- function(x) paste0("$", formatC(x, format = "f", digits = 2, big.mark = ","))

# Print results
cat("Income with Outlier\n")
cat(sprintf("Mean: %s\n", dollars(mean_with_outlier)))
cat(sprintf("SD: %s\n", dollars(std_with_outlier)))

cat("Income without Outlier\n")
cat(sprintf("Mean: %s\n", dollars(mean_without_outlier)))
cat(sprintf("SD: %s\n", dollars(std_without_outlier)))

Results:

Income with Outlier
Mean: $105,400.00
SD: $35,330.82
Income without Outlier
Mean: $95,000.00
SD: $13,693.06
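
The same figures can be put in relative terms. The sketch below computes the percentage change in the mean and the standard deviation caused by the single flagged value. The median is not part of the original example; it is included here only as a point of comparison, and the helper `pct_change` is a hypothetical name.

```python
import pandas as pd

incomes = pd.Series([80000, 95000, 115000, 75000, 110000,
                     85000, 100000, 199000, 105000, 90000])
without = incomes[incomes != 199000]  # drop the one flagged value

def pct_change(with_value, without_value):
    """Percentage change caused by including the outlier."""
    return (with_value - without_value) / without_value * 100

mean_change = pct_change(incomes.mean(), without.mean())
sd_change = pct_change(incomes.std(), without.std())
median_change = pct_change(incomes.median(), without.median())  # added for comparison

print(f"Mean:   {mean_change:+.1f}%")
print(f"SD:     {sd_change:+.1f}%")
print(f"Median: {median_change:+.1f}%")
```

One value out of ten raises the mean by roughly 11% and more than doubles the standard deviation, while the median moves by less than 3%.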

As you can see, the outlier increases the mean by $10,400 and the standard deviation by $21,637.76. Based on this finding, you have three options:

Keep the Outlier
- This is an option if you can verify the income is accurate and the person doesn't represent a different income group (e.g., executive level instead of regular level).
- This is also a good option if this data point is important for your purpose (e.g., showing income distribution and inequality).
Remove the Outlier
- This is an option if the outlier skews the data and is not representative.
- This is also appropriate for analyses focusing on the typical income range.
Transform the Data
- Applying a logarithmic transformation can help reduce skewness in your data distribution.
- This is another option that allows inclusion of the outlier while minimizing its impact.
- Transformed data might be more difficult to interpret.
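
As a rough illustration of the transformation option, the sketch below applies a natural-log transform to the incomes and compares skewness before and after. The column name `Log Income` and the use of pandas' `skew()` as the skewness measure are choices made for this example rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
})

# Natural-log transform (valid here because every income is positive)
df['Log Income'] = np.log(df['Income'])

# Compare skewness before and after the transform
raw_skew = df['Income'].skew()
log_skew = df['Log Income'].skew()
print(f"Skewness of raw incomes: {raw_skew:.2f}")
print(f"Skewness of log incomes: {log_skew:.2f}")

# Summaries computed on the log scale must be back-transformed to dollars;
# exp(mean of logs) is the geometric mean, not the arithmetic mean
geometric_mean = np.exp(df['Log Income'].mean())
print(f"Geometric mean income: ${geometric_mean:,.2f}")
```

The transform keeps all ten observations and pulls Person 8 closer to the rest, but statistics on the log scale no longer read directly as dollar amounts, which is the interpretability cost noted above.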

In the end, the decision on how to handle outliers is yours, as long as it rests on a thoughtful and informed rationale. Outliers can provide valuable insights or skew your data, so whether you keep, remove, or transform them should depend on the nature of your data, the objectives of your analysis, and how the outliers affect the accuracy and interpretation of your findings. Your approach should align with your goals so that the conclusions you draw are both robust and credible.