Removing Duplicates

Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:

Exact Match: Finding rows that are completely identical across all columns.
Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).

Let's use this dataset as an example:

Employee Records Table

ID	Name	Age	Salary	City
101	Alice	25	55000	NY
102	Bob	29	48000	LA
103	Charlie	29	55000	NY
104	David	33	62000	SF
102	Bob	29	48000	LA
106	Alice	27	59000	LA
101	Alice	25	55000	NY

In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.

Google Sheets

Microsoft Excel

Python

import pandas as pd

# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alicia'],
        'Age': [25, 29, 29, 33, 29, 27, 25],
        'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
        'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}

# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')

# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')

# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')

# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')

R

library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 102, 106, 101),
  Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
  Age = c(25, 29, 29, 33, 29, 27, 25),
  Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
  City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)

# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")

# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)

Result:

Exact Match

Employee Information Table

ID	Name	Age	Salary	City
101	Alice	25	55000	NY
102	Bob	29	48000	LA
103	Charlie	29	55000	NY
104	David	33	62000	SF
106	Alice	27	59000	LA

Partial Match (ID)

Employee Data Table

ID	Name	Age	Salary	City
101	Alice	25	55000	NY
102	Bob	29	48000	LA
103	Charlie	29	55000	NY
104	David	33	62000	SF
106	Alice	27	59000	LA

Partial Match (Name)

Employee Information Table

ID	Name	Age	Salary	City
101	Alice	25	55000	NY
102	Bob	29	48000	LA
103	Charlie	29	55000	NY
104	David	33	62000	SF

Partial Match (ID and Name)

Employee Data Table

ID	Name	Age	Salary	City
101	Alice	25	55000	NY
102	Bob	29	48000	LA
103	Charlie	29	55000	NY
104	David	33	62000	SF
106	Alice	27	59000	LA

As shown above, in this particular case, using a partial match approach based solely on "Name" yields a different result compared to other methods. This is because we had two distinct observations that share only one common attribute (i.e., "Name"). In other cases, different outcomes may occur. Therefore, it is crucial to carefully select the appropriate key identifiers when using a partial match approach.