Removing Duplicates

Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:

  1. Exact Match: Finding rows that are completely identical across all columns.
  2. Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).

Let's use this dataset as an example: 

Employee Records Table
ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
102 Bob 29 48000 LA
106 Alice 27 59000 LA
101 Alice 25 55000 NY


In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.

 

Google Sheets

Google Sheets

Microsoft Excel

Microsoft Excel

 

Python
import pandas as pd

# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alicia'],
        'Age': [25, 29, 29, 33, 29, 27, 25],
        'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
        'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}

# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')

# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')

# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')

# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')

 

R
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 102, 106, 101),
  Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
  Age = c(25, 29, 29, 33, 29, 27, 25),
  Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
  City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)

# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")

# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)

 

Result:

Exact Match

Employee Information Table
ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA

 

Partial Match (ID)

Employee Data Table
ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA

Partial Match (Name)

Employee Information Table
ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF

Partial Match (ID and Name)

Employee Data Table
ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA


As shown above, in this particular case, using a partial match approach based solely on "Name" yields a different result compared to other methods. This is because we had two distinct observations that share only one common attribute (i.e., "Name"). In other cases, different outcomes may occur. Therefore, it is crucial to carefully select the appropriate key identifiers when using a partial match approach.