Removing Duplicates
Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:
- Exact Match: Finding rows that are completely identical across all columns.
- Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).
Let's use this dataset as an example:
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 102 | Bob | 29 | 48000 | LA |
| 106 | Alice | 27 | 59000 | LA |
| 101 | Alice | 25 | 55000 | NY |
In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.
Google Sheets

Microsoft Excel

Python
import pandas as pd
# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alicia'],
'Age': [25, 29, 29, 33, 29, 27, 25],
'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}
# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')
# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')
# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')
R
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 102, 106, 101),
Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
Age = c(25, 29, 29, 33, 29, 27, 25),
Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)
# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")
# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)
Result:
Exact Match
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (ID)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
Partial Match (Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
Partial Match (ID and Name)
| ID | Name | Age | Salary | City |
|---|---|---|---|---|
| 101 | Alice | 25 | 55000 | NY |
| 102 | Bob | 29 | 48000 | LA |
| 103 | Charlie | 29 | 55000 | NY |
| 104 | David | 33 | 62000 | SF |
| 106 | Alice | 27 | 59000 | LA |
As shown above, in this particular case, using a partial match approach based solely on "Name" yields a different result compared to other methods. This is because we had two distinct observations that share only one common attribute (i.e., "Name"). In other cases, different outcomes may occur. Therefore, it is crucial to carefully select the appropriate key identifiers when using a partial match approach.