Duplicate Entries

Duplicate entries are erroneous entries in a dataset that contain the exact same (or nearly the same) information. A common example occurs in online surveys, where a respondent accidentally submits the same survey multiple times. Duplicate entries can occur in all kinds of data for all kinds of reasons, and they can be difficult to detect when the dataset also contains legitimate entries with identical values.

Let's take a look at an example. We have a few questions in the survey where we would not expect to find two separate entries with the exact same value.

# value_counts for 'quote', where the count is greater than 1
df.loc[:, 'quote'] \
  .value_counts() \
  .loc[df.loc[:, 'quote'] \
  .value_counts() > 1]
Dream bigger                                                                          4
You miss 100% of the shots you don't take                                             2
“Every problem is a gift—without problems we would not grow.” – Anthony Robbins       2
I think, therefore I am.                                                              2
Let's never be the kind of people who do things lukewarmly. - David McCollough        2
Name: quote, dtype: int64

A few of these might be legitimate repetitions, but it seems very unlikely that all of them are. Instead, we might suspect that a few survey respondents simply clicked the submit button more than once.
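As an aside, the same filter can be written without calling value_counts twice by passing a callable to .loc; this is just an equivalent sketch of the code above, not the only way to write it.

# a callable indexer receives the counts series and returns a boolean mask
df.loc[:, 'quote'].value_counts().loc[lambda counts: counts > 1]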

pandas offers some tools that allow us to detect and deal with duplicates in a more sophisticated way than just looking at value counts.


Finding Duplicates

The .duplicated() method can be applied to an entire dataframe, a series, or an index. It looks for exact duplicates and returns a boolean object where duplicates are marked True. By default, the first occurrence is not marked as a duplicate, and all other occurrences are. This behavior can be changed with the optional keep argument.

# assign the series to a variable so the code is a bit easier to read
quote_series = df.loc[:, 'quote']

# here are all the rows that contain 'Dream bigger'
quote_series.loc[quote_series.values == 'Dream bigger']
144    Dream bigger
148    Dream bigger
150    Dream bigger
153    Dream bigger
Name: quote, dtype: object
# create a duplicate mask (rows with duplicates marked as True)
duplicate_mask = quote_series.duplicated()

# create a view of only duplicates
# the boolean mask selects only the rows flagged as duplicates;
# .dropna() also discards any duplicated missing values
quote_series.loc[duplicate_mask].dropna()
80     Let's never be the kind of people who do thing...
82                              I think, therefore I am.
148                                         Dream bigger
150                                         Dream bigger
153                                         Dream bigger
176            You miss 100% of the shots you don't take
198    “Every problem is a gift—without problems we w...
Name: quote, dtype: object

Notice that the first instance of 'Dream bigger', at index 144, was not marked as a duplicate: .duplicated() only flags repeats after the first occurrence.
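The keep argument accepts 'first' (the default, shown above), 'last', and False. Here is a quick sketch using the same quote_series; with keep=False, every occurrence of a repeated value is flagged, including the first.

# flag every occurrence of a repeated value, including the first
all_duplicates_mask = quote_series.duplicated(keep=False)

# now all four 'Dream bigger' rows (144, 148, 150 and 153) are included
quote_series.loc[all_duplicates_mask]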


Dropping Duplicate Entries

If we can confirm we have erroneous duplicates, we will usually want to remove them before doing any analysis. The .drop_duplicates() method removes duplicates from a dataframe, series, or index. By default, it keeps the first occurrence and drops all the rest; the keep argument can change this to keep only the last occurrence or to drop all duplicates.

# drop duplicates when applied to a series
print(f'Series Length before drop: {len(quote_series)}')
print(f'Series Length after drop: {len(quote_series.drop_duplicates())}')
Series Length before drop: 204
Series Length after drop: 183
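The keep argument works here as well: keep='last' retains the last occurrence instead of the first (same length, different index labels), and keep=False drops every row whose value is duplicated. A minimal sketch:

# keep the last occurrence of each quote instead of the first
quote_series.drop_duplicates(keep='last')

# drop all occurrences of any quote that appears more than once
quote_series.drop_duplicates(keep=False)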

If we use .drop_duplicates() with a dataframe, there are a few more useful arguments that can be applied to change the method's behavior.

# drop duplicates applied to a dataframe
print(f'Dataframe Length before drop: {len(df)}')
print(f'Dataframe Length after drop: {len(df.drop_duplicates())}')
Dataframe Length before drop: 204
Dataframe Length after drop: 204

In the case of our survey data, there is a timestamp column that registers the time the survey was submitted. Since duplicate surveys were submitted by repeatedly hitting the submit button, the timestamp is not exactly the same, and thus these rows are not dropped. The subset= argument can be used to specify column labels to consider when testing for duplicates. In the code below, we specify all of the column labels after the first, which is the timestamp.

# drop duplicates applied to a dataframe

l_before = len(df)
l_after  = len(df.drop_duplicates(subset = df.columns[1:]))

print(f'Dataframe length before drop: {l_before}')
print(f'Dataframe length after drop: {l_after}')
Dataframe length before drop: 204
Dataframe length after drop: 199
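As a sanity check, .duplicated() accepts the same subset= argument, so we can count the flagged rows directly; the result should match the five rows dropped above (204 - 199).

# count rows that are duplicates when the timestamp column is ignored
df.duplicated(subset=df.columns[1:]).sum()
5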
# dream bigger without the drop
df.loc[df.loc[:, 'quote'] == "Dream bigger"]

Timestamp musicartist height city 30min travel likepizza deepdish sport spell hangout talk year quote likepizza_w_replace
144 8/17/2020 15:16:29 Taylor Swift 5.6ft Paris Warren Buffett Iceland 3.0 no opinion Basketball hors d'oeuvre 3 More The year when I'm 30 Dream bigger 3.0
148 8/17/2020 19:35:03 Taylor Swift 5.6ft Paris Warren Buffett Iceland 3.0 no opinion Basketball hors d'oeuvre 3 More The year when I'm 30 Dream bigger 3.0
150 8/17/2020 21:56:47 Taylor Swift 5.6ft Paris Warren Buffett Iceland 3.0 no opinion Basketball hors d'oeuvre 3 More The year when I'm 30 Dream bigger 3.0
153 8/21/2020 16:30:37 Taylor Swift 5.6ft Paris Warren Buffett Iceland 3.0 no opinion Basketball hors d'oeuvre 3 More The year when I'm 30 Dream bigger 3.0

# dream bigger with the drop
df.drop_duplicates(subset = df.columns[1:]) \
  .loc[df.loc[:, 'quote'] == "Dream bigger"]

Timestamp musicartist height city 30min travel likepizza deepdish sport spell hangout talk year quote likepizza_w_replace
144 8/17/2020 15:16:29 Taylor Swift 5.6ft Paris Warren Buffett Iceland 3.0 no opinion Basketball hors d'oeuvre 3 More The year when I'm 30 Dream bigger 3.0
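Keep in mind that .drop_duplicates() returns a new dataframe rather than modifying df in place. To continue working with the cleaned data, assign the result to a variable; df_deduped below is just an illustrative name.

# keep the cleaned data; reset_index(drop=True) renumbers the rows from 0
df_deduped = df.drop_duplicates(subset=df.columns[1:]).reset_index(drop=True)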