Data cleaning is one of the initial steps in the data science pipeline. In practical applications, data does not always arrive in pristine form, and the associated dataframe can therefore contain anomalies: missing cells, cells with nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.
Duplicate Entries
Duplicate entries are erroneous entries in a dataset that contain the exact same (or nearly the same) information. A good and common example of erroneous duplicate entries occurs in online surveys where a respondent accidentally submits the same survey multiple times. Duplicate entries can occur in all kinds of data for all kinds of reasons and can be difficult to detect if the dataset also contains legitimate entries that have identical values.
Let's take a look at an example. We have a few questions in the survey where we would not expect to find two separate entries with the exact same value.
# value_counts for 'quote', where the count is greater than 1
df.loc[:, 'quote'] \
  .value_counts() \
  .loc[df.loc[:, 'quote'] \
  .value_counts() > 1]
Dream bigger                                                                      4
You miss 100% of the shots you don't take                                         2
“Every problem is a gift—without problems we would not grow.” – Anthony Robbins   2
I think, therefore I am.                                                          2
Let's never be the kind of people who do things lukewarmly. - David McCollough    2
Name: quote, dtype: int64
A few of these might be legitimate repetitions, but it seems very unlikely that all of them are. Instead, we might suspect that a few survey respondents simply clicked the submit button more than once.
pandas offers some tools that allow us to detect and deal with duplicates in a more sophisticated way than just looking at value counts.
Finding Duplicates
The .duplicated() method can be applied to an entire dataframe, a series, or an index. It looks for exact duplicates and returns a boolean object where duplicates are marked True. By default, the first occurrence is not marked as a duplicate, and all subsequent occurrences are. This behavior can be changed with the optional keep= argument.
# assign the series to a variable so the code is a bit easier to read
quote_series = df.loc[:, 'quote']

# here are all the rows that contain 'Dream bigger'
quote_series.loc[quote_series.values == 'Dream bigger']
144    Dream bigger
148    Dream bigger
150    Dream bigger
153    Dream bigger
Name: quote, dtype: object
# create a duplicate mask (rows with duplicates marked as True)
duplicate_mask = quote_series.duplicated()

# create a view of only the flagged duplicates
# missing values can also be flagged as duplicates of each other, so we drop
# them and keep only the duplicated quotes
quote_series.loc[duplicate_mask].dropna()
80     Let's never be the kind of people who do thing...
82     I think, therefore I am.
148    Dream bigger
150    Dream bigger
153    Dream bigger
176    You miss 100% of the shots you don't take
198    “Every problem is a gift—without problems we w...
Name: quote, dtype: object
Notice that the first instance of 'Dream bigger', with the index value of 144, was not marked as a duplicate. .duplicated() only flags repeats after the first occurrence.
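To make the keep= behavior concrete, here is a sketch on a small throwaway series (not the survey data):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c'])

# default (keep='first'): only repeats after the first occurrence are flagged
print(s.duplicated().tolist())              # [False, False, True, True, False]

# keep='last': flag every occurrence except the last
print(s.duplicated(keep='last').tolist())   # [True, False, True, False, False]

# keep=False: flag every member of a duplicated group
print(s.duplicated(keep=False).tolist())    # [True, False, True, True, False]
```

The keep=False form is handy when you want to inspect every row involved in a duplication, including the first occurrence.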
Dropping Duplicate Entries
If we can confirm we have erroneous duplicates, we will usually want to remove them before we do data analysis. The .drop_duplicates() method allows you to remove duplicates from dataframes, series, or an index. By default, it keeps the first occurrence and drops all the rest. This behavior can be changed to keep only the last occurrence or to drop all duplicates.
# drop duplicates when applied to a series
print(f'Series Length before drop: {len(quote_series)}')
print(f'Series Length after drop: {len(quote_series.drop_duplicates())}')
Series Length before drop: 204
Series Length after drop: 183
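The keep= options mentioned above work the same way for .drop_duplicates(); as an illustration on a small made-up series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# keep='first' (default): the second 'a' is dropped
print(s.drop_duplicates().tolist())             # ['a', 'b', 'c']

# keep='last': the first 'a' is dropped instead
print(s.drop_duplicates(keep='last').tolist())  # ['b', 'a', 'c']

# keep=False: both 'a' entries are dropped
print(s.drop_duplicates(keep=False).tolist())   # ['b', 'c']
```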
If we use .drop_duplicates() with a dataframe, there are a few more useful arguments that can be applied to change the method's behavior.
# drop duplicates applied to a dataframe
print(f'Dataframe Length before drop: {len(df)}')
print(f'Dataframe Length after drop: {len(df.drop_duplicates())}')
Dataframe Length before drop: 204
Dataframe Length after drop: 204
In the case of our survey data, there is a timestamp column that registers the time the survey was submitted. Since duplicate surveys were submitted by repeatedly hitting the submit button, the timestamps are not exactly the same, and thus these rows are not dropped. The subset= argument can be used to specify which column labels to consider when testing for duplicates. In the code below, we pass all of the column labels after the first, which is the timestamp.
# drop duplicates applied to a dataframe, ignoring the timestamp column
l_before = len(df)
l_after = len(df.drop_duplicates(subset=df.columns[1:]))

print(f'Dataframe length before drop: {l_before}')
print(f'Dataframe length after drop: {l_after}')
Dataframe length before drop: 204
Dataframe length after drop: 199
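The same pattern can be sketched on a tiny made-up dataframe that mimics the survey situation, where identical answers arrive with different timestamps (the column names here are hypothetical, not from the survey):

```python
import pandas as pd

# toy dataframe: two identical answers submitted at different times
toy = pd.DataFrame({
    'Timestamp': ['10:00', '10:01', '10:30'],
    'answer':    ['yes',   'yes',   'no'],
})

# full-row comparison finds no duplicates, because the timestamps differ
print(len(toy.drop_duplicates()))                        # 3

# ignoring the first column, row 1 is a duplicate of row 0 and is dropped
print(len(toy.drop_duplicates(subset=toy.columns[1:])))  # 2
```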
# dream bigger without the drop
df.loc[df.loc[:, 'quote'] == "Dream bigger"]
|     | Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | likepizza_w_replace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | 8/17/2020 15:16:29 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
| 148 | 8/17/2020 19:35:03 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
| 150 | 8/17/2020 21:56:47 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
| 153 | 8/21/2020 16:30:37 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |
# dream bigger with the drop
df.drop_duplicates(subset=df.columns[1:]) \
  .loc[df.loc[:, 'quote'] == "Dream bigger"]
|     | Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | likepizza_w_replace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | 8/17/2020 15:16:29 | Taylor Swift | 5.6ft | Paris | Warren Buffett | Iceland | 3.0 | no opinion | Basketball | hors d'oeuvre | 3 | More | The year when I'm 30 | Dream bigger | 3.0 |