Data Cleaning
Dealing with Whitespace in Strings
When humans are entering data, they will often insert extraneous
whitespace (things like spaces, tabs, or returns). To properly group or
count such data, we need to 'strip' away the extra whitespace. If we
don't, Python will treat the string 'football' as different from the
string ' football' or 'football '.
The string method .strip() will remove extraneous whitespace from
before and after a string. The equivalent in Pandas is the method
.str.strip().
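As a minimal sketch of why this matters for counting, the example below uses plain Python strings rather than a DataFrame. The entries in raw_entries are made up for illustration and are not taken from the survey data.

```python
from collections import Counter

# Hypothetical entries, as a person might type them into a form.
raw_entries = ["football", " football", "football ", "\tfootball\n"]

# Without stripping, each variant is counted as a separate value.
raw_counts = Counter(raw_entries)

# str.strip() removes leading and trailing whitespace
# (spaces, tabs, and newlines) from each string.
clean_counts = Counter(entry.strip() for entry in raw_entries)

print(len(raw_counts))   # number of distinct values before stripping
print(clean_counts)      # counts after stripping
```

Before stripping, every variant is its own key. After stripping, they all collapse into the single value 'football'.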
    sport_series = df.loc[:, 'sport'] \
        .dropna() \
        .str.lower()
    contains_football_mask = sport_series.str.contains('football')
    sport_series[contains_football_mask].value_counts()
    football    40
    football    4
    basketball, football, baseball    2
    soccer (the real football)    1
    football or wrestling    1
    college football    1
    if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff    1
    football (mainstream) or something out there like rock climbing    1
    football/basketball    1
    Name: sport, dtype: int64
Notice above that the first and second entries both appear to be
'football', but they are not the same in Python. .str.strip() will
help us out here.
    # using str.strip() to remove whitespace
    sport_series.loc[contains_football_mask] \
        .str.strip() \
        .value_counts()
    football    44
    basketball, football, baseball    2
    soccer (the real football)    1
    football or wrestling    1
    college football    1
    if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff    1
    football (mainstream) or something out there like rock climbing    1
    football/basketball    1
    Name: sport, dtype: int64
If you need to be a bit more careful about how you are stripping
whitespace, the methods str.lstrip() and str.rstrip() are available
to strip whitespace only on the left or only on the right,
respectively.
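The difference between the three methods can be sketched on a small Series. The data here is hypothetical and only meant to show which side each method touches. It assumes pandas is installed and imported as pd.

```python
import pandas as pd

# Hypothetical sample values with whitespace on different sides.
sports = pd.Series(["  football", "football  ", "  football  "])

left_stripped = sports.str.lstrip()    # removes leading whitespace only
right_stripped = sports.str.rstrip()   # removes trailing whitespace only
both_stripped = sports.str.strip()     # removes leading and trailing whitespace

# repr() makes the remaining whitespace visible inside the quotes.
for name, series in [("lstrip", left_stripped),
                     ("rstrip", right_stripped),
                     ("strip", both_stripped)]:
    print(name, [repr(value) for value in series])
```

Only the fully stripped Series ends up with a single distinct value, so .str.strip() is usually the right choice before grouping or counting unless whitespace on one side is meaningful.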