Data cleaning is one of the first steps in the data science pipeline. In practical applications, we do not always get to collect data in a pristine form, so the resulting dataframe can contain anomalies: missing cells, cells with nonsensical values, and so on. The pandas module offers several methods for dealing with such scenarios.
Dealing with Whitespace in Strings
The string method .strip() will remove extraneous whitespace from before and after a string. The equivalent in pandas is the method .str.strip().
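As a minimal sketch of the two methods side by side, the snippet below strips a plain Python string and then a small pandas Series. The names `raw` and `toy` and their values are made up for illustration and are not part of the survey data used in this section.

```python
import pandas as pd

# Plain Python: strip() removes leading and trailing whitespace.
raw = "  football \n"
cleaned = raw.strip()

# pandas: the .str accessor applies the same operation to every element.
# These values are hypothetical, chosen to differ only in whitespace.
toy = pd.Series(["football", " football", "football "])
stripped = toy.str.strip()

print(repr(cleaned))       # 'football'
print(toy.nunique())       # 3 distinct values before stripping
print(stripped.nunique())  # 1 distinct value after stripping
```

Values that differ only by surrounding whitespace are counted as distinct until they are stripped.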
    sport_series = df.loc[:, 'sport'] \
        .dropna() \
        .str.lower()
    contains_football_mask = sport_series.str.contains('football')
    sport_series[contains_football_mask].value_counts()
    football    40
    football    4
    basketball, football, baseball    2
    soccer (the real football)    1
    football or wrestling    1
    college football    1
    if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff    1
    football (mainstream) or something out there like rock climbing    1
    football/basketball    1
    Name: sport, dtype: int64
Notice above that the first and second entries both appear to be 'football', but they are not the same string in Python: one of them carries extra whitespace. The .str.strip() method will help us out here.
    # using str.strip() to remove whitespace
    sport_series.loc[contains_football_mask] \
        .str.strip() \
        .value_counts()
    football    44
    basketball, football, baseball    2
    soccer (the real football)    1
    football or wrestling    1
    college football    1
    if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff    1
    football (mainstream) or something out there like rock climbing    1
    football/basketball    1
    Name: sport, dtype: int64
If you need to be more careful about how you strip whitespace, the methods .str.lstrip() and .str.rstrip() are available to strip whitespace only on the left or only on the right, respectively.
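The one-sided variants can be sketched on a small, made-up Series. The name `padded` and its values are illustrative assumptions, not taken from the survey data.

```python
import pandas as pd

# Hypothetical values padded with spaces, a tab, and a newline.
padded = pd.Series(["  football  ", "\tsoccer\n"])

# lstrip() removes leading whitespace only; trailing whitespace is kept.
left_only = padded.str.lstrip()

# rstrip() removes trailing whitespace only; leading whitespace is kept.
right_only = padded.str.rstrip()

print(left_only.tolist())   # ['football  ', 'soccer\n']
print(right_only.tolist())  # ['  football', '\tsoccer']
```

This can be useful when whitespace on one side is meaningful, for example when leading indentation should be preserved.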