Dealing with Whitespace in Strings

When people enter data by hand, they often insert extraneous whitespace (spaces, tabs, or newlines). To group or count such data properly, we need to 'strip' away the extra whitespace. If we don't, Python will treat the string 'football' as different from the string ' football' or 'football '.

The string method .strip() removes whitespace from the beginning and end of a string; whitespace in the middle is left alone. The equivalent in Pandas is the method .str.strip(), which applies the same operation to every element of a Series.
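To make the effect concrete, here is a minimal sketch using plain Python strings. The list of answers is made up for illustration and is not taken from the survey data used in this section.

```python
from collections import Counter

# Hypothetical answers; the padding is invented for illustration.
answers = ["football", " football", "football\t", "\nfootball  "]

# Without stripping, each padded variant counts as its own value.
raw_counts = Counter(answers)

# str.strip() returns a new string with leading and trailing
# whitespace (spaces, tabs, newlines) removed.
clean_counts = Counter(answer.strip() for answer in answers)

# Whitespace inside the string is not touched.
inner = "  college  football ".strip()

print(len(raw_counts))   # number of distinct raw values
print(clean_counts)
print(repr(inner))
```

Because strings are immutable, .strip() never changes the original value; the result has to be assigned or used directly, as in the generator expression above.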

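# lowercase the non-missing answers in the 'sport' column
sport_series = df.loc[:, 'sport'] \
                 .dropna() \
                 .str.lower()

# boolean mask: True where the answer mentions football
contains_football_mask = sport_series.str.contains('football')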

sport_series[contains_football_mask].value_counts()
football                                                                                                                                40
football                                                                                                                                 4
basketball, football, baseball                                                                                                           2
soccer (the real football)                                                                                                               1
football or wrestling                                                                                                                    1
college football                                                                                                                         1
if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff     1
football (mainstream) or something out there like rock climbing                                                                          1
football/basketball                                                                                                                      1
Name: sport, dtype: int64

Notice that the first two entries both appear to be 'football', but Python treats them as different strings because one of them carries extra whitespace that is invisible in the printed output. .str.strip() will help us out here.

# using str.strip() to remove whitespace
sport_series.loc[contains_football_mask] \
            .str.strip() \
            .value_counts()
football                                                                                                                                44
basketball, football, baseball                                                                                                           2
soccer (the real football)                                                                                                               1
football or wrestling                                                                                                                    1
college football                                                                                                                         1
if video games count, super smash bros. if not, football. sometimes baseball when they're not playing the game and doing wacky stuff     1
football (mainstream) or something out there like rock climbing                                                                          1
football/basketball                                                                                                                      1
Name: sport, dtype: int64

If you need finer control over which whitespace is removed, the methods .str.lstrip() and .str.rstrip() strip whitespace only on the left or only on the right, respectively.
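The difference between the three methods can be sketched on a small Series. The values below are hypothetical and chosen only so that the padding on each side is visible.

```python
import pandas as pd

# Hypothetical values with padding on both sides.
padded = pd.Series(["  football  ", "\tsoccer\n"])

left_only = padded.str.lstrip()    # removes leading whitespace only
right_only = padded.str.rstrip()   # removes trailing whitespace only
both = padded.str.strip()          # removes both

# tolist() shows the quotes and escapes, so the remaining padding is visible.
print(left_only.tolist())
print(right_only.tolist())
print(both.tolist())
```

Printing the values as a list rather than as a Series is a deliberate choice here: the default Series display hides trailing whitespace, which is the same reason the two 'football' rows looked identical in the output earlier.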