Data Cleaning
Example with 'hangout'
Let's see if we can figure out the average for responses to the question: "What is the optimal number of people to hang out with?"
# drop duplicates df_drop = df.drop_duplicates(subset = df.columns[1:]) # drop missing values hangout_series = df_drop.loc[:, 'hangout'] # check the values hangout_series.value_counts()
3 50 4 42 5 23 6 15 2 15 7 7 4-5 6 1 6 5-6 2 8 2 10 2 4-6 2 100 2 3-7, more is too crowded 1 1-6 1 3-4 1 3 others (4 total) 1 3 and me for a game of euchre 1 one 1 1-2 1 People with similar interests 1 20 1 6-8 1 3 (4 including me) 1 depends, 4-6 is most comfortable though 1 Infinite 1 9 1 15 1 4 (including me) 1 5-7 1 6-10 1 Five 1 11? I have zero preference on this one. 1 5 or 6 1 Unsure 1 infinity 1 five 1 Name: hangout, dtype: int64
# address range responses, like 4-5, by splitting on - and taking the first value hangout_series_clean = hangout_series.str.split('-').str[0] # check the values hangout_series_clean.value_counts()
3 52 4 50 5 26 6 17 2 15 1 8 7 7 10 2 100 2 8 2 infinity 1 one 1 3 and me for a game of euchre 1 3 (4 including me) 1 15 1 20 1 3 others (4 total) 1 9 1 Five 1 4 (including me) 1 5 or 6 1 People with similar interests 1 11? I have zero preference on this one. 1 depends, 4 1 Infinite 1 Unsure 1 five 1 Name: hangout, dtype: int64
# address mixed number response by splitting on ' ' and taking the first value of the resulting list hangout_series_clean = hangout_series_clean.str.split(' ').str[0] # check the values hangout_series_clean.value_counts()
3 55 4 51 5 27 6 17 2 15 1 8 7 7 100 2 10 2 8 2 one 1 Five 1 20 1 15 1 depends, 1 Infinite 1 9 1 People 1 Unsure 1 11? 1 five 1 infinity 1 Name: hangout, dtype: int64
# address 'one', 'five', '11?', and 'infinite' using replace hangout_series_clean = hangout_series_clean.str.lower() hangout_series_clean = hangout_series_clean.str.replace('one', '1') hangout_series_clean = hangout_series_clean.str.replace('five', '5') hangout_series_clean = hangout_series_clean.str.replace('?', '') hangout_series_clean = hangout_series_clean.str.replace('infinite', 'infinity') # check the values hangout_series_clean.value_counts()
3 55 4 51 5 29 6 17 2 15 1 9 7 7 10 2 infinity 2 8 2 100 2 9 1 unsure 1 depends, 1 20 1 15 1 11 1 people 1 Name: hangout, dtype: int64
# convert to numeric using coerce argument hangout_series_clean = pd.to_numeric(hangout_series_clean, errors = 'coerce') # check the values hangout_series_clean.value_counts()
3.0 55 4.0 51 5.0 29 6.0 17 2.0 15 1.0 9 7.0 7 inf 2 100.0 2 10.0 2 8.0 2 15.0 1 11.0 1 20.0 1 9.0 1 Name: hangout, dtype: int64
Note that infinite can be represented by some numeric data types, but including it does odd things to our mean and standard deviation statistics.
# infinite does odd things to our descriptive statistics hangout_series_clean.describe()
count 195.0 mean inf std NaN min 1.0 25% 3.0 50% 4.0 75% 5.0 max inf Name: hangout, dtype: float64
# use a mask to remove the infinite values which are represented as np.inf import numpy as np # need this import to address the infinite value hangout_series_clean = hangout_series_clean.loc[hangout_series_clean != np.inf] # check the result hangout_series_clean.describe()
count 193.000000 mean 5.155440 std 9.970095 min 1.000000 25% 3.000000 50% 4.000000 75% 5.000000 max 100.000000 Name: hangout, dtype: float64
# histogram of somewhat cleaned hangout preference sns.histplot(hangout_series_clean) plt.show()
# histogram of somewhat cleaned hangout preference sns.histplot(hangout_series_clean.loc[hangout_series_clean < 25]) plt.show()