Completion requirements
Data cleaning is one of the initial steps in the data science pipeline. In practical applications, we do not always need to collect data in a pristine form, and the associated dataframe can therefore contain potential anomalies. There can be missing cells, cells that have nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.
Example with 'hangout'
Let's see if we can figure out the average for responses to the question: "What is the optimal number of people to hang out with?"
# drop duplicates df_drop = df.drop_duplicates(subset = df.columns[1:]) # drop missing values hangout_series = df_drop.loc[:, 'hangout'] # check the values hangout_series.value_counts()
3 50 4 42 5 23 6 15 2 15 7 7 4-5 6 1 6 5-6 2 8 2 10 2 4-6 2 100 2 3-7, more is too crowded 1 1-6 1 3-4 1 3 others (4 total) 1 3 and me for a game of euchre 1 one 1 1-2 1 People with similar interests 1 20 1 6-8 1 3 (4 including me) 1 depends, 4-6 is most comfortable though 1 Infinite 1 9 1 15 1 4 (including me) 1 5-7 1 6-10 1 Five 1 11? I have zero preference on this one. 1 5 or 6 1 Unsure 1 infinity 1 five 1 Name: hangout, dtype: int64
# address range responses, like 4-5, by splitting on - and taking the first value hangout_series_clean = hangout_series.str.split('-').str[0] # check the values hangout_series_clean.value_counts()
3 52 4 50 5 26 6 17 2 15 1 8 7 7 10 2 100 2 8 2 infinity 1 one 1 3 and me for a game of euchre 1 3 (4 including me) 1 15 1 20 1 3 others (4 total) 1 9 1 Five 1 4 (including me) 1 5 or 6 1 People with similar interests 1 11? I have zero preference on this one. 1 depends, 4 1 Infinite 1 Unsure 1 five 1 Name: hangout, dtype: int64
# address mixed number response by splitting on ' ' and taking the first value of the resulting list hangout_series_clean = hangout_series_clean.str.split(' ').str[0] # check the values hangout_series_clean.value_counts()
3 55 4 51 5 27 6 17 2 15 1 8 7 7 100 2 10 2 8 2 one 1 Five 1 20 1 15 1 depends, 1 Infinite 1 9 1 People 1 Unsure 1 11? 1 five 1 infinity 1 Name: hangout, dtype: int64
# address 'one', 'five', '11?', and 'infinite' using replace hangout_series_clean = hangout_series_clean.str.lower() hangout_series_clean = hangout_series_clean.str.replace('one', '1') hangout_series_clean = hangout_series_clean.str.replace('five', '5') hangout_series_clean = hangout_series_clean.str.replace('?', '') hangout_series_clean = hangout_series_clean.str.replace('infinite', 'infinity') # check the values hangout_series_clean.value_counts()
3 55 4 51 5 29 6 17 2 15 1 9 7 7 10 2 infinity 2 8 2 100 2 9 1 unsure 1 depends, 1 20 1 15 1 11 1 people 1 Name: hangout, dtype: int64
# convert to numeric using coerce argument hangout_series_clean = pd.to_numeric(hangout_series_clean, errors = 'coerce') # check the values hangout_series_clean.value_counts()
3.0 55 4.0 51 5.0 29 6.0 17 2.0 15 1.0 9 7.0 7 inf 2 100.0 2 10.0 2 8.0 2 15.0 1 11.0 1 20.0 1 9.0 1 Name: hangout, dtype: int64
Note that infinite can be represented by some numeric data types, but including it does odd things to our mean and standard deviation statistics.
# infinite does odd things to our descriptive statistics hangout_series_clean.describe()
count 195.0 mean inf std NaN min 1.0 25% 3.0 50% 4.0 75% 5.0 max inf Name: hangout, dtype: float64
# use a mask to remove the infinite values which are represented as np.inf import numpy as np # need this import to address the infinite value hangout_series_clean = hangout_series_clean.loc[hangout_series_clean != np.inf] # check the result hangout_series_clean.describe()
count 193.000000 mean 5.155440 std 9.970095 min 1.000000 25% 3.000000 50% 4.000000 75% 5.000000 max 100.000000 Name: hangout, dtype: float64
# histogram of somewhat cleaned hangout preference sns.histplot(hangout_series_clean) plt.show()
# histogram of somewhat cleaned hangout preference sns.histplot(hangout_series_clean.loc[hangout_series_clean < 25]) plt.show()