Example with 'hangout'

Let's see if we can figure out the average for responses to the question: "What is the optimal number of people to hang out with?"

# drop duplicates
df_drop = df.drop_duplicates(subset = df.columns[1:])

# drop missing values
hangout_series = df_drop.loc[:, 'hangout']

# check the values
hangout_series.value_counts()
3                                          50
4                                          42
5                                          23
6                                          15
2                                          15
7                                           7
4-5                                         6
1                                           6
5-6                                         2
8                                           2
10                                          2
4-6                                         2
100                                         2
3-7, more is too crowded                    1
1-6                                         1
3-4                                         1
3 others (4 total)                          1
3 and me for a game of euchre               1
one                                         1
1-2                                         1
People with similar interests               1
20                                          1
6-8                                         1
3 (4 including me)                          1
depends, 4-6 is most comfortable though     1
Infinite                                    1
9                                           1
15                                          1
4 (including me)                            1
5-7                                         1
6-10                                        1
Five                                        1
11? I have zero preference on this one.     1
5 or 6                                      1
Unsure                                      1
infinity                                    1
five                                        1
Name: hangout, dtype: int64
# address range responses, like 4-5, by splitting on - and taking the first value
hangout_series_clean = hangout_series.str.split('-').str[0]

# check the values
hangout_series_clean.value_counts()
3                                          52
4                                          50
5                                          26
6                                          17
2                                          15
1                                           8
7                                           7
10                                          2
100                                         2
8                                           2
infinity                                    1
one                                         1
3 and me for a game of euchre               1
3 (4 including me)                          1
15                                          1
20                                          1
3 others (4 total)                          1
9                                           1
Five                                        1
4 (including me)                            1
5 or 6                                      1
People with similar interests               1
11? I have zero preference on this one.     1
depends, 4                                  1
Infinite                                    1
Unsure                                      1
five                                        1
Name: hangout, dtype: int64
# address mixed number response by splitting on ' '  and taking the first value of the resulting list
hangout_series_clean = hangout_series_clean.str.split(' ').str[0]

# check the values
hangout_series_clean.value_counts()
3           55
4           51
5           27
6           17
2           15
1            8
7            7
100          2
10           2
8            2
one          1
Five         1
20           1
15           1
depends,     1
Infinite     1
9            1
People       1
Unsure       1
11?          1
five         1
infinity     1
Name: hangout, dtype: int64
# address 'one', 'five', '11?', and 'infinite' using replace
hangout_series_clean = hangout_series_clean.str.lower()
hangout_series_clean = hangout_series_clean.str.replace('one', '1')
hangout_series_clean = hangout_series_clean.str.replace('five', '5')
hangout_series_clean = hangout_series_clean.str.replace('?', '')
hangout_series_clean = hangout_series_clean.str.replace('infinite', 'infinity')

# check the values
hangout_series_clean.value_counts()
3           55
4           51
5           29
6           17
2           15
1            9
7            7
10           2
infinity     2
8            2
100          2
9            1
unsure       1
depends,     1
20           1
15           1
11           1
people       1
Name: hangout, dtype: int64
# convert to numeric using coerce argument
hangout_series_clean = pd.to_numeric(hangout_series_clean, errors = 'coerce')

# check the values
hangout_series_clean.value_counts()
3.0      55
4.0      51
5.0      29
6.0      17
2.0      15
1.0       9
7.0       7
inf       2
100.0     2
10.0      2
8.0       2
15.0      1
11.0      1
20.0      1
9.0       1
Name: hangout, dtype: int64

Note that infinite can be represented by some numeric data types, but including it does odd things to our mean and standard deviation statistics.

# infinite does odd things to our descriptive statistics
hangout_series_clean.describe()
count    195.0
mean       inf
std        NaN
min        1.0
25%        3.0
50%        4.0
75%        5.0
max        inf
Name: hangout, dtype: float64
# use a mask to remove the infinite values which are represented as np.inf
import numpy as np # need this import to address the infinite value
hangout_series_clean = hangout_series_clean.loc[hangout_series_clean != np.inf]

# check the result
hangout_series_clean.describe()
count    193.000000
mean       5.155440
std        9.970095
min        1.000000
25%        3.000000
50%        4.000000
75%        5.000000
max      100.000000
Name: hangout, dtype: float64
# histogram of somewhat cleaned hangout preference
sns.histplot(hangout_series_clean)
plt.show()


# histogram of somewhat cleaned hangout preference
sns.histplot(hangout_series_clean.loc[hangout_series_clean < 25])
plt.show()