Skip to main content
Completion requirements

Data cleaning is one of the initial steps in the data science pipeline. In practical applications, we do not always need to collect data in a pristine form, and the associated dataframe can therefore contain potential anomalies. There can be missing cells, cells that have nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.

Converting String and Mixed Columns to Numeric

Data cleaning often involves taking numeric data and cleaning it up so that you can perform statistical analyses or visualizations. Dirty numeric data in a series will often be inferred by Pandas as the datatype 'object'. Once we've got the data fairly clean, we can use pd.to_numeric() to convert a series into a numeric data type.

When we call pd.to_numeric(), Pandas will attempt to convert the values of whatever object is specified into a numeric type. If a value cannot be converted, pd.to_numeric() will halt and raise an exception. However, if we know some values will not be converted and we want Pandas to ignore the exceptions, we can set an argument errors = 'coerce', and Pandas will convert the exceptions to missing values instead of throwing an exception.

# check the values of year before using pd.to_numeric()
df.loc[:, 'year'].value_counts()
3000    9
1999    7
1985    7
1776    6
2016    5
       ..
1964    1
2019    1
1700    1
1976    1
1950    1
Name: year, Length: 114, dtype: int64
# check for entries longer than 4 characters
df.loc[:, 'year'] \
  .loc[df.loc[:, 'year'].str.len() > 4] \
  .head(15)
0                 Future, no options for me in the past
5     1999 so I could go around saying that I'm from...
6     300 BC Greece, but only for a day bc no ac or ...
13                               1985 (Bears SuperBowl)
14                                               1980's
15                                     October 21, 2015
22                                         1,900,000 BC
24                                                1990s
30                                               1400's
35    2050 to see if we still exist or if the world ...
52                                               882 AD
53                           the earliest year possible
71                           This one. The past is bad.
72                                         1965 or 1977
87                                Sometime in the 1920s
Name: year, dtype: object
# assign only first four character to series
year_series = df.loc[:, 'year'].str[:4]
year_series.value_counts()
3000    9
1999    8
1985    8
1920    7
2000    6
       ..
2222    1
1902    1
2007    1
1987    1
summ    1
Name: year, Length: 100, dtype: int64
# creating a new series using pd.to_numeric()
year_series_numeric = pd.to_numeric(year_series, errors = 'coerce')

# check the values of year after using pd.to_numeric()
year_series_numeric.value_counts()
3000.0    9
1999.0    8
1985.0    8
1920.0    7
1776.0    6
         ..
1000.0    1
100.0     1
1184.0    1
500.0     1
2007.0    1
Name: year, Length: 85, dtype: int64
# getting some descriptive stats with the numeric version of year
print(year_series_numeric.mean(),
      year_series_numeric.min(),
      year_series_numeric.max(),
      year_series_numeric.mode())
2017.2342857142858 0.0 6298.0 0    3000.0
dtype: float64
# histogram of somewhat cleaned time machine question
sns.histplot(year_series_numeric)
plt.show()

# histogram of somewhat cleaned time machine question restricted to 1900-2100
sns.histplot(year_series_numeric \
             .loc[year_series_numeric \
             .isin(range(1900, 2100))])
plt.show()