Data cleaning is one of the initial steps in the data science pipeline. In practical applications, we do not always need to collect data in a pristine form, and the associated dataframe can therefore contain potential anomalies. There can be missing cells, cells that have nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.
Converting String and Mixed Columns to Numeric
Data cleaning often involves taking numeric data and cleaning it up so
that you can perform statistical analyses or visualizations. Dirty
numeric data in a series will often be inferred by Pandas as the
datatype 'object'. Once we've got the data fairly clean, we can use
pd.to_numeric()
to convert a series into a numeric data type.
When we call pd.to_numeric()
, Pandas will attempt to convert the
values of whatever object is specified into a numeric type. If a value
cannot be converted, pd.to_numeric()
will halt and raise an
exception. However, if we know some values will not be converted and we
want Pandas to ignore the exceptions, we can set an argument
errors = 'coerce'
, and Pandas will convert the exceptions to missing
values instead of throwing an exception.
# check the values of year before using pd.to_numeric() df.loc[:, 'year'].value_counts()
3000 9 1999 7 1985 7 1776 6 2016 5 .. 1964 1 2019 1 1700 1 1976 1 1950 1 Name: year, Length: 114, dtype: int64
# check for entries longer than 4 characters df.loc[:, 'year'] \ .loc[df.loc[:, 'year'].str.len() > 4] \ .head(15)
0 Future, no options for me in the past 5 1999 so I could go around saying that I'm from... 6 300 BC Greece, but only for a day bc no ac or ... 13 1985 (Bears SuperBowl) 14 1980's 15 October 21, 2015 22 1,900,000 BC 24 1990s 30 1400's 35 2050 to see if we still exist or if the world ... 52 882 AD 53 the earliest year possible 71 This one. The past is bad. 72 1965 or 1977 87 Sometime in the 1920s Name: year, dtype: object
# assign only first four character to series year_series = df.loc[:, 'year'].str[:4] year_series.value_counts()
3000 9 1999 8 1985 8 1920 7 2000 6 .. 2222 1 1902 1 2007 1 1987 1 summ 1 Name: year, Length: 100, dtype: int64
# creating a new series using pd.to_numeric() year_series_numeric = pd.to_numeric(year_series, errors = 'coerce') # check the values of year after using pd.to_numeric() year_series_numeric.value_counts()
3000.0 9 1999.0 8 1985.0 8 1920.0 7 1776.0 6 .. 1000.0 1 100.0 1 1184.0 1 500.0 1 2007.0 1 Name: year, Length: 85, dtype: int64
# getting some descriptive stats with the numeric version of year print(year_series_numeric.mean(), year_series_numeric.min(), year_series_numeric.max(), year_series_numeric.mode())
2017.2342857142858 0.0 6298.0 0 3000.0 dtype: float64
# histogram of somewhat cleaned time machine question sns.histplot(year_series_numeric) plt.show()
# histogram of somewhat cleaned time machine question restricted to 1900-2100 sns.histplot(year_series_numeric \ .loc[year_series_numeric \ .isin(range(1900, 2100))]) plt.show()