Data cleaning is one of the first steps in the data science pipeline. In practice, data rarely arrives in pristine form, so the associated dataframe may contain anomalies: missing cells, cells with nonsensical values, and so on. The pandas module offers several methods for dealing with these scenarios.
Examining the Dataframe for Errors
We previously used .info() to check column names and the number of rows. It has a few more uses when we've got dirty data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Timestamp    204 non-null    object
 1   musicartist  193 non-null    object
 2   height       203 non-null    object
 3   city         202 non-null    object
 4   30min        189 non-null    object
 5   travel       202 non-null    object
 6   likepizza    203 non-null    float64
 7   deepdish     203 non-null    object
 8   sport        197 non-null    object
 9   spell        203 non-null    object
 10  hangout      203 non-null    object
 11  talk         202 non-null    object
 12  year         197 non-null    object
 13  quote        189 non-null    object
dtypes: float64(1), object(13)
memory usage: 22.4+ KB
We can see the number of entries (rows), the number of columns and their names, the non-null count (the number of values that are not missing), and the inferred data type of each column.
Null refers to a missing value. In this particular dataset, every column except the first has at least one missing value.
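If we want a per-column tally of the missing values rather than subtracting the non-null counts from 204 ourselves, a minimal sketch like the following works (assuming df is the dataframe shown above): .isnull() flags each missing cell and .sum() counts the flags by column.
# Count missing values in each column (total rows minus the non-null count)
df.isnull().sum()
# Or express the missing values as a fraction of all rows
df.isnull().mean()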
The inferred column data types are all object, except for 'likepizza', which is a float. This means every other column contains strings or values of mixed data types, which might be entirely appropriate for most columns. However, it's notable that some columns we might expect to be numeric are not. For example, the column 'hangout' holds responses to the question 'What is the optimal number of people to hang out with?' We will need to dig into this a bit to see what's going on and convert this column to a numeric data type before we can start using statistical tools like .mean() with it.
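As a preview of that cleanup, here is one common approach, sketched under the assumption that pandas has been imported as pd and the dataframe is named df as above: .unique() shows what the raw responses look like, and pd.to_numeric with errors='coerce' converts parseable entries to numbers while turning anything unparseable into a missing value.
# Inspect the raw responses to see why pandas inferred the object dtype
df['hangout'].unique()
# Convert to numeric; entries that can't be parsed become NaN (missing)
df['hangout'] = pd.to_numeric(df['hangout'], errors='coerce')
# Numeric summaries now work on the converted column
df['hangout'].mean()
Note that coercing unparseable entries to NaN discards them, so it is worth looking at the output of .unique() first to decide whether any responses should instead be repaired by hand.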