Examining the DataFrame for Errors

We previously used .info() to check column names and row counts. It has a few more uses when we're dealing with dirty data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Timestamp    204 non-null    object
 1   musicartist  193 non-null    object
 2   height       203 non-null    object
 3   city         202 non-null    object
 4   30min        189 non-null    object
 5   travel       202 non-null    object
 6   likepizza    203 non-null    float64
 7   deepdish     203 non-null    object
 8   sport        197 non-null    object
 9   spell        203 non-null    object
 10  hangout      203 non-null    object
 11  talk         202 non-null    object
 12  year         197 non-null    object
 13  quote        189 non-null    object
dtypes: float64(1), object(13)
memory usage: 22.4+ KB

We can see the number of entries (rows), the number of columns and their names, the non-null count (not missing), and the inferred datatype of each column.

Null refers to a missing value. In this particular dataset, every column except the first has at least one missing value.
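As a complement to .info(), we can count the missing values per column directly with .isna().sum(). A minimal sketch, using a small stand-in DataFrame rather than the survey data itself:

```python
import pandas as pd

# Small stand-in DataFrame (not the survey data) with some missing values
df = pd.DataFrame({
    "Timestamp": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "musicartist": ["Beyonce", None, "Queen"],
    "hangout": ["4", "5", None],
})

# .isna().sum() counts missing values per column --
# the complement of the non-null count that .info() reports
print(df.isna().sum())
```

The resulting Series makes it easy to sort columns by how much data they are missing, e.g. df.isna().sum().sort_values(ascending=False).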

The inferred data type of every column is object, except for 'likepizza', which is a float. An object column holds strings or values of mixed types, which is entirely appropriate for most of these columns. However, some columns we might expect to be numeric are not. For example, 'hangout' holds responses to the question: 'What is the optimal number of people to hang out with?' We will need to dig into this column to see what's going on and convert it to a numeric data type before we can use statistical tools like .mean() on it.
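A sketch of that conversion, using a hypothetical stand-in for the 'hangout' column: pd.to_numeric with errors='coerce' turns any unparseable entries into NaN, after which .mean() works as expected.

```python
import pandas as pd

# Hypothetical stand-in for the 'hangout' column: numbers stored as strings,
# plus one free-text response that can't be parsed as a number
hangout = pd.Series(["4", "5", "3", "a few", "6"], name="hangout")

# errors='coerce' replaces unparseable values with NaN instead of raising an error
hangout_numeric = pd.to_numeric(hangout, errors="coerce")

# .mean() skips NaN by default, so the free-text response is simply ignored
print(hangout_numeric.mean())  # → 4.5, the mean of 4, 5, 3, and 6
```

In the real dataset we would assign the result back, e.g. df['hangout'] = pd.to_numeric(df['hangout'], errors='coerce'), and then inspect which rows became NaN to see what the problem responses were.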