pandas DataFrames

Exploring a Data Set

We would like to get a first impression of this data set. The DataFrame object has several methods and attributes that help with this. First, we will use head to see the first 8 rows of the data set.

mtcars.head(8)

                make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5            Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6         Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7          Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2

[8 rows x 12 columns]

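This is our first look at the data, and we notice a few things. Each column has a name, and each row has an index, starting at 0. The ... in the middle of the output means that some columns are hidden to fit the display; the last line confirms that the data set has 12 columns.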

If you're interested in the last N rows, there is a corresponding tail method.
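As a minimal sketch of tail, the example below builds a small stand-in DataFrame from the first five rows printed above. The name cars and the choice of three columns are assumptions made so that the snippet runs on its own, not part of the original data-loading code.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: the first five rows shown above,
# restricted to three columns for brevity.
cars = pd.DataFrame({
    "make": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout"],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
})

# tail(n) returns the last n rows; the rows keep their original index labels.
last_two = cars.tail(2)
print(last_two)
```

Like head, tail returns 5 rows when called without an argument.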

Let's look at the data type of each column.

mtcars.dtypes

make     object
mpg     float64
cyl       int64
disp    float64
hp        int64
drat    float64
wt      float64
qsec    float64
vs        int64
am        int64
gear      int64
carb      int64
dtype: object

This tells us that some of the variables, like mpg and disp, are floating point (decimal) numbers, several are integers, and make is an "object". The dtypes attribute borrows its types from numpy, where there isn't really a convenient type for character or categorical variables. So when you see "object" in the output of dtypes, the column most often holds character (string) or categorical data.
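The sketch below shows two things you might do once you know the column types: pick out the numeric columns, and mark a text column as categorical explicitly. The DataFrame cars is a hypothetical stand-in built from a few of the rows above. How the text column is reported before conversion (object or a dedicated string type) can depend on the pandas version, so the example does not rely on it.

```python
import pandas as pd

# Hypothetical stand-in for mtcars with one text column and two numeric ones.
cars = pd.DataFrame({
    "make": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

# Select only the numeric columns (integers and floats).
numeric_cols = cars.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Explicitly convert the text column to pandas' categorical type.
make_cat = cars["make"].astype("category")
print(make_cat.dtype)
```

Converting to "category" is optional; it mainly helps when a column has few distinct values that repeat often.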

We can also look at the data structure in a bit more detail.

mtcars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   make    32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB

This tells us that this is indeed a DataFrame with 12 columns, each with 32 valid (non-null) observations. Each row has an index value ranging from 0 to 31. We also get the approximate size of this object in memory.
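To illustrate what the Non-Null Count column measures, here is a small sketch with one deliberately missing value. The DataFrame cars and its missing mpg entry are made up for the example; the real mtcars data has no missing values.

```python
import pandas as pd

# Hypothetical three-row frame; the second mpg value is missing on purpose.
cars = pd.DataFrame({
    "make": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, None, 18.1],
})

# count() gives the number of non-null values per column,
# the same quantity info() reports as "Non-Null Count".
non_null = cars.count()
print(non_null)

# The default index runs from 0 to (number of rows - 1).
print(cars.index.min(), cars.index.max())

cars.info()
```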

You can also quickly find the number of rows and columns of a data set by using the shape attribute, which is borrowed from numpy.

mtcars.shape

(32, 12)
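Since shape is a plain tuple of (rows, columns), it can be unpacked directly. A minimal sketch, again using a hypothetical stand-in DataFrame named cars:

```python
import pandas as pd

# Hypothetical stand-in for mtcars: 3 rows, 2 columns.
cars = pd.DataFrame({
    "make": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
})

# shape is an attribute (no parentheses) holding (n_rows, n_cols).
n_rows, n_cols = cars.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows.
print(len(cars))
```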

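More generally, we can get a summary of each variable using the describe method. By default it summarizes only the numeric columns, which is why make is missing and the output has 11 columns rather than 12.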

mtcars.describe()

             mpg        cyl        disp  ...         am       gear     carb
count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000

[8 rows x 11 columns]
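The sketch below shows the default behaviour of describe on a small stand-in DataFrame, and how a text column can be summarized separately. The name cars and the three-column subset are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for mtcars: the first five rows, three columns.
cars = pd.DataFrame({
    "make": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout"],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
})

# Default: only numeric columns are summarized, so "make" is left out.
summary = cars.describe()
print(summary)

# Calling describe on a text column reports count, unique, top and freq instead.
make_summary = cars["make"].describe()
print(make_summary)
```

Passing include="all" to describe puts both kinds of summary in one table, with NaN wherever a statistic does not apply to a column.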

These are usually the first steps in exploring the data.