The next step in our data science journey is to move from arrays to the other data formats typically encountered in practical applications. For example, data are very commonly housed in a spreadsheet (such as in Excel). In such applications, a given column need not be numerical; it may hold text, currency, or boolean values, for example. Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files in which each line is a row whose entries are separated by commas; they can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series (one-dimensional, like a single column) and dataframes (two-dimensional, like a table). This course will emphasize the use of dataframes since, in practical applications, data usually consist of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
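To make the series/dataframe distinction concrete, here is a minimal sketch. The table contents and the names cars, mpg, and first_two are made up for illustration and are not taken from the textbook.

```python
import pandas as pd

# A small, made-up table with mixed column types (illustrative values only).
cars = pd.DataFrame(
    {
        "make": ["Car A", "Car B", "Car C"],   # text
        "mpg": [21.0, 22.8, 18.7],             # floating point
        "automatic": [False, False, True],     # boolean
    }
)

# Dictionary-style access by column name returns a Series (one column).
mpg = cars["mpg"]

# Position-based slicing with iloc works much like slicing a numpy array.
first_two = cars.iloc[0:2]

print(type(mpg).__name__)
print(first_two)
```

The column name plays the role of a dictionary key, while iloc slices rows by position in the same way as a numpy array.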
As you work through the Python code in this resource, you will find that the instruction pd.read_csv('data/mtcars.csv')
will generate an exception because it assumes the data file mtcars.csv is stored in a local folder named data, which does not exist in your session. Assuming
import pandas as pd
has been invoked, you can download the data from the textbook URL as follows
url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
df = pd.read_csv(url)
which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
df.head(10)
which will print out the first 10 rows of the dataframe.
However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
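# Read a CSV file from your local drive into a dataframe (Google Colab only)
import io
import pandas as pd
from google.colab import files

# Opens a file picker; returns a dictionary mapping each file name to its contents as bytes
uploaded = files.upload()
# Wrap the bytes in a file-like object so that read_csv can parse them
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))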
This extra care must be taken for local files because Google Colab is a web-based environment and has no direct access to your local drive. This set of commands opens a basic file-selection interface from which you can choose your file. You will need to replace 'filename.csv' in the last line with the name of the file you upload.
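If you want to see what the io.BytesIO step does without uploading anything, the following sketch may help. It assumes a hand-built dictionary standing in for the result of files.upload(); the file name, the CSV contents, and the variable names name and df_local are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload(): file name -> raw bytes.
# The file name and contents are made up for this illustration.
uploaded = {"filename.csv": b"make,mpg,cyl\nCar A,21.0,6\nCar B,22.8,4\n"}

# Take the first (and only) uploaded file rather than hard-coding its name.
name = next(iter(uploaded))

# io.BytesIO wraps the bytes in a file-like object that read_csv can parse.
df_local = pd.read_csv(io.BytesIO(uploaded[name]))
print(df_local)
```

Looking up the name with next(iter(uploaded)) is one way to avoid editing the file name by hand when only a single file is uploaded.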
Exploring a Data Set
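We would like to get some idea about this data set. The DataFrame object provides several methods and attributes that help with this. Note that the textbook stores the data in a variable named mtcars; if you loaded it as df above, either run mtcars = df first or substitute df in the commands below. First, we will use head to see the first 8 rows of this data set: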
mtcars.head(8)
                make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5            Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6         Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7          Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2

[8 rows x 12 columns]
This is our first look into this data. We notice a few things. Each column has a name, and each row has an index, starting at 0.
If you're interested in the last N rows, there is a corresponding tail function.
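As a quick illustration of head and tail side by side, here is a sketch on a small made-up dataframe (the name toy and its contents are assumptions, not part of the textbook data).

```python
import pandas as pd

# Made-up dataframe with ten rows, indexed 0 through 9.
toy = pd.DataFrame({"x": range(10)})

# head(n) returns the first n rows; tail(n) returns the last n rows.
first_three = toy.head(3)
last_three = toy.tail(3)

print(first_three)
print(last_three)
```

Both methods keep the original row index, so the last three rows here are still labeled 7, 8, and 9.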
Let's look at the data type of each column:
mtcars.dtypes
make     object
mpg     float64
cyl       int64
disp    float64
hp        int64
drat    float64
wt      float64
qsec    float64
vs        int64
am        int64
gear      int64
carb      int64
dtype: object
This tells us that some of the variables, like mpg and disp, are floating point (decimal) numbers, several are integers, and make is an "object". The dtypes attribute borrows from numpy, where there isn't really a type for character or categorical variables. So, most often, when you see "object" in the output of dtypes, the column holds character (text) or categorical data.
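The following sketch shows one way to pick out the non-numeric columns and to mark a text column as categorical explicitly. The dataframe toy and its values are made up, and the exact label printed for a text column may differ between pandas versions.

```python
import pandas as pd

# Made-up data with one text column and two numeric columns.
toy = pd.DataFrame(
    {
        "make": ["Car A", "Car B", "Car A"],
        "mpg": [21.0, 22.8, 18.7],
        "cyl": [6, 4, 8],
    }
)

# dtypes is an attribute (no parentheses) listing the type of each column.
print(toy.dtypes)

# Names of the columns that are not numeric.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Convert the text column to an explicit categorical type.
toy["make"] = toy["make"].astype("category")
print(toy.dtypes)
```

Converting to the category type is optional; it simply records that the column takes a limited set of repeated values.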
We can also look at the data structure in a bit more detail.
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    32 non-null     object
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64
 9   am      32 non-null     int64
 10  gear    32 non-null     int64
 11  carb    32 non-null     int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
This tells us that this is indeed a DataFrame with 12 columns, each with 32 valid (non-null) observations. Each row has an index value ranging from 0 to 31. We also get the approximate size of this object in memory.
You can also quickly find the number of rows and columns of a data set by using the shape attribute, which is borrowed from numpy.
mtcars.shape
(32, 12)
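Because shape is a tuple, it can be unpacked directly into two variables. A minimal sketch on made-up data (toy, n_rows, and n_cols are illustrative names):

```python
import pandas as pd

# Made-up data: 4 rows and 2 columns.
toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 24.4], "cyl": [6, 4, 8, 4]})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape

# len() gives the number of rows; columns lists the column names.
print(n_rows, n_cols, len(toy), list(toy.columns))
```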
More generally, we can get a summary of each numeric variable using the describe function. Notice that the text column make is left out of the result by default.
mtcars.describe()
             mpg        cyl        disp  ...         am       gear     carb
count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000

[8 rows x 11 columns]
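The result of describe is itself a dataframe, so individual statistics can be looked up by row and column label. The sketch below uses made-up data (toy and the summary variable names are illustrative) and also shows the include="all" option, which adds the non-numeric columns to the summary.

```python
import pandas as pd

# Made-up data with one text column and one numeric column.
toy = pd.DataFrame(
    {
        "make": ["Car A", "Car B", "Car A", "Car C"],
        "mpg": [20.0, 22.0, 18.0, 24.0],
    }
)

# By default, only the numeric columns of a mixed dataframe are summarized.
numeric_summary = toy.describe()

# include="all" also summarizes the non-numeric columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

# Look up a single statistic by row label and column name.
mean_mpg = numeric_summary.loc["mean", "mpg"]
print(mean_mpg)
print(full_summary)
```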
These are usually the first steps in exploring the data.