The next step in our data science journey deals with elevating data sets from arrays to other formats typically encountered in practical applications. For example, it is very common for data to be housed in a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical; it may contain text, currency, boolean values, and so on. Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files whose row elements are separated by commas and can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize the use of dataframes since, in practical applications, data will consist of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
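As a brief preview (the data values below are invented for illustration), a series behaves like a labeled one-dimensional array, while a dataframe behaves like a dictionary of named columns, each of which can hold a different type:
import pandas as pd
# a series: one-dimensional labeled data, accessed like a dictionary
s = pd.Series([21.0, 22.8, 18.7], index=['Mazda RX4', 'Datsun 710', 'Hornet 4 Drive'])
s['Datsun 710'] # look up a value by its label -> 22.8
# a dataframe: named columns that may have different data types
df = pd.DataFrame({'mpg': [21.0, 22.8, 18.7],
                   'cyl': [6, 4, 6],
                   'auto': [False, True, False]})
df['mpg'] # select a column by name, as with a dictionary
df['mpg'].mean() # numeric columns support numeric methods -> 20.83...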
As you work through the Python code in this resource, you will find that the instruction
pd.read_csv('data/mtcars.csv')
will raise an exception, because the syntax assumes the data file mtcars.csv is stored on a local drive. Assuming
import pandas as pd
has been invoked, you can download the data from the textbook's URL as follows:
url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
df = pd.read_csv(url)
which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
df.head(10)
which will print out the first 10 rows of the dataframe.
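Beyond df.head(), a few standard pandas attributes are useful for verifying a load (shown here generically; the output depends on the CSV's contents):
df.shape # (number of rows, number of columns)
df.columns # column names taken from the CSV header row
df.dtypes # the data type pandas inferred for each column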
However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
#read the data from local drive
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))
This extra care must be taken for local files because Google Colab runs in the cloud and has no direct access to your local drive. This set of commands will generate a basic file-upload interface from which you can select your file named "filename.csv". Obviously, you will need to edit the filename to match the name of the file you wish to upload.
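If your file already lives in Google Drive, an alternative is to mount the drive inside Colab and read the file with an ordinary path. A minimal sketch, where the path and filename are placeholders you must adjust:
from google.colab import drive
drive.mount('/content/drive') # prompts you to authorize, then exposes your Drive as a folder
df = pd.read_csv('/content/drive/MyDrive/filename.csv') # edit the path to point at your file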
Data Structures and Types
Missing Data
Both numpy and pandas allow for missing values, which are a reality in data science. Missing values are coded as np.nan.
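One quirk worth knowing before we proceed: np.nan is not equal to anything, including itself, so test for missing values with the dedicated functions rather than ==. For example:
import numpy as np
np.nan == np.nan # False: nan compares unequal to everything, even itself
np.isnan(np.nan) # True: numpy's test for nan
pd.isna(np.nan) # True: the pandas test, which also recognizes None
np.nan + 1 # nan: missing values propagate through arithmetic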
Let's create some data and force some missing values:
import numpy as np # needed for np.random below
df = pd.DataFrame(np.random.randn(5, 3), index = ['a','c','e','f','g'], columns = ['one','two','three']) # pre-specify index and column names
df['four'] = 20 # add a column named "four", whose entries will all be 20
df['five'] = df['one'] > 0 # add a boolean column marking where column "one" is positive
df
        one       two     three  four   five
a -0.706987 -0.821679  1.441257    20  False
c  1.297128  0.501395  0.572570    20   True
e -0.761507  1.469939  0.400255    20  False
f -0.910821  0.449404  0.588879    20  False
g -0.718350 -0.364237  1.793386    20  False
df2 = df.reindex(['a','b','c','d','e','f','g']) # rows 'b' and 'd' are new, so they are filled with NaN
df2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:]) # highlight the new rows (in pandas >= 2.1, Styler.applymap is renamed Styler.map)
(displays df2 with the all-NaN rows 'b' and 'd' highlighted in yellow)
The reindex call above builds a dataframe over the new index values: indices already present in the existing data ('a', 'c', 'e', 'f', 'g') keep their rows, while the new indices ('b' and 'd') get rows filled entirely with missing values.
We can create masks of the data indicating where missing values reside in a data set.
df2.isna()
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g  False  False  False  False  False
df2['one'].notna()
a     True
b    False
c     True
d    False
e     True
f     True
g     True
Name: one, dtype: bool
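These masks are also handy for boolean indexing; for example, to keep only the rows where column 'one' was actually observed:
df2[df2['one'].notna()] # rows a, c, e, f, g survive; b and d are dropped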
We can obtain complete data by dropping any row that has any missing value. This is called complete case analysis, and you should be very careful using it. It is only valid if we believe the values are missing completely at random, that is, the missingness is unrelated to any characteristic of the data or of the data-gathering process.
df2.dropna(how='any')
        one       two     three  four   five
a -0.706987 -0.821679  1.441257  20.0  False
c  1.297128  0.501395  0.572570  20.0   True
e -0.761507  1.469939  0.400255  20.0  False
f -0.910821  0.449404  0.588879  20.0  False
g -0.718350 -0.364237  1.793386  20.0  False
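dropna accepts a few other useful arguments besides how='any'; for instance:
df2.dropna(how='all') # drop only rows in which every value is missing
df2.dropna(subset=['one']) # drop rows where column 'one' specifically is missing
df2.dropna(thresh=3) # keep only rows with at least 3 non-missing values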
You can also fill in, or impute, missing values. This can be done using a single value.
out1 = df2.fillna(value = 5)
out1.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
(displays out1 with rows 'b' and 'd' highlighted in yellow; every missing value is now 5)
or with a computed value, like a column mean:
df3 = df2.copy()
df3 = df3.select_dtypes(exclude=[object]) # remove non-numeric columns
out2 = df3.fillna(df3.mean()) # df3.mean() computes column-wise means
out2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
(displays out2 with rows 'b' and 'd' highlighted; each missing value is replaced by its column's mean)
You can also impute based on the principle of last observation carried forward (LOCF), which is common in time series. This means that a missing value is imputed with the most recent recorded value.
out3 = df2.fillna(method = 'ffill') # Fill forward
out3.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
(displays out3 with rows 'b' and 'd' highlighted; each missing value is filled with the value from the row above)
out4 = df2.fillna(method = 'bfill') # Fill backward
out4.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
(displays out4 with rows 'b' and 'd' highlighted; each missing value is filled with the value from the row below)
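One caution: in pandas 2.1 and later, the method= argument to fillna is deprecated in favor of the dedicated methods, which behave identically:
out3 = df2.ffill() # equivalent to df2.fillna(method='ffill')
out4 = df2.bfill() # equivalent to df2.fillna(method='bfill')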