The next step in our data science journey is to move from arrays to the other data formats typically encountered in practical applications. For example, it is very common for data to be housed in a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical; it may hold text, currency, boolean values, and so on. Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated data. CSV files are simply text files in which the elements of each row are separated by commas, and they can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize the use of dataframes since, in practical applications, data will consist of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
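As a rough illustration of the ideas above, the following sketch reads a small block of comma-separated text into a dataframe and inspects its columns. The CSV content and the names csv_text, inventory, and prices are made up for this example and do not come from the textbook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """name,price,in_stock
widget,2.50,True
gadget,10.00,False
gizmo,7.25,True
"""

# read_csv accepts any file-like object, so wrap the text in StringIO.
inventory = pd.read_csv(io.StringIO(csv_text))

# Each named column keeps its own data type (text, float, boolean).
print(inventory.dtypes)

# Selecting one column by name, dictionary-style, returns a Series.
prices = inventory["price"]
print(type(prices).__name__)

# Row slicing by position works much like slicing a numpy array.
print(inventory.iloc[0:2])
```

A single column is a series, while the full table is a dataframe whose columns may each have a different type.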
As you work through the Python code in this resource, you will find that the instruction
pd.read_csv('data/mtcars.csv')
will raise a FileNotFoundError, because the relative path assumes the data file mtcars.csv is stored locally in a folder named data. Assuming
import pandas as pd
has been invoked, you can load the data directly from the textbook URL as follows:
url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
df = pd.read_csv(url)
which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
df.head(10)
which will print out the first 10 rows of the dataframe.
However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
# read the data from the local drive
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))
This extra care must be taken for local files because Google Colab runs on a remote server and cannot read your local drive directly. This set of commands will open a basic file-selection interface from which you can select your file named "filename.csv". You will need to edit the filename in the last line to match the name of the file you upload.
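To avoid hard-coding the filename, one option is to loop over whatever was uploaded. The sketch below assumes that files.upload() returns a dictionary mapping each filename to its contents as bytes, which is what the indexing in the commands above suggests. The helper name uploads_to_dataframes and the stand-in dictionary fake_upload are hypothetical and are used here only so the example can run outside Colab.

```python
import io
import pandas as pd

def uploads_to_dataframes(uploaded):
    """Convert a {filename: bytes} mapping into {filename: DataFrame}."""
    return {
        name: pd.read_csv(io.BytesIO(content))
        for name, content in uploaded.items()
    }

# Stand-in for the result of files.upload(); in Colab you would pass
# the real dictionary returned by that call instead.
fake_upload = {"example.csv": b"x,y\n1,2\n3,4\n"}

frames = uploads_to_dataframes(fake_upload)
print(frames["example.csv"])
```

In Colab, replacing fake_upload with the uploaded variable from the commands above should give one dataframe per selected file.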
Data Structures and Types
Categorical Data
pandas provides a Categorical class and a category data type. This type is analogous to the factor data type in R. It is meant for categorical or discrete variables that we need to use in analyses. Categorical variables typically take on a small number of unique values, such as gender, blood type, country of origin, or race.
You can create categorical Series in a couple of ways:
s = pd.Series(['a','b','c'], dtype='category')
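import numpy as np
# Assumption: rng is a numpy RandomState created earlier in the textbook; the seed used here is arbitrary
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df['F'].astype('category')
0    NIH
1    NIH
2    NIH
3    NIH
4    NIH
Name: F, dtype: category
Categories (1, object): [NIH]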
You can also create DataFrames where each column is categorical:
df = pd.DataFrame({'A': list('abcd'), 'B': list('bdca')})
df_cat = df.astype('category')
df_cat.dtypes
A    category
B    category
dtype: object
You can explore categorical data in a variety of ways
df_cat['A'].describe()
count     4
unique    4
top       d
freq      1
Name: A, dtype: object
df['A'].value_counts()
d    1
b    1
a    1
c    1
Name: A, dtype: int64
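Another way to explore a categorical Series is through its .cat accessor, which exposes the list of levels and the integer code stored for each row. The following is a minimal sketch; the blood Series is made-up data used only for illustration.

```python
import pandas as pd

# Hypothetical blood-type data stored as a categorical Series.
blood = pd.Series(["O", "A", "O", "AB", "A", "O"], dtype="category")

# The distinct levels; when inferred from the data they are sorted.
print(list(blood.cat.categories))   # ['A', 'AB', 'O']

# The integer code of each row, i.e. its position within the levels.
print(blood.cat.codes.tolist())     # [2, 0, 2, 1, 0, 2]
```

This mirrors how a factor in R stores integer codes alongside a table of levels.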
One issue with categorical data is that a value whose level has not been seen before can cause an error. To avoid this, you can pre-specify the categories you expect:
df_cat['B'] = pd.Categorical(list('aabb'), categories = ['a','b','c','d'])
df_cat['B'].value_counts()
b    2
a    2
d    0
c    0
Name: B, dtype: int64
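The kind of error mentioned above can be seen when assigning values to a categorical Series. The sketch below uses made-up data. It catches both TypeError and ValueError because the exception type raised for an undeclared level is assumed to differ between pandas versions.

```python
import pandas as pd

# Levels 'a', 'b', 'c' are declared up front; only 'a' and 'b' appear.
s = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b", "c"]))

# Assigning a declared level works even though it was not in the data.
s.iloc[0] = "c"

# Assigning an undeclared level is rejected.
try:
    s.iloc[1] = "z"
    assignment_failed = False
except (TypeError, ValueError):
    assignment_failed = True
print(assignment_failed)

# Declaring the new level first makes the same assignment valid.
s = s.cat.add_categories(["z"])
s.iloc[1] = "z"
print(s.tolist())
```

Listing every expected level when the column is created avoids having to add categories later.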
Re-organizing categories
In categorical data, there is often the concept of a "first" or "reference" category and an ordering of categories. This tends to be important in both visualization and regression modeling. Both aspects of a category can be addressed using the reorder_categories method, which is available through the .cat accessor.
In our earlier example, we can see that the A variable has 4 categories, with the "first" category being "a".
df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [a, b, c, d]
Suppose we want to reverse this ordering, so that "d" is the "first" category, followed by "c", "b", and "a".
df_cat['A'] = df_cat.A.cat.reorder_categories(['d','c','b','a'])
df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [d, c, b, a]
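Note that reordering changes the order of the levels but not the stored values. The sketch below repeats the reversal on a standalone Series and also passes ordered=True, an optional argument not used above, so that the levels are treated as ranked. The names letters and rev are made up for this example.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"], dtype="category")

# Reverse the level order and mark the levels as ranked.
rev = letters.cat.reorder_categories(["d", "c", "b", "a"], ordered=True)

# The values in each row are unchanged.
print(rev.tolist())                 # ['a', 'b', 'c', 'd']

# Sorting follows the level order rather than alphabetical order.
print(rev.sort_values().tolist())   # ['d', 'c', 'b', 'a']

# With ordered levels, min and max are defined by that order.
print(rev.min(), rev.max())         # d a
```

The same level order is typically what plotting and modeling tools pick up when they choose an axis order or a reference category.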