The next step in our data science journey deals with elevating data sets from arrays to other formats typically encountered in practical applications. For example, it is very common for data to be housed in a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical (e.g., it may contain text, currency, or boolean values). Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files whose row elements are separated by commas and can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize the use of dataframes since, in practical applications, data will consist of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
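To make these ideas concrete before diving in, here is a minimal sketch (the column names and values are invented for illustration) of a pandas Series, which holds one labeled column, and a DataFrame, which bundles several named columns of different types:
import pandas as pd

# A Series: one column of values with a labeled index
prices = pd.Series([3.50, 2.00, 4.25], index=['apple', 'banana', 'cherry'])

# A DataFrame: named columns, each of a single type, but types may differ across columns
inventory = pd.DataFrame({
    'item': ['apple', 'banana', 'cherry'],   # text
    'price': [3.50, 2.00, 4.25],             # floating point
    'in_stock': [True, False, True]})        # boolean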
As you work through the Python code in this resource, you will find that the instruction pd.read_csv('data/mtcars.csv')
will generate an exception because the syntax assumes the data file mtcars.csv is stored on a local drive. Assuming
import pandas as pd
has been invoked, you can download the data from the textbook URL as follows
url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
df = pd.read_csv(url)
which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
df.head(10)
which will print out the first 10 rows of the dataframe.
However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
#read the data from local drive
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))
This extra care must be taken for local files because Google Colab is a web-based engine and does not know where your local drive is located. This set of commands will generate a basic file interface from which you can select your file named "filename.csv". Obviously, you will need to edit the filename to match the name of the file you wish to upload.
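For completeness: if you are running Python on your own machine (e.g., in Jupyter or a plain script) rather than in Colab, none of this is necessary, and you can pass the file's location directly. A minimal sketch, with a placeholder filename:
# Running locally: point read_csv at the file's path
df = pd.read_csv('filename.csv')   # or a full path such as 'data/filename.csv'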
Data Structures and Types
pandas.DataFrame
The DataFrame object holds a rectangular data set. Each column of a DataFrame is a Series object. This means that each column of a DataFrame must be comprised of data of the same type, but different columns can hold data of different types. This structure is extremely useful in practical data science. The invention of this structure was, in my opinion, transformative in making Python an effective data science tool.
Creating a DataFrame
The DataFrame can be created by importing data, as we saw in the previous section. It can also be created by a few methods within Python.
First, it can be created from a 2-dimensional numpy array.
import numpy as np

rng = np.random.RandomState(25)
d1 = pd.DataFrame(rng.normal(0,1, (4,5)))
d1
          0         1         2         3         4
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888
1 -0.222326 -0.619915  1.837905 -2.053231  0.868583
2 -0.920734 -0.232312  2.152957 -1.334661  0.076380
3 -1.246089  1.202272 -1.049942  1.056610 -0.419678
You will notice that it creates default column names that are merely the column number, starting from 0. We can also create the column names and row index (similar to the Series index we saw earlier) directly during creation.
d2 = pd.DataFrame(rng.normal(0,1, (4, 5)),
                  columns = ['A','B','C','D','E'],
                  index = ['a','b','c','d'])
d2
          A         B         C         D         E
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
We could also create a DataFrame from a list of lists, as long as things line up, just as we showed for numpy arrays. However, to me, other ways, including the dict method below, make more sense.
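For reference, a quick sketch of the list-of-lists route (the values here are made up): each inner list becomes one row, so all inner lists must have the same length.
# Each inner list is a row; lengths must match the number of columns
d3 = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns = ['x','y','z'])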
We can change the column names (which can be extracted and replaced with the columns attribute) and the index values (using the index attribute).
d2.columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
d2.columns = pd.Index(['V'+str(i) for i in range(1,6)])
# Index creates the right objects for both column names and row names,
# which can be extracted and changed with the `index` attribute
d2
         V1        V2        V3        V4        V5
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
Exercise: Can you explain what I did in the list comprehension above? The key points are understanding str and how I constructed the range.
d2.index = ['o1','o2','o3','o4']
d2
          V1        V2        V3        V4        V5
o1  2.294842 -2.594487  2.822756  0.680889 -1.577693
o2 -1.976254  0.533340 -0.290870 -0.513520  1.982626
o3  0.226001 -1.839905  1.607671  0.388292  0.399732
o4  0.405477  0.217002 -0.633439  0.246622 -1.939546
You can also extract data from a homogeneous DataFrame to a numpy array.
d1.to_numpy()
array([[ 0.22827309,  1.0268903 , -0.83958485, -0.59118152, -0.9568883 ],
       [-0.22232569, -0.61991511,  1.83790458, -2.05323076,  0.86858305],
       [-0.92073444, -0.23231186,  2.1529569 , -1.33466147,  0.07637965],
       [-1.24608928,  1.20227231, -1.04994158,  1.05661011, -0.41967767]])
It turns out that you can use to_numpy for a non-homogeneous DataFrame as well. numpy just makes it homogeneous by assigning each column the data type object. This also limits what you can do in numpy with the array and may require changing data types using the astype function. There is some more detail about the object data type in the Python Tools for Data Science (notebook, PDF) document.
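As a small illustration (a sketch with invented values) of the object behavior and the astype fix:
# Mixed column types force an object array
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
mixed.to_numpy()                     # array([[1, 'x'], [2, 'y']], dtype=object)

# Coerce a single column to a numeric type before extracting it
mixed['a'].astype(float).to_numpy()  # array([1., 2.])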
The other easy way to create a DataFrame is from a dict object, where each component object is either a list or a numpy array and is homogeneous in type. One exception is if a component is of size 1; then it is repeated to meet the needs of the DataFrame's dimensions.
df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df
     A         B          C  D    E    F
0  3.0  0.958092 2020-05-12  6  yes  NIH
1  3.0  0.883201 2020-05-12  6   no  NIH
2  3.0  0.295432 2020-05-12  6   no  NIH
3  3.0  0.512376 2020-05-12  6  yes  NIH
4  3.0  0.088702 2020-05-12  6   no  NIH
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      float64
 1   B       5 non-null      float64
 2   C       5 non-null      datetime64[ns]
 3   D       5 non-null      int64
 4   E       5 non-null      category
 5   F       5 non-null      object
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 429.0+ bytes
We note that C is a date object, E is a category object, and F is a text/string object. pandas has excellent time series capabilities (having origins in FinTech), and the Timestamp function creates DateTime objects which can be queried and manipulated in Python. We'll describe category data in the next section.
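For instance, a Timestamp can be queried for its components. A small sketch using the date from the code above:
ts = pd.Timestamp('20200512')
ts.year        # 2020
ts.month       # 5
ts.day_name()  # 'Tuesday'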
You can also create a DataFrame where each column is composed of composite objects, like lists and dicts. This might have limited value in some settings but may be useful in others. In particular, this allows capabilities like the list-column construct in R Tibbles. For example,
pd.DataFrame({'list' : [[1,2],[3,4],[5,6]],
              'tuple' : [('a','b'), ('c','d'), ('e','f')],
              'set' : [{'A','B','C'}, {'D','E'}, {'F'}],
              'dicts' : [{'A': [1,2,3]}, {'B':[5,6,8]}, {'C': [3,9]}]})
     list   tuple        set             dicts
0  [1, 2]  (a, b)  {B, A, C}  {'A': [1, 2, 3]}
1  [3, 4]  (c, d)     {E, D}  {'B': [5, 6, 8]}
2  [5, 6]  (e, f)        {F}     {'C': [3, 9]}
Working with a DataFrame
You can extract particular columns of a DataFrame by name.
df['E']
0    yes
1     no
2     no
3    yes
4     no
Name: E, dtype: category
Categories (2, object): [no, yes]
df['B']
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64
There is also a shortcut for accessing single columns, using Python's dot (.) notation.
df.B
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64
This notation can be more convenient if we need to perform operations on a single column. If we want to extract multiple columns, this notation will not work. Also, if we want to create new columns or replace existing columns, we need to use the array notation with the column name in quotes.
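A short sketch of that distinction, using the df created above (the column name G is arbitrary):
df['G'] = df['B'] * 2   # bracket notation creates a new column G
# df.G = df['B'] * 2    # dot notation would NOT create a column here; it merely
#                       # sets an attribute on the object (pandas warns about this)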
Let's look at slicing a DataFrame.
Extracting rows and columns
There are two extractor functions in pandas:
- loc extracts by label (index label, column label, a slice of labels, etc.)
- iloc extracts by index (integers, slice objects, etc.)
df2 = pd.DataFrame(rng.randint(0,10, (5,4)),
                   index = ['a','b','c','d','e'],
                   columns = ['one','two','three','four'])
df2
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
First, let's see what naively slicing this DataFrame does.
df2['one']
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64
Ok, that works. It grabs one column from the dataset. How about the dot notation?
df2.one
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64
Let's see what this produces.
type(df2.one)
<class 'pandas.core.series.Series'>
So this is a Series, which means we can slice it the way we would any Series.
df2.one['b']
9
df2.one['b':'d']
b    9
c    8
d    5
Name: one, dtype: int64
df2.one[:3]
a    5
b    9
c    8
Name: one, dtype: int64
Ok, so we have all the Series slicing available. The problem here is in semantics, in that we are grabbing one column and then slicing the rows. That doesn't quite work with our sense that a DataFrame is a rectangle with rows and columns, and we tend to think of rows, then columns.
Let's see if we can do column slicing with this.
df2[:'two']
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
That's not what we want, of course. It's giving back the entire data frame. We'll come back to this.
df2[['one','three']]
   one  three
a    5      2
b    9      0
c    8      3
d    5      7
e    6      8
That works correctly, though. We can give a list of column names. Ok.
How about row slices?
#df2['a'] # Doesn't work
df2['a':'c']
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
Ok, that works. It slices rows but includes the largest index, like a Series but unlike numpy arrays.
df2[0:2]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
Slices by location work too, but use the numpy slicing rules.
This entire extraction method becomes confusing. Let's simplify things for this and then move on to more consistent ways to extract elements of a DataFrame. Let's agree on two things. If we're going the direct extraction route,
- We will extract single columns of a DataFrame with [] or ., i.e., df2['one'] or df2.one
- We will extract slices of rows of a DataFrame using location only, i.e., df2[:3].
For everything else, we'll use two functions, loc and iloc.
- loc extracts elements like a matrix, using index and columns
- iloc extracts elements like a matrix, using location
df2.loc[:,'one':'three']
   one  two  three
a    5    3      2
b    9    3      0
c    8    4      3
d    5    2      7
e    6    7      8
df2.loc['a':'d',:]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
df2.loc['b', 'three']
0
So loc works just like a matrix, but with pandas slicing rules (include the largest index).
df2.iloc[:,1:4]
   two  three  four
a    3      2     8
b    3      0     5
c    4      3     3
d    2      7     1
e    7      8     7
df2.iloc[1:3,:]
   one  two  three  four
b    9    3      0     5
c    8    4      3     3
df2.iloc[1:3, 1:4]
   two  three  four
b    3      0     5
c    4      3     3
iloc slices like a matrix but uses numpy slicing conventions (does not include the highest index).
If we want to extract a single element from a dataset, there are two functions available, iat and at, with behavior corresponding to iloc and loc, respectively.
df2.iat[2,3]
3
df2.at['b','three']
0
Boolean selection
We can also use tests to extract data from a DataFrame. For example, we can extract only rows where the column labeled one is greater than 3.
df2[df2.one > 3]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
We can also do composite tests. Here we ask for rows where one is greater than 3 and three is less than 9.
df2[(df2.one > 3) & (df2.three < 9)]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
query
DataFrames have a query method allowing selection using a Python expression.
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns = list('abc'))
df
          a         b         c
0  0.824319  0.095623  0.459736
1  0.200426  0.302179  0.534959
2  0.381447  0.775250  0.654027
3  0.482755  0.512036  0.763493
4  0.679697  0.972007  0.946838
5  0.111028  0.909652  0.136886
6  0.817464  0.499767  0.769067
7  0.921952  0.481407  0.479161
8  0.626052  0.484680  0.518859
9  0.497046  0.045412  0.324195
df[(df.a < df.b) & (df.b < df.c)]
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
We can equivalently write this query as
df.query('(a < b) & (b < c)')
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
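One additional convenience of query worth knowing: names from the surrounding Python scope can be referenced inside the expression with an @ prefix. A small sketch (the threshold value is arbitrary):
threshold = 0.5
df.query('a < @threshold')   # equivalent to df[df.a < threshold]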
Replacing values in a DataFrame
We can replace values within a DataFrame either by position or using a query.
df2
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
df2['one'] = [2,5,2,5,2]
df2
   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3     3
d    5    2      7     1
e    2    7      8     7
df2.iat[2,3] = -9 # missing value
df2
   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
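Replacement driven by a test uses the same boolean selection machinery we saw earlier. A sketch (working on a copy so that df2 is unchanged for the examples below; the condition and new value are arbitrary):
df2_fixed = df2.copy()                            # keep df2 intact for what follows
df2_fixed.loc[df2_fixed['four'] < 0, 'four'] = 0  # zero out negative entries of 'four'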
Let's now replace values using replace, which is more flexible.
df2.replace(0, -9) # replace 0 with -9
   one  two  three  four
a    2    3      2     8
b    5    3     -9     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
df2.replace({2: 2.5, 8: 6.5}) # multiple replacements
   one  two  three  four
a  2.5  3.0    2.5   6.5
b  5.0  3.0    0.0   5.0
c  2.5  4.0    3.0  -9.0
d  5.0  2.5    7.0   1.0
e  2.5  7.0    6.5   7.0
df2.replace({'one': {5: 500}, 'three': {0: -9, 8: 800}}) # different replacements in different columns
   one  two  three  four
a    2    3      2     8
b  500    3     -9     5
c    2    4      3    -9
d  500    2      7     1
e    2    7    800     7