pandas DataFrames
Data Structures and Types
pandas.DataFrame
The DataFrame object holds a rectangular data set. Each column of a DataFrame is a Series object, which means that each column must consist of data of a single type, while different columns can hold data of different types. This structure is extremely useful in practical data science. The invention of this structure was, in my opinion, transformative in making Python an effective data science tool.
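As a quick illustration (with made-up data and column names), each column really is a Series carrying its own single dtype:

```python
import pandas as pd

# a small hypothetical frame with columns of different types
df = pd.DataFrame({'x': [1, 2, 3],
                   'y': [0.5, 1.5, 2.5],
                   'z': ['a', 'b', 'c']})

print(type(df['x']))   # each column is a pandas Series
print(df.dtypes)       # one dtype per column
```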
Creating a DataFrame
The DataFrame
can be created by importing data, as we saw in the previous section. It can also be created by a few methods within Python.
First, it can be created from a 2-dimensional numpy
array.
import numpy as np
import pandas as pd

rng = np.random.RandomState(25)
d1 = pd.DataFrame(rng.normal(0, 1, (4, 5)))
d1
          0         1         2         3         4
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888
1 -0.222326 -0.619915  1.837905 -2.053231  0.868583
2 -0.920734 -0.232312  2.152957 -1.334661  0.076380
3 -1.246089  1.202272 -1.049942  1.056610 -0.419678
You will notice that it creates default column names that are merely
the column number, starting from 0. We can also create the column names
and row index (similar to the Series
index we saw earlier) directly during creation.
d2 = pd.DataFrame(rng.normal(0, 1, (4, 5)),
                  columns=['A', 'B', 'C', 'D', 'E'],
                  index=['a', 'b', 'c', 'd'])
d2
          A         B         C         D         E
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
We could also create a DataFrame from a list of lists, as long as things line up, just as we showed for numpy arrays. However, to me, other ways, including the dict method below, make more sense.
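A minimal sketch of the list-of-lists route, using made-up data, where each inner list is one row:

```python
import pandas as pd

# hypothetical data: a list of rows, each inner list one row
rows = [[1, 'a', 2.5],
        [2, 'b', 3.5],
        [3, 'c', 4.5]]
df = pd.DataFrame(rows, columns=['id', 'label', 'value'])
print(df)
```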
We can change the column names (which can be extracted and replaced with the columns
attribute) and the index values (using the index
attribute).
d2.columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
d2.columns = pd.Index(['V' + str(i) for i in range(1, 6)])
# Index creates the right objects for both column names and row names,
# which can be extracted and changed with the `index` attribute
d2
         V1        V2        V3        V4        V5
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
Exercise: Can you explain what I did in the list comprehension above? The key points are understanding str and how I constructed the range.
d2.index = ['o1', 'o2', 'o3', 'o4']
d2
          V1        V2        V3        V4        V5
o1  2.294842 -2.594487  2.822756  0.680889 -1.577693
o2 -1.976254  0.533340 -0.290870 -0.513520  1.982626
o3  0.226001 -1.839905  1.607671  0.388292  0.399732
o4  0.405477  0.217002 -0.633439  0.246622 -1.939546
You can also extract data from a homogeneous DataFrame to a numpy array.
d1.to_numpy()
array([[ 0.22827309,  1.0268903 , -0.83958485, -0.59118152, -0.9568883 ],
       [-0.22232569, -0.61991511,  1.83790458, -2.05323076,  0.86858305],
       [-0.92073444, -0.23231186,  2.1529569 , -1.33466147,  0.07637965],
       [-1.24608928,  1.20227231, -1.04994158,  1.05661011, -0.41967767]])
It turns out that you can use to_numpy for a non-homogeneous DataFrame as well. numpy just makes it homogeneous by assigning each column the data type object. This also limits what you can do in numpy with the array and may require changing data types using the astype function. There is some more detail about the object data type in the Python Tools for Data Science (notebook, PDF) document.
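A small sketch of this behavior, with made-up data and arbitrary column names:

```python
import numpy as np
import pandas as pd

# a hypothetical mixed-type frame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
arr = df.to_numpy()
print(arr.dtype)    # mixed columns are coerced to the object dtype

# a homogeneous numeric subset converts to a proper numeric array
num = df[['A']].to_numpy().astype(np.float64)
print(num.dtype)
```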
The other easy way to create a DataFrame is from a dict object, where each component object is either a list or a numpy array and is homogeneous in type. One exception is if a component is of size 1; then it is repeated to meet the needs of the DataFrame's dimensions.
df = pd.DataFrame({'A': 3.,
                   'B': rng.random_sample(5),
                   'C': pd.Timestamp('20200512'),
                   'D': np.array([6] * 5),
                   'E': pd.Categorical(['yes', 'no', 'no', 'yes', 'no']),
                   'F': 'NIH'})
df
     A         B          C  D    E    F
0  3.0  0.958092 2020-05-12  6  yes  NIH
1  3.0  0.883201 2020-05-12  6   no  NIH
2  3.0  0.295432 2020-05-12  6   no  NIH
3  3.0  0.512376 2020-05-12  6  yes  NIH
4  3.0  0.088702 2020-05-12  6   no  NIH
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      float64
 1   B       5 non-null      float64
 2   C       5 non-null      datetime64[ns]
 3   D       5 non-null      int64
 4   E       5 non-null      category
 5   F       5 non-null      object
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 429.0+ bytes
We note that C is a date object, E is a category object, and F is a text/string object. pandas has excellent time series capabilities (having origins in FinTech), and the Timestamp function creates DateTime objects which can be queried and manipulated in Python. We'll describe category data in the next section.
You can also create a DataFrame where each column is composed of composite objects, like lists and dicts. This might have limited value in some settings but may be useful in others. In particular, this allows capabilities like the list-column construct in R Tibbles. For example,
pd.DataFrame({'list': [[1, 2], [3, 4], [5, 6]],
              'tuple': [('a', 'b'), ('c', 'd'), ('e', 'f')],
              'set': [{'A', 'B', 'C'}, {'D', 'E'}, {'F'}],
              'dicts': [{'A': [1, 2, 3]}, {'B': [5, 6, 8]}, {'C': [3, 9]}]})
     list   tuple        set             dicts
0  [1, 2]  (a, b)  {B, A, C}  {'A': [1, 2, 3]}
1  [3, 4]  (c, d)     {E, D}  {'B': [5, 6, 8]}
2  [5, 6]  (e, f)        {F}     {'C': [3, 9]}
Working with a DataFrame
You can extract particular columns of a DataFrame
by name
df['E']
0    yes
1     no
2     no
3    yes
4     no
Name: E, dtype: category
Categories (2, object): [no, yes]
df['B']
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64
There is also a shortcut for accessing single columns, using Python's dot (
.
) notation.
df.B
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64
This notation can be more convenient if we need to perform operations on a single column. If we want to extract multiple columns, this notation will not work. Also, if we want to create new columns or replace existing columns, we need to use the array notation with the column name in quotes.
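As a quick sketch (with a made-up column name G) of why creating a column needs the bracket notation:

```python
import pandas as pd

df = pd.DataFrame({'B': [1, 2, 3]})

df['G'] = df['B'] * 10   # bracket notation adds a new column
# df.G = ... would only work for an existing column; on a new name it
# sets a plain Python attribute on the object instead of adding a column
print(df)
```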
Let's look at slicing a DataFrame
Extracting rows and columns
There are two extractor functions in pandas:
- loc extracts by label (index label, column label, a slice of labels, etc.)
- iloc extracts by index (integers, slice objects, etc.)
df2 = pd.DataFrame(rng.randint(0, 10, (5, 4)),
                   index=['a', 'b', 'c', 'd', 'e'],
                   columns=['one', 'two', 'three', 'four'])
df2
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
First, let's see what naively slicing this DataFrame
does.
df2['one']
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64
Ok, that works. It grabs one column from the dataset. How about the dot notation?
df2.one
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64
Let's see what this produces.
type(df2.one)
<class 'pandas.core.series.Series'>
So this is a Series, which means we can slice it like any other Series.
df2.one['b']
9
df2.one['b':'d']
b    9
c    8
d    5
Name: one, dtype: int64
df2.one[:3]
a    5
b    9
c    8
Name: one, dtype: int64
Ok, so we have all the Series
slicing available. The
problem here is in semantics, in that we are grabbing one column and
then slicing the rows. That doesn't quite work with our sense that a DataFrame
is a rectangle with rows and columns, and we tend to think of rows, then columns.
Let's see if we can do column slicing with this.
df2[:'two']
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
That's not what we want, of course. It's giving back the entire data frame. We'll come back to this.
df2[['one','three']]
   one  three
a    5      2
b    9      0
c    8      3
d    5      7
e    6      8
That works correctly, though. We can give a list of column names. Ok.
How about row slices?
# df2['a']  # Doesn't work
df2['a':'c']
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
Ok, that works. It slices rows but includes the largest index, like a Series
but unlike numpy
arrays.
df2[0:2]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
Slices by location work too, but use the numpy
slicing rules.
This entire extraction method becomes confusing. Let's simplify
things for this and then move on to more consistent ways to extract
elements of a DataFrame
. Let's agree on two things. If we're going the direct extraction route,
- We will extract single columns of a DataFrame with [] or ., i.e., df2['one'] or df2.one
- We will extract slices of rows of a DataFrame using location only, i.e., df2[:3].
For everything else, we'll use two functions, loc
and iloc
.
- loc extracts elements like a matrix, using index and columns
- iloc extracts elements like a matrix, using location
df2.loc[:,'one':'three']
   one  two  three
a    5    3      2
b    9    3      0
c    8    4      3
d    5    2      7
e    6    7      8
df2.loc['a':'d',:]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
df2.loc['b', 'three']
0
So loc works just like a matrix, but with pandas slicing rules (includes the largest index).
df2.iloc[:,1:4]
   two  three  four
a    3      2     8
b    3      0     5
c    4      3     3
d    2      7     1
e    7      8     7
df2.iloc[1:3,:]
   one  two  three  four
b    9    3      0     5
c    8    4      3     3
df2.iloc[1:3, 1:4]
   two  three  four
b    3      0     5
c    4      3     3
iloc slices like a matrix but uses numpy slicing conventions (does not include the highest index).
If we want to extract a single element from a dataset, there are two functions available, iat
and at
, with behavior corresponding to iloc
and loc
, respectively.
df2.iat[2,3]
3
df2.at['b','three']
0
Boolean selection
We can also use tests to extract data from a DataFrame
. For example, we can extract only rows where the column labeled one
is greater than 3.
df2[df2.one > 3]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
We can also do composite tests. Here we ask for rows where one is greater than 3 and three is less than 9.
df2[(df2.one > 3) & (df2.three < 9)]
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
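The same pattern extends to "or" (|) and "not" (~), and each condition needs its own parentheses. A sketch that rebuilds the df2 values above so it runs standalone:

```python
import pandas as pd

# reconstructing the df2 used above so this snippet is self-contained
df2 = pd.DataFrame({'one': [5, 9, 8, 5, 6],
                    'two': [3, 3, 4, 2, 7],
                    'three': [2, 0, 3, 7, 8],
                    'four': [8, 5, 3, 1, 7]},
                   index=list('abcde'))

either = df2[(df2.one > 8) | (df2.three > 7)]   # "or": rows b and e
negated = df2[~(df2.one > 3)]                   # "not": empty here, every 'one' exceeds 3
print(either)
print(negated)
```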
query
DataFrames have a query method allowing selection using a Python expression.
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df
          a         b         c
0  0.824319  0.095623  0.459736
1  0.200426  0.302179  0.534959
2  0.381447  0.775250  0.654027
3  0.482755  0.512036  0.763493
4  0.679697  0.972007  0.946838
5  0.111028  0.909652  0.136886
6  0.817464  0.499767  0.769067
7  0.921952  0.481407  0.479161
8  0.626052  0.484680  0.518859
9  0.497046  0.045412  0.324195
df[(df.a < df.b) & (df.b < df.c)]
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
We can equivalently write this query as
df.query('(a < b) & (b < c)')
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
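query can also reference Python variables with the @ prefix; a sketch with its own random data (the threshold name is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 3), columns=list('abc'))

threshold = 0.5
out = df.query('a < @threshold')          # @ pulls in the local variable
print(out.equals(df[df.a < threshold]))   # the two selections agree
```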
Replacing values in a DataFrame
We can replace values within a DataFrame either by position or using a query.
df2
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
df2['one'] = [2, 5, 2, 5, 2]
df2
   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3     3
d    5    2      7     1
e    2    7      8     7
df2.iat[2, 3] = -9  # missing value
df2
   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
Let's now replace values using replace
, which is more flexible.
df2.replace(0, -9) # replace 0 with -9
   one  two  three  four
a    2    3      2     8
b    5    3     -9     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
df2.replace({2: 2.5, 8: 6.5}) # multiple replacements
   one  two  three  four
a  2.5  3.0    2.5   6.5
b  5.0  3.0    0.0   5.0
c  2.5  4.0    3.0  -9.0
d  5.0  2.5    7.0   1.0
e  2.5  7.0    6.5   7.0
df2.replace({'one': {5: 500}, 'three': {0: -9, 8: 800}}) # different replacements in different columns
   one  two  three  four
a    2    3      2     8
b  500    3     -9     5
c    2    4      3    -9
d  500    2      7     1
e    2    7    800     7
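Replacement "using a query" can also be done by putting a boolean mask on the left-hand side with loc; a sketch that rebuilds a small frame like df2 above so it runs standalone:

```python
import pandas as pd

# reconstructing a frame with the df2 values above
df2 = pd.DataFrame({'one': [2, 5, 2, 5, 2],
                    'four': [8, 5, -9, 1, 7]},
                   index=list('abcde'))

# boolean-mask assignment: replace values selected by a condition
df2.loc[df2.four < 0, 'four'] = 0
print(df2)
```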