Data Structures and Types

pandas.DataFrame

The DataFrame object holds a rectangular data set. Each column of a DataFrame is a Series object. This means that each column of a DataFrame must consist of data of the same type, but different columns can hold data of different types. This structure is extremely useful in practical data science. The invention of this structure was, in my opinion, transformative in making Python an effective data science tool.


Creating a DataFrame

The DataFrame can be created by importing data, as we saw in the previous section. It can also be created by a few methods within Python.

First, it can be created from a 2-dimensional numpy array.

import numpy as np
import pandas as pd

rng = np.random.RandomState(25)
d1 = pd.DataFrame(rng.normal(0, 1, (4, 5)))
d1
           0         1         2         3         4
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888
1 -0.222326 -0.619915  1.837905 -2.053231  0.868583
2 -0.920734 -0.232312  2.152957 -1.334661  0.076380
3 -1.246089  1.202272 -1.049942  1.056610 -0.419678

You will notice that it creates default column names that are merely the column number, starting from 0. We can also create the column names and row index (similar to the Series index we saw earlier) directly during creation.

d2 = pd.DataFrame(rng.normal(0, 1, (4, 5)),
                  columns = ['A','B','C','D','E'],
                  index = ['a','b','c','d'])
d2
      A         B         C         D         E
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546

We could also create a DataFrame from a list of lists, as long as things line up, just as we showed for numpy arrays. However, to me, other ways, including the dict method below, make more sense.
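For instance, a minimal sketch of the list-of-lists route (the column names here are invented):

```python
import pandas as pd

# Each inner list becomes one row; the lengths must line up
d_list = pd.DataFrame([[1, 2, 3],
                       [4, 5, 6]],
                      columns=['x', 'y', 'z'])
```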

We can change the column names (which can be extracted and replaced with the columns attribute) and the index values (using the index attribute).

d2.columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
d2.columns = pd.Index(['V'+str(i) for i in range(1,6)])
# pd.Index creates the right object for both column names and row names,
# which can be extracted and changed with the `index` attribute
d2
      V1        V2        V3        V4        V5
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546


Exercise: Can you explain what I did in the list comprehension above? The key points are understanding str and how I constructed the range.

d2.index = ['o1','o2','o3','o4']
d2
      V1        V2        V3        V4        V5
o1  2.294842 -2.594487  2.822756  0.680889 -1.577693
o2 -1.976254  0.533340 -0.290870 -0.513520  1.982626
o3  0.226001 -1.839905  1.607671  0.388292  0.399732
o4  0.405477  0.217002 -0.633439  0.246622 -1.939546

You can also extract data from a homogeneous DataFrame to a numpy array.

 d1.to_numpy() 
 array([[ 0.22827309,  1.0268903 , -0.83958485, -0.59118152, -0.9568883 ],
       [-0.22232569, -0.61991511,  1.83790458, -2.05323076,  0.86858305],
       [-0.92073444, -0.23231186,  2.1529569 , -1.33466147,  0.07637965],
       [-1.24608928,  1.20227231, -1.04994158,  1.05661011, -0.41967767]])

It turns out that you can use to_numpy for a non-homogeneous DataFrame as well. numpy just makes it homogeneous by assigning each column the data type object. This also limits what you can do in numpy with the array and may require changing data types using the astype function. There is some more detail about the object data type in the Python Tools for Data Science (notebook, PDF) document.
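A minimal sketch of this behavior (the DataFrame here is invented for illustration):

```python
import numpy as np
import pandas as pd

# A small mixed-type DataFrame (hypothetical example)
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

arr = mixed.to_numpy()  # numpy falls back to the object dtype

# Casting a single homogeneous column recovers a numeric array
nums = mixed['a'].to_numpy().astype(float)
```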

The other easy way to create a DataFrame is from a dict object, where each component is either a list or a numpy array and is homogeneous in type. One exception: a scalar component is broadcast, i.e., repeated to fill the length of the DataFrame.

df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df
    A         B          C  D    E    F
0  3.0  0.958092 2020-05-12  6  yes  NIH
1  3.0  0.883201 2020-05-12  6   no  NIH
2  3.0  0.295432 2020-05-12  6   no  NIH
3  3.0  0.512376 2020-05-12  6  yes  NIH
4  3.0  0.088702 2020-05-12  6   no  NIH
df.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       5 non-null      float64       
 1   B       5 non-null      float64       
 2   C       5 non-null      datetime64[ns]
 3   D       5 non-null      int64         
 4   E       5 non-null      category      
 5   F       5 non-null      object        
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 429.0+ bytes

We note that C is a date object, E is a category object, and F is a text/string object. pandas has excellent time series capabilities (having origins in FinTech), and the Timestamp function creates datetime objects which can be queried and manipulated in Python. We'll describe category data in the next section.
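For instance, a Timestamp can be queried for its components (a small sketch using the date from the example above):

```python
import pandas as pd

ts = pd.Timestamp('20200512')
ts.year        # the year component
ts.month       # the month component
ts.day_name()  # the day of the week as a string
```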

You can also create a DataFrame where each column is composed of composite objects, like lists and dicts. This might have limited value in some settings but may be useful in others. In particular, this allows capabilities like the list-column construct in R Tibbles. For example,

pd.DataFrame({'list'  : [[1,2], [3,4], [5,6]],
              'tuple' : [('a','b'), ('c','d'), ('e','f')],
              'set'   : [{'A','B','C'}, {'D','E'}, {'F'}],
              'dicts' : [{'A': [1,2,3]}, {'B': [5,6,8]}, {'C': [3,9]}]})
    list   tuple        set             dicts
0  [1, 2]  (a, b)  {B, A, C}  {'A': [1, 2, 3]}
1  [3, 4]  (c, d)     {E, D}  {'B': [5, 6, 8]}
2  [5, 6]  (e, f)        {F}     {'C': [3, 9]}


Working with a DataFrame

You can extract particular columns of a DataFrame by name

df['E'] 
0    yes
1     no
2     no
3    yes
4     no
Name: E, dtype: category
Categories (2, object): [no, yes]
df['B'] 
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64

There is also a shortcut for accessing single columns, using Python's dot (.) notation.

 df.B 
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64

This notation can be more convenient if we need to perform operations on a single column. If we want to extract multiple columns, this notation will not work. Also, if we want to create new columns or replace existing columns, we need to use the array notation with the column name in quotes.
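A small sketch of the distinction (the column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({'B': [1.0, 2.0, 3.0]})

# Creating a new column requires the bracket notation
df['G'] = df['B'] * 2

# df.G = ... on a column that does not yet exist would set an
# attribute on the object rather than creating a column
```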

Let's look at slicing a DataFrame


Extracting rows and columns

There are two extractor functions in pandas:

  • loc extracts by label (index label, column label, a slice of labels, etc.)
  • iloc extracts by position (integers, slice objects, etc.)

df2 = pd.DataFrame(rng.randint(0, 10, (5, 4)),
                   index = ['a','b','c','d','e'],
                   columns = ['one','two','three','four'])
df2
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

First, let's see what naively slicing this DataFrame does.

df2['one'] 
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64

Ok, that works. It grabs one column from the dataset. How about the dot notation?

df2.one 
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64

Let's see what this produces.

 type(df2.one) 
 <class 'pandas.core.series.Series'>

This is a Series, so we can slice it as we would any Series.

df2.one['b'] 
9
df2.one['b':'d'] 
b    9
c    8
d    5
Name: one, dtype: int64
df2.one[:3] 
a    5
b    9
c    8
Name: one, dtype: int64

Ok, so we have all the Series slicing available. The problem here is in semantics, in that we are grabbing one column and then slicing the rows. That doesn't quite work with our sense that a DataFrame is a rectangle with rows and columns, and we tend to think of rows, then columns.

Let's see if we can do column slicing with this.

df2[:'two'] 
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

That's not what we want, of course. It's giving back the entire data frame. We'll come back to this.

df2[['one','three']] 
    one  three
a    5      2
b    9      0
c    8      3
d    5      7
e    6      8

That works correctly, though. We can give a list of column names. Ok.

How about row slices?

 #df2['a'] # Doesn't work 
 df2['a':'c']  
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3

Ok, that works. It slices rows and includes the endpoint label, like a Series but unlike numpy arrays.

 df2[0:2] 
    one  two  three  four
a    5    3      2     8
b    9    3      0     5

Slices by location work too, but use the numpy slicing rules.

This entire extraction method becomes confusing. Let's simplify things for this and then move on to more consistent ways to extract elements of a DataFrame. Let's agree on two things. If we're going the direct extraction route,

  1. We will extract single columns of a DataFrame with [] or ., i.e., df2['one'] or df2.one
  2. We will extract slices of rows of a DataFrame using location only, i.e., df2[:3].

For everything else, we'll use two functions, loc and iloc.

  • loc extracts elements like a matrix, using index and columns
  • iloc extracts elements like a matrix, using location
 df2.loc[:,'one':'three']
    one  two  three
a    5    3      2
b    9    3      0
c    8    4      3
d    5    2      7
e    6    7      8
 df2.loc['a':'d',:]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
 df2.loc['b', 'three']
 0

So loc works just like a matrix, but with pandas slicing rules (the endpoint of a slice is included).

df2.iloc[:,1:4]
    two  three  four
a    3      2     8
b    3      0     5
c    4      3     3
d    2      7     1
e    7      8     7
 df2.iloc[1:3,:]
    one  two  three  four
b    9    3      0     5
c    8    4      3     3
 df2.iloc[1:3, 1:4]
    two  three  four
b    3      0     5
c    4      3     3

iloc slices like a matrix but uses numpy slicing conventions (does not include the highest index)

If we want to extract a single element from a dataset, there are two functions available, iat and at, with behavior corresponding to iloc and loc, respectively.

df2.iat[2,3]
3
df2.at['b','three']
 0


Boolean selection

We can also use tests to extract data from a DataFrame. For example, we can extract only rows where the column labeled one is greater than 3.

 df2[df2.one > 3]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

We can also do composite tests. Here we ask for rows where one is greater than 3 and three is less than 9.

 df2[(df2.one > 3) & (df2.three < 9)]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
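In the same spirit, | (or) and ~ (not) work on these boolean masks; a quick sketch using invented data:

```python
import pandas as pd

df2 = pd.DataFrame({'one':   [5, 9, 8, 5, 6],
                    'three': [2, 0, 3, 7, 8]},
                   index=list('abcde'))

either  = df2[(df2.one > 6) | (df2.three > 6)]  # rows where either test holds
negated = df2[~(df2.one > 6)]                   # rows where the test fails
```

As with &, each condition must be wrapped in parentheses, since | and ~ bind more tightly than the comparison operators.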

query

DataFrames have a query method that allows selection using a Python expression.

n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns = list('abc'))
df
      a         b         c
0  0.824319  0.095623  0.459736
1  0.200426  0.302179  0.534959
2  0.381447  0.775250  0.654027
3  0.482755  0.512036  0.763493
4  0.679697  0.972007  0.946838
5  0.111028  0.909652  0.136886
6  0.817464  0.499767  0.769067
7  0.921952  0.481407  0.479161
8  0.626052  0.484680  0.518859
9  0.497046  0.045412  0.324195
 df[(df.a < df.b) & (df.b < df.c)]
     a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493

We can equivalently write this query as

 df.query('(a < b) & (b < c)')
      a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
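query can also reference Python variables in the surrounding scope with the @ prefix; a small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'a': [0.2, 0.8, 0.5],
                   'b': [0.3, 0.1, 0.9]})
limit = 0.4

# @limit refers to the Python variable, not a column of the DataFrame
sub = df.query('a < @limit')
```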


Replacing values in a DataFrame

We can replace values within a DataFrame either by position or using a query.

df2
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
df2['one'] = [2,5,2,5,2]
df2
    one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3     3
d    5    2      7     1
e    2    7      8     7
df2.iat[2,3] = -9 # missing value
df2
    one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
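We can also replace values selected by a test rather than by position, using loc with a boolean mask; a small sketch with invented data:

```python
import pandas as pd

df2 = pd.DataFrame({'one':  [2, 5, 2, 5, 2],
                    'four': [8, 5, 3, 1, 7]},
                   index=list('abcde'))

# Set 'four' to 0 in every row where 'one' exceeds 3
df2.loc[df2.one > 3, 'four'] = 0
```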

Let's now replace values using replace, which is more flexible.

 df2.replace(0, -9) # replace 0 with -9
    one  two  three  four
a    2    3      2     8
b    5    3     -9     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
 df2.replace({2: 2.5, 8: 6.5}) # multiple replacements
    one  two  three  four
a  2.5  3.0    2.5   6.5
b  5.0  3.0    0.0   5.0
c  2.5  4.0    3.0  -9.0
d  5.0  2.5    7.0   1.0
e  2.5  7.0    6.5   7.0
df2.replace({'one': {5: 500}, 'three': {0: -9, 8: 800}}) 
# different replacements in different columns
    one  two  three  four
a    2    3      2     8
b  500    3     -9     5
c    2    4      3    -9
d  500    2      7     1
e    2    7    800     7