Data Structures and Types

pandas.DataFrame

The DataFrame object holds a rectangular data set. Each column of a DataFrame is a Series object. This means that each column of a DataFrame must consist of data of the same type, but different columns can hold data of different types. This structure is extremely useful in practical data science. The invention of this structure was, in my opinion, transformative in making Python an effective data science tool.


Creating a DataFrame

The DataFrame can be created by importing data, as we saw in the previous section. It can also be created by a few methods within Python.

First, it can be created from a 2-dimensional numpy array.

import numpy as np
import pandas as pd

rng = np.random.RandomState(25)
d1 = pd.DataFrame(rng.normal(0, 1, (4, 5)))
d1
           0         1         2         3         4
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888
1 -0.222326 -0.619915  1.837905 -2.053231  0.868583
2 -0.920734 -0.232312  2.152957 -1.334661  0.076380
3 -1.246089  1.202272 -1.049942  1.056610 -0.419678

You will notice that it creates default column names that are merely the column number, starting from 0. We can also create the column names and row index (similar to the Series index we saw earlier) directly during creation.

d2 = pd.DataFrame(rng.normal(0, 1, (4, 5)),
                  columns = ['A','B','C','D','E'],
                  index = ['a','b','c','d'])
d2
      A         B         C         D         E
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546

We could also create a DataFrame from a list of lists, as long as things line up, just as we showed for numpy arrays. However, to me, other ways, including the dict method below, make more sense.
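For instance, a minimal sketch of the list-of-lists route (the column names here are invented):

```python
import pandas as pd

# Each inner list becomes one row; the lengths must line up
d_list = pd.DataFrame([[1, 2, 3],
                       [4, 5, 6]],
                      columns=['x', 'y', 'z'])
```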

We can change the column names (which can be extracted and replaced with the columns attribute) and the index values (using the index attribute).

d2.columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
d2.columns = pd.Index(['V'+str(i) for i in range(1,6)])
# pd.Index creates the right object for both column names and row names,
# which can be extracted and changed with the `index` attribute
d2
      V1        V2        V3        V4        V5
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546


Exercise: Can you explain what I did in the list comprehension above? The key points are understanding str and how I constructed the range.

d2.index = ['o1','o2','o3','o4']
d2
      V1        V2        V3        V4        V5
o1  2.294842 -2.594487  2.822756  0.680889 -1.577693
o2 -1.976254  0.533340 -0.290870 -0.513520  1.982626
o3  0.226001 -1.839905  1.607671  0.388292  0.399732
o4  0.405477  0.217002 -0.633439  0.246622 -1.939546

You can also extract data from a homogeneous DataFrame to a numpy array.

 d1.to_numpy() 
 array([[ 0.22827309,  1.0268903 , -0.83958485, -0.59118152, -0.9568883 ],
       [-0.22232569, -0.61991511,  1.83790458, -2.05323076,  0.86858305],
       [-0.92073444, -0.23231186,  2.1529569 , -1.33466147,  0.07637965],
       [-1.24608928,  1.20227231, -1.04994158,  1.05661011, -0.41967767]])

It turns out that you can use to_numpy for a non-homogeneous DataFrame as well. numpy just makes it homogeneous by assigning each column the data type object. This also limits what you can do in numpy with the array and may require changing data types using the astype function. There is some more detail about the object data type in the Python Tools for Data Science (notebook, PDF) document.
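A minimal sketch of this behavior (the DataFrame here is invented for illustration):

```python
import numpy as np
import pandas as pd

# A small mixed-type DataFrame (hypothetical example)
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

arr = mixed.to_numpy()  # numpy falls back to the object dtype

# Casting a single homogeneous column recovers a numeric array
nums = mixed['a'].to_numpy().astype(float)
```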

The other easy way to create a DataFrame is from a dict object, where each component is either a list or a numpy array and is homogeneous in type. One exception: a scalar component is broadcast, i.e., repeated to fill the length of the DataFrame.

df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df
    A         B          C  D    E    F
0  3.0  0.958092 2020-05-12  6  yes  NIH
1  3.0  0.883201 2020-05-12  6   no  NIH
2  3.0  0.295432 2020-05-12  6   no  NIH
3  3.0  0.512376 2020-05-12  6  yes  NIH
4  3.0  0.088702 2020-05-12  6   no  NIH
df.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       5 non-null      float64       
 1   B       5 non-null      float64       
 2   C       5 non-null      datetime64[ns]
 3   D       5 non-null      int64         
 4   E       5 non-null      category      
 5   F       5 non-null      object        
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 429.0+ bytes

We note that C is a date object, E is a category object, and F is a text/string object. pandas has excellent time series capabilities (having origins in FinTech), and the Timestamp function creates datetime objects which can be queried and manipulated in Python. We'll describe category data in the next section.
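For instance, a Timestamp can be queried for its components (a small sketch using the date from the example above):

```python
import pandas as pd

ts = pd.Timestamp('20200512')
ts.year        # the year component
ts.month       # the month component
ts.day_name()  # the day of the week as a string
```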

You can also create a DataFrame where each column is composed of composite objects, like lists and dicts. This might have limited value in some settings but may be useful in others. In particular, this allows capabilities like the list-column construct in R Tibbles. For example,

pd.DataFrame({'list'  : [[1,2], [3,4], [5,6]],
              'tuple' : [('a','b'), ('c','d'), ('e','f')],
              'set'   : [{'A','B','C'}, {'D','E'}, {'F'}],
              'dicts' : [{'A': [1,2,3]}, {'B': [5,6,8]}, {'C': [3,9]}]})
    list   tuple        set             dicts
0  [1, 2]  (a, b)  {B, A, C}  {'A': [1, 2, 3]}
1  [3, 4]  (c, d)     {E, D}  {'B': [5, 6, 8]}
2  [5, 6]  (e, f)        {F}     {'C': [3, 9]}


Working with a DataFrame

You can extract particular columns of a DataFrame by name

df['E'] 
0    yes
1     no
2     no
3    yes
4     no
Name: E, dtype: category
Categories (2, object): [no, yes]
df['B'] 
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64

There is also a shortcut for accessing single columns, using Python's dot (.) notation.

 df.B 
0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64

This notation can be more convenient if we need to perform operations on a single column. If we want to extract multiple columns, this notation will not work. Also, if we want to create new columns or replace existing columns, we need to use the array notation with the column name in quotes.
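A small sketch of the distinction (the column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({'B': [1.0, 2.0, 3.0]})

# Creating a new column requires the bracket notation
df['G'] = df['B'] * 2

# df.G = ... on a column that does not yet exist would set an
# attribute on the object rather than creating a column
```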

Let's look at slicing a DataFrame


Extracting rows and columns

There are two extractor functions in pandas:

  • loc extracts by label (index label, column label, a slice of labels, etc.)
  • iloc extracts by position (integers, slice objects, etc.)

df2 = pd.DataFrame(rng.randint(0, 10, (5, 4)),
                   index = ['a','b','c','d','e'],
                   columns = ['one','two','three','four'])
df2
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

First, let's see what naively slicing this DataFrame does.

df2['one'] 
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64

Ok, that works. It grabs one column from the dataset. How about the dot notation?

df2.one 
a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64

Let's see what this produces.

 type(df2.one) 
 <class 'pandas.core.series.Series'>

This is a Series, so we can slice it as we would any Series.

df2.one['b'] 
9
df2.one['b':'d'] 
b    9
c    8
d    5
Name: one, dtype: int64
df2.one[:3] 
a    5
b    9
c    8
Name: one, dtype: int64

Ok, so we have all the Series slicing available. The problem here is in semantics, in that we are grabbing one column and then slicing the rows. That doesn't quite work with our sense that a DataFrame is a rectangle with rows and columns, and we tend to think of rows, then columns.

Let's see if we can do column slicing with this.

df2[:'two'] 
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

That's not what we want, of course. It's giving back the entire data frame. We'll come back to this.

df2[['one','three']] 
    one  three
a    5      2
b    9      0
c    8      3
d    5      7
e    6      8

That works correctly, though. We can give a list of column names. Ok.

How about row slices?

 #df2['a'] # Doesn't work 
 df2['a':'c']  
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3

Ok, that works. It slices rows and includes the endpoint label, like a Series but unlike numpy arrays.

 df2[0:2] 
    one  two  three  four
a    5    3      2     8
b    9    3      0     5

Slices by location work too, but use the numpy slicing rules.

This entire extraction method becomes confusing. Let's simplify things for this and then move on to more consistent ways to extract elements of a DataFrame. Let's agree on two things. If we're going the direct extraction route,

  1. We will extract single columns of a DataFrame with [] or ., i.e., df2['one'] or df2.one
  2. We will extract slices of rows of a DataFrame using location only, i.e., df2[:3].

For everything else, we'll use two functions, loc and iloc.

  • loc extracts elements like a matrix, using index and columns
  • iloc extracts elements like a matrix, using location
 df2.loc[:,'one':'three']
    one  two  three
a    5    3      2
b    9    3      0
c    8    4      3
d    5    2      7
e    6    7      8
 df2.loc['a':'d',:]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
 df2.loc['b', 'three']
 0

So loc works just like a matrix, but with pandas slicing rules (the endpoint of a slice is included).

df2.iloc[:,1:4]
    two  three  four
a    3      2     8
b    3      0     5
c    4      3     3
d    2      7     1
e    7      8     7
 df2.iloc[1:3,:]
    one  two  three  four
b    9    3      0     5
c    8    4      3     3
 df2.iloc[1:3, 1:4]
    two  three  four
b    3      0     5
c    4      3     3

iloc slices like a matrix but uses numpy slicing conventions (does not include the highest index)

If we want to extract a single element from a dataset, there are two functions available, iat and at, with behavior corresponding to iloc and loc, respectively.

df2.iat[2,3]
3
df2.at['b','three']
 0


Boolean selection

We can also use tests to extract data from a DataFrame. For example, we can extract only rows where the column labeled one is greater than 3.

 df2[df2.one > 3]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

We can also do composite tests. Here we ask for rows where one is greater than 3 and three is less than 9.

 df2[(df2.one > 3) & (df2.three < 9)]
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
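In the same spirit, | (or) and ~ (not) work on these boolean masks; a quick sketch using invented data:

```python
import pandas as pd

df2 = pd.DataFrame({'one':   [5, 9, 8, 5, 6],
                    'three': [2, 0, 3, 7, 8]},
                   index=list('abcde'))

either  = df2[(df2.one > 6) | (df2.three > 6)]  # rows where either test holds
negated = df2[~(df2.one > 6)]                   # rows where the test fails
```

As with &, each condition must be wrapped in parentheses, since | and ~ bind more tightly than the comparison operators.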

query

DataFrames have a query method that allows selection using a Python expression.

n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns = list('abc'))
df
      a         b         c
0  0.824319  0.095623  0.459736
1  0.200426  0.302179  0.534959
2  0.381447  0.775250  0.654027
3  0.482755  0.512036  0.763493
4  0.679697  0.972007  0.946838
5  0.111028  0.909652  0.136886
6  0.817464  0.499767  0.769067
7  0.921952  0.481407  0.479161
8  0.626052  0.484680  0.518859
9  0.497046  0.045412  0.324195
 df[(df.a < df.b) & (df.b < df.c)]
     a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493

We can equivalently write this query as

 df.query('(a < b) & (b < c)')
      a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
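query can also reference Python variables in the surrounding scope with the @ prefix; a small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'a': [0.2, 0.8, 0.5],
                   'b': [0.3, 0.1, 0.9]})
limit = 0.4

# @limit refers to the Python variable, not a column of the DataFrame
sub = df.query('a < @limit')
```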


Replacing values in a DataFrame

We can replace values within a DataFrame either by position or using a query.

df2
    one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
df2['one'] = [2,5,2,5,2]
df2
    one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3     3
d    5    2      7     1
e    2    7      8     7
df2.iat[2,3] = -9 # missing value
df2
    one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
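We can also replace values selected by a test rather than by position, using loc with a boolean mask; a small sketch with invented data:

```python
import pandas as pd

df2 = pd.DataFrame({'one':  [2, 5, 2, 5, 2],
                    'four': [8, 5, 3, 1, 7]},
                   index=list('abcde'))

# Set 'four' to 0 in every row where 'one' exceeds 3
df2.loc[df2.one > 3, 'four'] = 0
```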

Let's now replace values using replace, which is more flexible.

 df2.replace(0, -9) # replace 0 with -9
    one  two  three  four
a    2    3      2     8
b    5    3     -9     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
 df2.replace({2: 2.5, 8: 6.5}) # multiple replacements
    one  two  three  four
a  2.5  3.0    2.5   6.5
b  5.0  3.0    0.0   5.0
c  2.5  4.0    3.0  -9.0
d  5.0  2.5    7.0   1.0
e  2.5  7.0    6.5   7.0
df2.replace({'one': {5: 500}, 'three': {0: -9, 8: 800}}) 
# different replacements in different columns
    one  two  three  four
a    2    3      2     8
b  500    3     -9     5
c    2    4      3    -9
d  500    2      7     1
e    2    7    800     7