pandas Dataframes
Description
The next step in our data science journey deals with elevating data sets from arrays to other formats typically encountered in many practical applications. For example, it is very common for data to be housed in the form of a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical (it could contain, for example, text, currency, or boolean values). Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files whose row elements are separated by commas and can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize the use of dataframes since, in practical applications, data will be composed of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
As you work through the Python code in this resource, you will find that the instruction
pd.read_csv('data/mtcars.csv')
will generate an exception because the syntax assumes the data file mtcars.csv is stored on a local drive. Assuming
import pandas as pd
has been invoked, you can download the data from the textbook URL as follows
url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
df = pd.read_csv(url)
which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
df.head(10)
which will print out the first 10 rows of the dataframe.
However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
# read the data from a local drive (in Google Colab)
import io
from google.colab import files

uploaded = files.upload()                                 # opens a file-picker widget in the notebook
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))    # parse the uploaded bytes into a DataFrame
This extra care must be taken for local files because Google Colab is a web-based engine and does not know where your local drive is located. This set of commands will generate a basic file interface from which you can select your file named "filename.csv". Obviously, you will need to edit the filename to match the name of the file you wish to upload.
Introduction
pandas is the Python Data Analysis package. It allows for data ingestion, transformation, and cleaning, and creates objects that can then be passed on to analytic packages like statsmodels and scikit-learn for modeling, and to packages like matplotlib, seaborn, and plotly for visualization.
pandas is built on top of numpy, so many numpy functions are commonly used in manipulating pandas objects.
Source: Abhijit Dasgupta, https://www.araastat.com/BIOF085/pandas.html#starting-pandas
This work is licensed under a Creative Commons Attribution 4.0 License.
Starting pandas
As with any Python module, you have to "activate" pandas by using import. The "standard" alias for pandas is pd. We will also import numpy, since pandas uses some numpy functions in its workflows.
import numpy as np
import pandas as pd
Data Import and Export
Most data sets you will work with are set up in tables, so they are rectangular in shape. Think Excel spreadsheets. In pandas, the structure that will hold this kind of data is a DataFrame. We can read external data into a DataFrame using one of many read_* functions. We can also write from a DataFrame to a variety of formats using to_* functions. The most common of these are listed below:
Format type | Description | reader | writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
binary | Excel | read_excel | to_excel |
text | JSON | read_json | to_json |
binary | Feather | read_feather | to_feather |
binary | SAS | read_sas | |
SQL | SQL | read_sql | to_sql |
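As a quick illustration of how these reader/writer pairs fit together, here is a minimal round-trip sketch (the file name demo.csv and its contents are hypothetical):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})   # a small illustrative table
df.to_csv('demo.csv', index=False)    # index=False keeps the row index out of the file
df_back = pd.read_csv('demo.csv')     # read it back into a new DataFrame
df_back.equals(df)                    # True: the round trip preserved the data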
We'll start by reading in the mtcars dataset stored as a CSV file
pd.read_csv('data/mtcars.csv')
                   make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0             Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1         Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2            Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3        Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4     Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5               Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6            Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7             Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2
8              Merc 230  22.8    4  140.8   95  ...  22.90   1   0     4     2
9              Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
10            Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
11           Merc 450SE  16.4    8  275.8  180  ...  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  ...  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  ...  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  ...  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  ...  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  ...  17.42   0   0     3     4
17             Fiat 128  32.4    4   78.7   66  ...  19.47   1   1     4     1
18          Honda Civic  30.4    4   75.7   52  ...  18.52   1   1     4     2
19       Toyota Corolla  33.9    4   71.1   65  ...  19.90   1   1     4     1
20        Toyota Corona  21.5    4  120.1   97  ...  20.01   1   0     3     1
21     Dodge Challenger  15.5    8  318.0  150  ...  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  ...  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  ...  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  ...  17.05   0   0     3     2
25            Fiat X1-9  27.3    4   79.0   66  ...  18.90   1   1     4     1
26        Porsche 914-2  26.0    4  120.3   91  ...  16.70   0   1     5     2
27         Lotus Europa  30.4    4   95.1  113  ...  16.90   1   1     5     2
28       Ford Pantera L  15.8    8  351.0  264  ...  14.50   0   1     5     4
29         Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
30        Maserati Bora  15.0    8  301.0  335  ...  14.60   0   1     5     8
31           Volvo 142E  21.4    4  121.0  109  ...  18.60   1   1     4     2

[32 rows x 12 columns]
This just prints out the data, but then it's lost. To use this data, we have to give it a name, so it's stored in Python's memory
mtcars = pd.read_csv('data/mtcars.csv')
One of the big differences between a spreadsheet program and a programming language from the data science perspective is that you have to load data into the programming language. It's not "just there" like Excel. This is a good thing since it allows the common functionality of the programming language to work across multiple data sets and keeps the original data set pristine. Excel users can run into problems and corrupt their data if they are not careful.
If we wanted to write this data set back out into an Excel file, say, we could do
mtcars.to_excel('data/mtcars.xlsx')
You may get an error if you don't have the openpyxl package installed. You can easily install it from the Anaconda prompt using conda install openpyxl and following the prompts.
Exploring a Data Set
We would like to get some idea about this data set. There are a bunch of functions linked to the DataFrame object that help us with this. First, we will use head to see the first 8 rows of this data set
mtcars.head(8)
                 make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0           Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1       Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2          Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3      Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4   Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5             Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6          Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7           Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2

[8 rows x 12 columns]
This is our first look into this data. We notice a few things. Each column has a name, and each row has an index, starting at 0.
If you're interested in the last N rows, there is a corresponding tail function.
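For instance, assuming the mtcars DataFrame loaded above, this pulls the last three rows:
mtcars.tail(3)   # tail() defaults to the last 5 rows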
Let's look at the data types of each of the columns
mtcars.dtypes
make     object
mpg     float64
cyl       int64
disp    float64
hp        int64
drat    float64
wt      float64
qsec    float64
vs        int64
am        int64
gear      int64
carb      int64
dtype: object
This tells us that some of the variables, like mpg and disp, are floating point (decimal) numbers, several are integers, and make is an "object". The dtypes function borrows from numpy, where there isn't really a type for character or categorical variables. So most often, when you see "object" in the output of dtypes, you can think of it as a character or categorical variable.
We can also look at the data structure in a bit more detail.
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    32 non-null     object
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64
 9   am      32 non-null     int64
 10  gear    32 non-null     int64
 11  carb    32 non-null     int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
This tells us that this is indeed a DataFrame with 12 columns, each with 32 valid observations. Each row has an index value ranging from 0 to 31. We also get the approximate size of this object in memory.
You can also quickly find the number of rows and columns of a data set by using shape, which is borrowed from numpy.
mtcars.shape
(32, 12)
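Since shape is just a tuple, you can unpack it directly; len() on a DataFrame also counts rows. A small sketch:
n_rows, n_cols = mtcars.shape   # unpack (rows, columns)
len(mtcars)                     # 32; len() gives the number of rows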
More generally, we can get a summary of each variable using the describe function
mtcars.describe()
             mpg        cyl        disp  ...         am       gear     carb
count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000

[8 rows x 11 columns]
These are usually the first steps in exploring the data.
Data Structures and Types
pandas has two main data types: Series and DataFrame. These are analogous to vectors and matrices in that a Series is 1-dimensional while a DataFrame is 2-dimensional.
pandas.Series
The Series object holds data from a single input variable and is required, much like numpy arrays, to be homogeneous in type. You can create Series objects from lists or numpy arrays quite easily
s = pd.Series([1,3,5,np.nan, 9, 13])
s

0     1.0
1     3.0
2     5.0
3     NaN
4     9.0
5    13.0
dtype: float64

s2 = pd.Series(np.arange(1,20))
s2

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
dtype: int64
You can access elements of a Series much like a dict
s2[4]
5
There is no requirement that the index of a Series has to be numeric. It can be any kind of scalar object
s3 = pd.Series(np.random.normal(0,1, (5,)), index = ['a','b','c','d','e'])
s3

a   -0.283473
b    0.157530
c    1.051739
d    0.859905
e    1.178951
dtype: float64
s3['d']
0.859904696094078
s3['a':'d']
a   -0.283473
b    0.157530
c    1.051739
d    0.859905
dtype: float64
Well, slicing worked, but it gave us something different than expected: it returned both the start and the end of the slice, unlike anything we've encountered so far!
It turns out that in pandas, slicing by label includes both endpoints. This is a discrepancy from numpy and Python in general that we have to be careful about.
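To make the distinction concrete, here is a small sketch contrasting label-based slicing with positional slicing on the s3 series above (iloc, covered in more detail later, slices by position):
s3['a':'c']    # label slice: both endpoints included, 3 elements
s3.iloc[0:3]   # positional slice: numpy rules, endpoint excluded, also 3 elements
s3.iloc[0:2]   # only the first two elements, 'a' and 'b'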
You can extract the actual values into a numpy array
s3.to_numpy()
array([-0.28347282, 0.1575304 , 1.05173885, 0.8599047 , 1.17895111])
In fact, you'll see that many of pandas' structures are built on top of numpy arrays. This is a good thing, since you can take advantage of the powerful numpy functions that are built for fast, efficient scientific computing.
Making the point about slicing again,
s3.to_numpy()[0:3]
array([-0.28347282, 0.1575304 , 1.05173885])
This is different from the label-based slicing done earlier.
pandas.DataFrame
The DataFrame object holds a rectangular data set. Each column of a DataFrame is a Series object. This means that each column of a DataFrame must be comprised of data of the same type, but different columns can hold data of different types. This structure is extremely useful in practical data science. The invention of this structure was, in my opinion, transformative in making Python an effective data science tool.
Creating a DataFrame
The DataFrame can be created by importing data, as we saw in the previous section. It can also be created by a few methods within Python.
First, it can be created from a 2-dimensional numpy array.
rng = np.random.RandomState(25)
d1 = pd.DataFrame(rng.normal(0,1, (4,5)))
d1

          0         1         2         3         4
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888
1 -0.222326 -0.619915  1.837905 -2.053231  0.868583
2 -0.920734 -0.232312  2.152957 -1.334661  0.076380
3 -1.246089  1.202272 -1.049942  1.056610 -0.419678
You will notice that it creates default column names that are merely the column number, starting from 0. We can also create the column names and row index (similar to the Series index we saw earlier) directly during creation.
d2 = pd.DataFrame(rng.normal(0,1, (4, 5)),
                  columns = ['A','B','C','D','E'],
                  index = ['a','b','c','d'])
d2

          A         B         C         D         E
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
We could also create a DataFrame from a list of lists, as long as things line up, just as we showed for numpy arrays. However, to me, other ways, including the dict method below, make more sense.
We can change the column names (which can be extracted and replaced with the columns attribute) and the index values (using the index attribute).
d2.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

d2.columns = pd.Index(['V'+str(i) for i in range(1,6)])
# Index creates the right objects for both column names and row names,
# which can be extracted and changed with the `index` attribute
d2

         V1        V2        V3        V4        V5
a  2.294842 -2.594487  2.822756  0.680889 -1.577693
b -1.976254  0.533340 -0.290870 -0.513520  1.982626
c  0.226001 -1.839905  1.607671  0.388292  0.399732
d  0.405477  0.217002 -0.633439  0.246622 -1.939546
Exercise: Can you explain what I did in the list comprehension above? The key points are understanding str and how I constructed the range.
d2.index = ['o1','o2','o3','o4']
d2

          V1        V2        V3        V4        V5
o1  2.294842 -2.594487  2.822756  0.680889 -1.577693
o2 -1.976254  0.533340 -0.290870 -0.513520  1.982626
o3  0.226001 -1.839905  1.607671  0.388292  0.399732
o4  0.405477  0.217002 -0.633439  0.246622 -1.939546
You can also extract data from a homogeneous DataFrame to a numpy array
d1.to_numpy()

array([[ 0.22827309,  1.0268903 , -0.83958485, -0.59118152, -0.9568883 ],
       [-0.22232569, -0.61991511,  1.83790458, -2.05323076,  0.86858305],
       [-0.92073444, -0.23231186,  2.1529569 , -1.33466147,  0.07637965],
       [-1.24608928,  1.20227231, -1.04994158,  1.05661011, -0.41967767]])
It turns out that you can use to_numpy for a non-homogeneous DataFrame as well. numpy just makes it homogeneous by assigning each column the data type object. This also limits what you can do in numpy with the array and may require changing data types using the astype function. There is some more detail about the object data type in the Python Tools for Data Science (notebook, PDF) document.
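As a minimal sketch of that behavior (the column names here are illustrative):
mixed = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
arr = mixed.to_numpy()      # numpy falls back to dtype object for mixed columns
arr.dtype                   # dtype('O')
arr[:, 0].astype(float)     # recover a numeric array from the first column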
The other easy way to create a DataFrame is from a dict object, where each component object is either a list or a numpy array and is homogeneous in type. One exception is if a component is of size 1; then it is repeated to meet the needs of the DataFrame's dimensions
df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df

     A         B          C  D    E    F
0  3.0  0.958092 2020-05-12  6  yes  NIH
1  3.0  0.883201 2020-05-12  6   no  NIH
2  3.0  0.295432 2020-05-12  6   no  NIH
3  3.0  0.512376 2020-05-12  6  yes  NIH
4  3.0  0.088702 2020-05-12  6   no  NIH

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      float64
 1   B       5 non-null      float64
 2   C       5 non-null      datetime64[ns]
 3   D       5 non-null      int64
 4   E       5 non-null      category
 5   F       5 non-null      object
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 429.0+ bytes
We note that C is a date object, E is a category object, and F is a text/string object. pandas has excellent time series capabilities (having origins in FinTech), and the Timestamp function creates DateTime objects which can be queried and manipulated in Python. We'll describe category data in the next section.
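As a brief, hedged taste of those time series capabilities:
ts = pd.Timestamp('2020-05-12')
ts.day_name()                                      # 'Tuesday'
pd.date_range('2020-05-12', periods=3, freq='D')   # three consecutive daily timestamps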
You can also create a DataFrame where each column is composed of composite objects, like lists and dicts, as well. This might have limited value in some settings but may be useful in others. In particular, this allows capabilities like the list-column construct in R Tibbles. For example (note that each column must be supplied as a list of objects for the rows to line up):
pd.DataFrame({'list' : [[1,2], [3,4], [5,6]],
              'tuple' : [('a','b'), ('c','d'), ('e','f')],
              'set' : [{'A','B','C'}, {'D','E'}, {'F'}],
              'dicts' : [{'A': [1,2,3]}, {'B':[5,6,8]}, {'C': [3,9]}]})

     list   tuple        set             dicts
0  [1, 2]  (a, b)  {B, A, C}  {'A': [1, 2, 3]}
1  [3, 4]  (c, d)     {E, D}  {'B': [5, 6, 8]}
2  [5, 6]  (e, f)        {F}     {'C': [3, 9]}
Working with a DataFrame
You can extract particular columns of a DataFrame by name
df['E']

0    yes
1     no
2     no
3    yes
4     no
Name: E, dtype: category
Categories (2, object): [no, yes]

df['B']

0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64

There is also a shortcut for accessing single columns, using Python's dot (.) notation.
df.B

0    0.958092
1    0.883201
2    0.295432
3    0.512376
4    0.088702
Name: B, dtype: float64
This notation can be more convenient if we need to perform operations on a single column. If we want to extract multiple columns, this notation will not work. Also, if we want to create new columns or replace existing columns, we need to use the array notation with the column name in quotes.
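A quick sketch of that caveat (the column names G and H are hypothetical):
df['G'] = df.B * 2    # array notation: creates a new column
df.H = df.B * 2       # dot notation: only attaches an attribute, no new column
'H' in df.columns     # False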
Let's look at slicing a DataFrame
Extracting rows and columns
There are two extractor functions in pandas:

- loc extracts by label (index label, column label, a slice of labels, etc.)
- iloc extracts by integer position (integers, slice objects, etc.)

df2 = pd.DataFrame(rng.randint(0,10, (5,4)),
                   index = ['a','b','c','d','e'],
                   columns = ['one','two','three','four'])
df2

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
First, let's see what naively slicing this DataFrame does.
df2['one']

a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64

Ok, that works. It grabs one column from the dataset. How about the dot notation?
df2.one

a    5
b    9
c    8
d    5
e    6
Name: one, dtype: int64
Let's see what this produces.
type(df2.one)
<class 'pandas.core.series.Series'>
So this is a Series, which means we can potentially slice it like one.
df2.one['b']

9

df2.one['b':'d']

b    9
c    8
d    5
Name: one, dtype: int64

df2.one[:3]

a    5
b    9
c    8
Name: one, dtype: int64

Ok, so we have all the Series slicing available. The problem here is in semantics, in that we are grabbing one column and then slicing the rows. That doesn't quite work with our sense that a DataFrame is a rectangle with rows and columns, and we tend to think of rows, then columns.
Let's see if we can do column slicing with this.
df2[:'two']
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
That's not what we want, of course. It's giving back the entire data frame. We'll come back to this.
df2[['one','three']]
   one  three
a    5      2
b    9      0
c    8      3
d    5      7
e    6      8
That works correctly, though. We can give a list of column names. Ok.
How about row slices?
#df2['a']   # Doesn't work
df2['a':'c']

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3

Ok, that works. It slices rows but includes the largest index, like a Series but unlike numpy arrays.
df2[0:2]

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
Slices by location work too, but use the numpy slicing rules.
This entire extraction method becomes confusing. Let's simplify things for this and then move on to more consistent ways to extract elements of a DataFrame. Let's agree on two things. If we're going the direct extraction route:

- We will extract single columns of a DataFrame with [] or ., i.e., df2['one'] or df2.one
- We will extract slices of rows of a DataFrame using location only, i.e., df2[:3].
For everything else, we'll use two functions, loc and iloc.

- loc extracts elements like a matrix, using index and columns
- iloc extracts elements like a matrix, using location

df2.loc[:,'one':'three']

   one  two  three
a    5    3      2
b    9    3      0
c    8    4      3
d    5    2      7
e    6    7      8

df2.loc['a':'d',:]

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
df2.loc['b', 'three']
0
So loc works just like a matrix, but with pandas slicing rules (include the largest index)
df2.iloc[:,1:4]

   two  three  four
a    3      2     8
b    3      0     5
c    4      3     3
d    2      7     1
e    7      8     7

df2.iloc[1:3,:]

   one  two  three  four
b    9    3      0     5
c    8    4      3     3

df2.iloc[1:3, 1:4]

   two  three  four
b    3      0     5
c    4      3     3
iloc slices like a matrix but uses numpy slicing conventions (does not include the highest index).
If we want to extract a single element from a dataset, there are two functions available, iat and at, with behavior corresponding to iloc and loc, respectively.
df2.iat[2,3]
3
df2.at['b','three']
0
Boolean selection
We can also use tests to extract data from a DataFrame. For example, we can extract only rows where the column labeled one is greater than 3.
df2[df2.one > 3]

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
We can also do composite tests. Here we ask for rows where one is greater than 3 and three is less than 9
df2[(df2.one > 3) & (df2.three < 9)]

   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7
query
DataFrames have a query method allowing selection using a Python expression
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns = list('abc'))
df

          a         b         c
0  0.824319  0.095623  0.459736
1  0.200426  0.302179  0.534959
2  0.381447  0.775250  0.654027
3  0.482755  0.512036  0.763493
4  0.679697  0.972007  0.946838
5  0.111028  0.909652  0.136886
6  0.817464  0.499767  0.769067
7  0.921952  0.481407  0.479161
8  0.626052  0.484680  0.518859
9  0.497046  0.045412  0.324195
df[(df.a < df.b) & (df.b < df.c)]
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
We can equivalently write this query as
df.query('(a < b) & (b < c)')
          a         b         c
1  0.200426  0.302179  0.534959
3  0.482755  0.512036  0.763493
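query can also refer to Python variables in the calling scope by prefixing them with @; for example, with a hypothetical cutoff variable:
cutoff = 0.5
df.query('a < @cutoff')   # rows where column a is below the value of cutoff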
Replacing values in a DataFrame
We can replace values within a DataFrame either by position or using a query.
df2
   one  two  three  four
a    5    3      2     8
b    9    3      0     5
c    8    4      3     3
d    5    2      7     1
e    6    7      8     7

df2['one'] = [2,5,2,5,2]
df2

   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3     3
d    5    2      7     1
e    2    7      8     7

df2.iat[2,3] = -9   # missing value
df2

   one  two  three  four
a    2    3      2     8
b    5    3      0     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7
Let's now replace values using replace, which is more flexible.
df2.replace(0, -9)   # replace 0 with -9

   one  two  three  four
a    2    3      2     8
b    5    3     -9     5
c    2    4      3    -9
d    5    2      7     1
e    2    7      8     7

df2.replace({2: 2.5, 8: 6.5})   # multiple replacements

   one  two  three  four
a  2.5  3.0    2.5   6.5
b  5.0  3.0    0.0   5.0
c  2.5  4.0    3.0  -9.0
d  5.0  2.5    7.0   1.0
e  2.5  7.0    6.5   7.0

df2.replace({'one': {5: 500}, 'three': {0: -9, 8: 800}})   # different replacements in different columns

   one  two  three  four
a    2    3      2     8
b  500    3     -9     5
c    2    4      3    -9
d  500    2      7     1
e    2    7    800     7
Categorical Data
pandas provides a Categorical function and a category object type to Python. This type is analogous to the factor data type in R. It is meant to address categorical or discrete variables, where we need to use them in analyses. Categorical variables typically take on a small number of unique values, like gender, blood type, country of origin, race, etc.
You can create categorical Series in a couple of ways:
s = pd.Series(['a','b','c'], dtype='category')
df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes','no','no','yes','no']),
    'F': 'NIH'})
df['F'].astype('category')

0    NIH
1    NIH
2    NIH
3    NIH
4    NIH
Name: F, dtype: category
Categories (1, object): [NIH]

You can also create DataFrames where each column is categorical
df = pd.DataFrame({'A': list('abcd'), 'B': list('bdca')})
df_cat = df.astype('category')
df_cat.dtypes

A    category
B    category
dtype: object
You can explore categorical data in a variety of ways
df_cat['A'].describe()
count     4
unique    4
top       d
freq      1
Name: A, dtype: object

df['A'].value_counts()

d    1
b    1
a    1
c    1
Name: A, dtype: int64
One issue with categories is that, if a particular level of a category has not been seen before, it can create an error. So you can pre-specify the categories you expect:
df_cat['B'] = pd.Categorical(list('aabb'), categories = ['a','b','c','d'])
df_cat['B'].value_counts()

b    2
a    2
d    0
c    0
Name: B, dtype: int64
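If the column is already categorical, you can also add levels after the fact through the cat accessor; a minimal sketch (the new level 'e' is hypothetical):
df_cat['B'] = df_cat['B'].cat.add_categories(['e'])
df_cat['B'].cat.categories   # now includes 'e', still with zero observations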
Re-organizing categories
In categorical data, there is often the concept of a "first" or "reference" category, and an ordering of categories. This tends to be important in both visualization and regression modeling. Both aspects of a category can be addressed using the reorder_categories function.
In our earlier example, we can see that the A variable has 4 categories, with the "first" category being "a".
df_cat.A

0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [a, b, c, d]
Suppose we want to reverse this ordering, so that "d" is the "first" category and the rest follow in reverse order.
df_cat['A'] = df_cat.A.cat.reorder_categories(['d','c','b','a'])
df_cat.A

0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [d, c, b, a]
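reorder_categories also accepts an ordered argument, which makes comparisons between levels meaningful for ordinal data; a small sketch:
s = pd.Series(['low', 'high', 'medium'], dtype='category')
s = s.cat.reorder_categories(['low', 'medium', 'high'], ordered=True)
s.min()       # 'low'; min/max now respect the ordering
s < 'high'    # elementwise comparisons against a level also work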
Missing Data
Both numpy and pandas allow for missing values, which are a reality in data science. The missing values are coded as np.nan. Let's create some data and force some missing values
df = pd.DataFrame(np.random.randn(5, 3), index = ['a','c','e', 'f','g'], columns = ['one','two','three']) # pre-specify index and column names
df['four'] = 20 # add a column named "four", which will all be 20
df['five'] = df['one'] > 0
df
        one       two     three  four   five
a -0.706987 -0.821679  1.441257    20  False
c  1.297128  0.501395  0.572570    20   True
e -0.761507  1.469939  0.400255    20  False
f -0.910821  0.449404  0.588879    20  False
g -0.718350 -0.364237  1.793386    20  False
df2 = df.reindex(['a','b','c','d','e','f','g'])
df2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cbd6040>
The code above creates new blank rows based on the new index values: 'b' and 'd' are not present in the existing data, so their entries are all missing. The styling call highlights these new rows; in a notebook, the Styler object renders as a highlighted table.
We can create masks of the data indicating where missing values reside in a data set.
df2.isna()
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g  False  False  False  False  False
df2['one'].notna()
a     True
b    False
c     True
d    False
e     True
f     True
g     True
Name: one, dtype: bool
We can obtain complete data by dropping any row that has any missing value. This is called complete case analysis, and you should be very careful using it. It is only valid if we believe that the missingness is missing at random and not related to some characteristic of the data or the data gathering process.
df2.dropna(how='any')
        one       two     three  four   five
a -0.706987 -0.821679  1.441257  20.0  False
c  1.297128  0.501395  0.572570  20.0   True
e -0.761507  1.469939  0.400255  20.0  False
f -0.910821  0.449404  0.588879  20.0  False
g -0.718350 -0.364237  1.793386  20.0  False
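dropna also has options that make it less blunt than dropping every row with any missing value; for example:
df2.dropna(subset=['one'])   # drop rows only if column 'one' is missing
df2.dropna(thresh=3)         # keep rows having at least 3 non-missing values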
You can also fill in, or impute, missing values. This can be done using a single value.
out1 = df2.fillna(value = 5)
out1.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cf5fca0>
or a computed value like a column mean
df3 = df2.copy()
df3 = df3.select_dtypes(exclude=[object])   # remove non-numeric columns
out2 = df3.fillna(df3.mean())               # df3.mean() computes column-wise means
out2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cf830d0>
You can also impute based on the principle of last value carried forward, which is common in time series. This means that the missing value is imputed with the previous recorded value.
out3 = df2.fillna(method = 'ffill')   # Fill forward
out3.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cbeca60>
out4 = df2.fillna(method = 'bfill')   # Fill backward
out4.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11c
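One caveat worth noting: in recent pandas releases, fillna(method=...) has been deprecated in favor of dedicated methods, so the equivalent modern calls would be:
out3 = df2.ffill()   # forward fill: last value carried forward
out4 = df2.bfill()   # backward fill: next value carried backward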