pandas Data Structures
Description
With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If data is numerical, it is acceptable to think of a dataframe or series as an array, and arithmetic operations obey a set of rules similar to that of numpy. In addition, pandas has been designed with database programming features such as the merge, join, and concatenate operations. Study these sections to practice and get an introduction to some basic pandas operations.
Data Transformation
Arithmetic operations
If you have a Series or DataFrame that is all numeric, you can add or multiply a single number with all of the elements at once.
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randn(4, 5))
print(A)
          0         1         2         3         4
0  0.243690  1.076332  0.950349  1.373858  0.868005
1  2.712568  1.046159 -2.044472 -0.630508  0.945787
2  0.413522  1.938513 -0.195251  0.350383  1.381913
3 -0.927886 -1.882394 -2.163616 -0.142430  1.091522
print(A + 6)
          0         1         2         3         4
0  6.243690  7.076332  6.950349  7.373858  6.868005
1  8.712568  7.046159  3.955528  5.369492  6.945787
2  6.413522  7.938513  5.804749  6.350383  7.381913
3  5.072114  4.117606  3.836384  5.857570  7.091522
print(A * -10)
           0          1          2          3          4
0  -2.436896 -10.763319  -9.503492 -13.738582  -8.680051
1 -27.125677 -10.461590  20.444719   6.305081  -9.457871
2  -4.135220 -19.385126   1.952507  -3.503830 -13.819127
3   9.278864  18.823936  21.636155   1.424302 -10.915223
If you have two compatible (same dimension) numeric DataFrames, you can add, subtract, multiply, and divide them elementwise.
B = pd.DataFrame(np.random.randn(4, 5) + 4)
print(A + B)
          0         1         2         3         4
0  4.018707  5.687454  3.580436  6.782426  5.120555
1  5.253674  6.124992  1.169963  2.695430  5.307602
2  4.217585  5.948985  4.620141  4.040466  4.649786
3  3.890115  2.221573  1.245353  5.396740  5.127738
print(A * B)
          0         1         2         3         4
0  0.919932  4.963098  2.499501  7.430605  3.691235
1  6.892923  5.313267 -6.571821 -2.097031  4.125348
2  1.573064  7.774351 -0.940208  1.292943  4.515915
3 -4.470558 -7.725281 -7.375697 -0.788945  4.405619
If you have a Series with the same number of elements as the number of columns of a DataFrame, you can do arithmetic operations with each element of the Series acting upon each column of the DataFrame.
c = pd.Series([1, 2, 3, 4, 5])
print(A + c)
          0         1         2         3         4
0  1.243690  3.076332  3.950349  5.373858  5.868005
1  3.712568  3.046159  0.955528  3.369492  5.945787
2  1.413522  3.938513  2.804749  4.350383  6.381913
3  0.072114  0.117606  0.836384  3.857570  6.091522
print(A * c)
          0         1         2         3         4
0  0.243690  2.152664  2.851048  5.495433  4.340025
1  2.712568  2.092318 -6.133416 -2.522032  4.728935
2  0.413522  3.877025 -0.585752  1.401532  6.909563
3 -0.927886 -3.764787 -6.490847 -0.569721  5.457611
This idea can be used to standardize a dataset, i.e. make each column have mean 0 and standard deviation 1.
means = A.mean(axis=0)
stds = A.std(axis=0)
(A - means) / stds
          0         1         2         3         4
0 -0.240828  0.318354  1.202702  1.326047 -0.899601
1  1.380224  0.300287 -0.783339 -1.013572 -0.556263
2 -0.129317  0.834602  0.442988  0.131384  1.368838
3 -1.010079 -1.453244 -0.862351 -0.443858  0.087026
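This works because Series/DataFrame arithmetic aligns the Series index with the DataFrame's columns. If you instead want a Series to act along the rows (one value per row), the arithmetic methods such as sub and div take an axis argument. A minimal sketch, continuing with A from above (not part of the original text):
row_means = A.mean(axis=1)                                  # one mean per row
row_stds = A.std(axis=1)                                    # one standard deviation per row
A_row_std = A.sub(row_means, axis=0).div(row_stds, axis=0)  # standardize each row instead of each column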
Concatenation of data sets
Let's create some example data sets
df1 = pd.DataFrame({'A': ['a'+str(i) for i in range(4)],
                    'B': ['b'+str(i) for i in range(4)],
                    'C': ['c'+str(i) for i in range(4)],
                    'D': ['d'+str(i) for i in range(4)]})
df2 = pd.DataFrame({'A': ['a'+str(i) for i in range(4, 8)],
                    'B': ['b'+str(i) for i in range(4, 8)],
                    'C': ['c'+str(i) for i in range(4, 8)],
                    'D': ['d'+str(i) for i in range(4, 8)]})
df3 = pd.DataFrame({'A': ['a'+str(i) for i in range(8, 12)],
                    'B': ['b'+str(i) for i in range(8, 12)],
                    'C': ['c'+str(i) for i in range(8, 12)],
                    'D': ['d'+str(i) for i in range(8, 12)]})
We can concatenate these DataFrame objects by row:
row_concatenate = pd.concat([df1, df2, df3])
print(row_concatenate)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
This stacks the dataframes together. They are literally stacked, as is evidenced by the index values being repeated.
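If you would rather not carry over the original indices, pd.concat can renumber the rows for you. A small addition to the example above:
row_concatenate = pd.concat([df1, df2, df3], ignore_index=True)  # rows are re-indexed 0..11
print(row_concatenate.index)                                     # RangeIndex(start=0, stop=12, step=1)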
The same result can be obtained with the append method (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is preferred in newer code):
df1.append(df2).append(df3)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
Suppose we want to append a new row to df1. Let's create a new row.
new_row = pd.Series(['n1', 'n2', 'n3', 'n4'])
pd.concat([df1, new_row])
     A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0  NaN  NaN  NaN  NaN   n1
1  NaN  NaN  NaN  NaN   n2
2  NaN  NaN  NaN  NaN   n3
3  NaN  NaN  NaN  NaN   n4
That's a lot of missing values. The issue is that new_row has no column labels: pandas treats the Series as a single unnamed column (labeled 0) whose index values become row labels, so it gets added as a new column rather than a new row. The solution is to make it a DataFrame with matching column names.
new_row = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns=['A', 'B', 'C', 'D'])
print(new_row)
    A   B   C   D
0  n1  n2  n3  n4
pd.concat([df1, new_row])
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4
or
df1.append(new_row)
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4
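Another option (a sketch, not from the original text) is to give the Series an index that matches df1's columns and transpose it into a one-row DataFrame:
new_row = pd.Series(['n1', 'n2', 'n3', 'n4'], index=df1.columns, name=4)  # the name becomes the row label
pd.concat([df1, new_row.to_frame().T])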
Adding columns
pd.concat([df1,df2,df3], axis = 1)
    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11
The option axis=1 ensures that concatenation happens by columns. The default value axis=0 concatenates by rows.
Let's play a little game. Let's change the column names of df2 and df3 so they are not the same as df1's.
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'D', 'F', 'H']
pd.concat([df1, df2, df3])
      A    B    C    D    E    F    G    H
0    a0   b0   c0   d0  NaN  NaN  NaN  NaN
1    a1   b1   c1   d1  NaN  NaN  NaN  NaN
2    a2   b2   c2   d2  NaN  NaN  NaN  NaN
3    a3   b3   c3   d3  NaN  NaN  NaN  NaN
0   NaN  NaN  NaN  NaN   a4   b4   c4   d4
1   NaN  NaN  NaN  NaN   a5   b5   c5   d5
2   NaN  NaN  NaN  NaN   a6   b6   c6   d6
3   NaN  NaN  NaN  NaN   a7   b7   c7   d7
0    a8  NaN  NaN   b8  NaN   c8  NaN   d8
1    a9  NaN  NaN   b9  NaN   c9  NaN   d9
2   a10  NaN  NaN  b10  NaN  c10  NaN  d10
3   a11  NaN  NaN  b11  NaN  c11  NaN  d11
Now pandas ensures that all column names are represented in the new data frame, but with missing values where the row and column labels are mismatched. Some of this can be avoided by only joining on common columns. This is done using the join option in concat. The default value is 'outer', which is what you see above.
pd.concat([df1, df3], join = 'inner')
     A    D
0   a0   d0
1   a1   d1
2   a2   d2
3   a3   d3
0   a8   b8
1   a9   b9
2  a10  b10
3  a11  b11
You can do the same thing when concatenating by columns, using axis=1 and join='inner' to keep only rows with matching indices. Reminder: the indices are just labels and happen to be the row numbers by default.
Merging data sets
For this section we'll use a set of data from a survey, also used by Daniel Chen in "Pandas for Everyone"
person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth
print(site)
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
print(survey)
    taken person quant  reading
0     619   dyer   rad     9.82
1     619   dyer   sal     0.13
2     622   dyer   rad     7.80
3     622   dyer   sal     0.09
4     734     pb   rad     8.41
5     734   lake   sal     0.05
6     734     pb  temp   -21.50
7     735     pb   rad     7.22
8     735    NaN   sal     0.06
9     735    NaN  temp   -26.00
10    751     pb   rad     4.35
11    751     pb  temp   -18.50
12    751   lake   sal     0.10
13    752   lake   rad     2.19
14    752   lake   sal     0.09
15    752   lake  temp   -16.00
16    752    roe   sal    41.60
17    837   lake   rad     1.46
18    837   lake   sal     0.21
19    837    roe   sal    22.50
20    844    roe   rad    11.25
print(visited)
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
There are basically four kinds of joins:
pandas | R | SQL | Description |
---|---|---|---|
left | left_join | left outer | keep all rows on the left |
right | right_join | right outer | keep all rows on the right |
outer | outer_join | full outer | keep all rows from both |
inner | inner_join | inner | keep only rows with common keys |
The terms left and right refer to which data set you call first and second, respectively.
We start with a left join
s2v_merge = survey.merge(visited, left_on = 'taken',right_on = 'ident', how = 'left')
print(s2v_merge)
    taken person quant  reading  ident   site       dated
0     619   dyer   rad     9.82    619   DR-1  1927-02-08
1     619   dyer   sal     0.13    619   DR-1  1927-02-08
2     622   dyer   rad     7.80    622   DR-1  1927-02-10
3     622   dyer   sal     0.09    622   DR-1  1927-02-10
4     734     pb   rad     8.41    734   DR-3  1939-01-07
5     734   lake   sal     0.05    734   DR-3  1939-01-07
6     734     pb  temp   -21.50    734   DR-3  1939-01-07
7     735     pb   rad     7.22    735   DR-3  1930-01-12
8     735    NaN   sal     0.06    735   DR-3  1930-01-12
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12
10    751     pb   rad     4.35    751   DR-3  1930-02-26
11    751     pb  temp   -18.50    751   DR-3  1930-02-26
12    751   lake   sal     0.10    751   DR-3  1930-02-26
13    752   lake   rad     2.19    752   DR-3         NaN
14    752   lake   sal     0.09    752   DR-3         NaN
15    752   lake  temp   -16.00    752   DR-3         NaN
16    752    roe   sal    41.60    752   DR-3         NaN
17    837   lake   rad     1.46    837  MSK-4  1932-01-14
18    837   lake   sal     0.21    837  MSK-4  1932-01-14
19    837    roe   sal    22.50    837  MSK-4  1932-01-14
20    844    roe   rad    11.25    844   DR-1  1932-03-22
Here, the left dataset is survey and the right one is visited. Since we're doing a left join, we keep all the rows from survey and add columns from visited, matching on the common key, called "taken" in one dataset and "ident" in the other. Note that the rows of visited are repeated as needed to line up with all the rows with common "taken" values.
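When keys repeat like this, it can be worth checking that the relationship is what you expect; merge has optional validate and indicator arguments for that. A hedged sketch:
s2v_checked = survey.merge(visited, left_on='taken', right_on='ident',
                           how='left', validate='m:1', indicator=True)
# validate='m:1' raises an error if any 'ident' value in visited is duplicated;
# the extra '_merge' column records whether each row matched ('both') or not ('left_only')
print(s2v_checked['_merge'].value_counts())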
We can now add location information, where the common key is the site code
s2v2loc_merge = s2v_merge.merge(site, how='left', left_on='site', right_on='name')
print(s2v2loc_merge)
    taken person quant  reading  ident   site       dated   name    lat    long
0     619   dyer   rad     9.82    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
1     619   dyer   sal     0.13    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
2     622   dyer   rad     7.80    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
3     622   dyer   sal     0.09    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
4     734     pb   rad     8.41    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
5     734   lake   sal     0.05    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
6     734     pb  temp   -21.50    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
7     735     pb   rad     7.22    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
8     735    NaN   sal     0.06    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
10    751     pb   rad     4.35    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
11    751     pb  temp   -18.50    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
12    751   lake   sal     0.10    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
13    752   lake   rad     2.19    752   DR-3         NaN   DR-3 -47.15 -126.72
14    752   lake   sal     0.09    752   DR-3         NaN   DR-3 -47.15 -126.72
15    752   lake  temp   -16.00    752   DR-3         NaN   DR-3 -47.15 -126.72
16    752    roe   sal    41.60    752   DR-3         NaN   DR-3 -47.15 -126.72
17    837   lake   rad     1.46    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
18    837   lake   sal     0.21    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
19    837    roe   sal    22.50    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
20    844    roe   rad    11.25    844   DR-1  1932-03-22   DR-1 -49.85 -128.57
Lastly, we add the person information to this dataset.
merged = s2v2loc_merge.merge(person, how='left', left_on='person', right_on='ident')
print(merged.head())
   taken person quant  reading  ...    long  ident_y  personal   family
0    619   dyer   rad     9.82  ... -128.57     dyer   William     Dyer
1    619   dyer   sal     0.13  ... -128.57     dyer   William     Dyer
2    622   dyer   rad     7.80  ... -128.57     dyer   William     Dyer
3    622   dyer   sal     0.09  ... -128.57     dyer   William     Dyer
4    734     pb   rad     8.41  ... -126.72       pb     Frank  Pabodie

[5 rows x 13 columns]
You can merge based on multiple columns as long as they match up.
ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(ps)
   ident   personal   family  taken person quant  reading
0   dyer    William     Dyer    619   dyer   rad     9.82
1   dyer    William     Dyer    619   dyer   sal     0.13
2   dyer    William     Dyer    622   dyer   rad     7.80
3   dyer    William     Dyer    622   dyer   sal     0.09
4     pb      Frank  Pabodie    734     pb   rad     8.41
5     pb      Frank  Pabodie    734     pb  temp   -21.50
6     pb      Frank  Pabodie    735     pb   rad     7.22
7     pb      Frank  Pabodie    751     pb   rad     4.35
8     pb      Frank  Pabodie    751     pb  temp   -18.50
9   lake   Anderson     Lake    734   lake   sal     0.05
10  lake   Anderson     Lake    751   lake   sal     0.10
11  lake   Anderson     Lake    752   lake   rad     2.19
12  lake   Anderson     Lake    752   lake   sal     0.09
13  lake   Anderson     Lake    752   lake  temp   -16.00
14  lake   Anderson     Lake    837   lake   rad     1.46
15  lake   Anderson     Lake    837   lake   sal     0.21
16   roe  Valentina  Roerich    752    roe   sal    41.60
17   roe  Valentina  Roerich    837    roe   sal    22.50
18   roe  Valentina  Roerich    844    roe   rad    11.25
print(vs)
    ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25
ps_vs = ps.merge(vs, left_on=['ident', 'taken', 'quant', 'reading'],
                 right_on=['person', 'ident', 'quant', 'reading'])  # the keys need to correspond
ps_vs.head()
  ident_x   personal   family  taken_x  ...  site       dated  taken_y person_y
0    dyer    William     Dyer      619  ...  DR-1  1927-02-08      619     dyer
1    dyer    William     Dyer      619  ...  DR-1  1927-02-08      619     dyer
2    dyer    William     Dyer      622  ...  DR-1  1927-02-10      622     dyer
3    dyer    William     Dyer      622  ...  DR-1  1927-02-10      622     dyer
4      pb      Frank  Pabodie      734  ...  DR-3  1939-01-07      734       pb

[5 rows x 12 columns]
Note that since there are common column names, the merge appends _x and _y to denote which column came from the left and right dataset, respectively.
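If you prefer more descriptive names than _x and _y, merge has a suffixes argument. A sketch (the suffix names here are just an example):
ps_vs = ps.merge(vs, left_on=['ident', 'taken', 'quant', 'reading'],
                 right_on=['person', 'ident', 'quant', 'reading'],
                 suffixes=('_person', '_visit'))  # overlapping columns become e.g. ident_person / ident_visit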
Tidy data principles and reshaping datasets
Tidy data is a set of principles espoused by Dr. Hadley Wickham, one of the foremost R developers. It is a way of structuring datasets so that they are easier to analyze on a computer. The basic principles are:
- Each row is an observation
- Each column is a variable
- Each type of observational unit forms a table
Tidy data is tidy in one way; untidy data can be untidy in many ways.
Let's look at some examples.
from glob import glob

filenames = sorted(glob('data/table*.csv'))  # find files matching the pattern; I know there are 6 of them
table1, table2, table3, table4a, table4b, table5 = [pd.read_csv(f) for f in filenames]  # use a list comprehension
This code imports data from 6 files matching a pattern. Python allows multiple assignment on the left of the =, and as each dataset is imported, it gets assigned in order to the variables on the left. In the second line, I sort the file names so that they match the order in which I'm storing them in the third line. The function glob does pattern-matching of file names.
The following tables refer to the number of TB cases and population in Afghanistan, Brazil, and China in 1999 and 2000
print(table1)
       country  year   cases  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583
print(table2)
        country  year        type       count
0   Afghanistan  1999       cases         745
1   Afghanistan  1999  population    19987071
2   Afghanistan  2000       cases        2666
3   Afghanistan  2000  population    20595360
4        Brazil  1999       cases       37737
5        Brazil  1999  population   172006362
6        Brazil  2000       cases       80488
7        Brazil  2000  population   174504898
8         China  1999       cases      212258
9         China  1999  population  1272915272
10        China  2000       cases      213766
11        China  2000  population  1280428583
print(table3)
       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583
print(table4a) # cases
       country    1999    2000
0  Afghanistan     745    2666
1       Brazil   37737   80488
2        China  212258  213766
print(table4b) # population
       country        1999        2000
0  Afghanistan    19987071    20595360
1       Brazil   172006362   174504898
2        China  1272915272  1280428583
print(table5)
       country  century  year               rate
0  Afghanistan       19    99       745/19987071
1  Afghanistan       20     0      2666/20595360
2       Brazil       19    99    37737/172006362
3       Brazil       20     0    80488/174504898
4        China       19    99  212258/1272915272
5        China       20     0  213766/1280428583
Exercise: For each of these datasets, describe whether it is tidy and why or why not.
Melting (unpivoting) data
Melting is the operation of collapsing multiple columns into 2 columns, where one column is formed by the old column names and the other by the corresponding values. Some columns may be kept fixed, and their data are repeated to maintain the interrelationships between the variables.
We'll start with loading some data on income and religion in the US from the Pew Research Center.
pew = pd.read_csv('data/pew.csv')
print(pew.head())
             religion  <$10k  $10-20k  ...  $100-150k  >150k  Don't know/refused
0            Agnostic     27       34  ...        109     84                  96
1             Atheist     12       27  ...         59     74                  76
2            Buddhist     27       21  ...         39     53                  54
3            Catholic    418      617  ...        792    633                1489
4  Don't know/refused     15       14  ...         17     18                 116

[5 rows x 11 columns]
This dataset is considered to be in "wide" format. There are several issues with it, including the fact that the column headers themselves contain data: they are income groups that, by tidy principles, should be values in a column. Our job is to turn this dataset into "long" format with a column for income group.
We will use the function melt to achieve this. It takes a few parameters:
- id_vars is a list of variables that will remain as is
- value_vars is a list of column names that we will melt (or unpivot). By default, it will melt all columns not mentioned in id_vars
- var_name is a string giving the name of the new column created from the headers (default: variable)
- value_name is a string giving the name of the new column created from the values (default: value)
pew_long = pew.melt(id_vars=['religion'], var_name='income_group', value_name='count')
print(pew_long.head())
             religion income_group  count
0            Agnostic        <$10k     27
1             Atheist        <$10k     12
2            Buddhist        <$10k     27
3            Catholic        <$10k    418
4  Don't know/refused        <$10k     15
Separating columns containing multiple variables
We will use an Ebola dataset to illustrate this principle
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.head())
         Date  Day  ...  Deaths_Spain  Deaths_Mali
0    1/5/2015  289  ...           NaN          NaN
1    1/4/2015  288  ...           NaN          NaN
2    1/3/2015  287  ...           NaN          NaN
3    1/2/2015  286  ...           NaN          NaN
4  12/31/2014  284  ...           NaN          NaN

[5 rows x 18 columns]
Note that for each country, we have two columns – one for cases (number infected) and one for deaths. Ideally, we want one column for country, one for cases, and one for deaths.
The first step will be to melt this data set so that the column headers in question form a column and the corresponding data forms a second column.
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
print(ebola_long.head())
         Date  Day      variable   value
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
We now need to split the data in the variable column to make two columns: one will contain the country name and the other either Cases or Deaths. We will use some string manipulation functions, which we will see later, to achieve this.
variable_split = ebola_long['variable'].str.split('_', expand=True)  # split on the '_' character
print(variable_split[:5])
       0       1
0  Cases  Guinea
1  Cases  Guinea
2  Cases  Guinea
3  Cases  Guinea
4  Cases  Guinea
The expand=True option forces the creation of a DataFrame rather than a Series of lists.
type(variable_split)
<class 'pandas.core.frame.DataFrame'>
We can now concatenate this to the original data
variable_split.columns = ['status', 'country']
ebola_parsed = pd.concat([ebola_long, variable_split], axis=1)
ebola_parsed.drop('variable', axis=1, inplace=True)  # remove the column named "variable", modifying ebola_parsed in place
print(ebola_parsed.head())
         Date  Day   value status country
0    1/5/2015  289  2776.0  Cases  Guinea
1    1/4/2015  288  2775.0  Cases  Guinea
2    1/3/2015  287  2769.0  Cases  Guinea
3    1/2/2015  286     NaN  Cases  Guinea
4  12/31/2014  284  2730.0  Cases  Guinea
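A more compact alternative (a sketch, not from the original text) is to assign both new columns directly from the split, without a separate concat step:
ebola_parsed = ebola_long.copy()
ebola_parsed[['status', 'country']] = ebola_long['variable'].str.split('_', expand=True)
ebola_parsed = ebola_parsed.drop(columns='variable')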
Pivot/spread datasets
If we wanted to, we could also make two columns based on cases and deaths, so that for each country and date you could easily read off the cases and the deaths. This is achieved using the pivot_table function.
In the pivot_table syntax, index refers to the columns we don't want to change, columns refers to the column whose values will form the column names of the new columns, and values is the name of the column that will form the values in the pivoted dataset.
ebola_parsed.pivot_table(index = ['Date','Day', 'country'], columns = 'status', values = 'value')
status                     Cases  Deaths
Date     Day country
1/2/2015 286 Liberia      8157.0  3496.0
1/3/2015 287 Guinea       2769.0  1767.0
             Liberia      8166.0  3496.0
             SierraLeone  9722.0  2915.0
1/4/2015 288 Guinea       2775.0  1781.0
...                          ...     ...
9/7/2014 169 Liberia      2081.0  1137.0
             Nigeria        21.0     8.0
             Senegal         3.0     0.0
             SierraLeone  1424.0   524.0
9/9/2014 171 Liberia      2407.0     NaN

[375 rows x 2 columns]
This creates something called a MultiIndex in the pandas DataFrame. This is useful in some advanced cases, but here we just want a normal DataFrame back. We can achieve that by using the reset_index function.
ebola_parsed.pivot_table(index = ['Date','Day','country'], columns = 'status', values = 'value').reset_index()
status        Date  Day      country   Cases  Deaths
0         1/2/2015  286      Liberia  8157.0  3496.0
1         1/3/2015  287       Guinea  2769.0  1767.0
2         1/3/2015  287      Liberia  8166.0  3496.0
3         1/3/2015  287  SierraLeone  9722.0  2915.0
4         1/4/2015  288       Guinea  2775.0  1781.0
..             ...  ...          ...     ...     ...
370       9/7/2014  169      Liberia  2081.0  1137.0
371       9/7/2014  169      Nigeria    21.0     8.0
372       9/7/2014  169      Senegal     3.0     0.0
373       9/7/2014  169  SierraLeone  1424.0   524.0
374       9/9/2014  171      Liberia  2407.0     NaN

[375 rows x 5 columns]
Pivoting is a 2-column to many-column operation, with the number of columns formed depending on the number of unique values present in the column of the original data that is passed to the columns argument of pivot_table.
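Melting and pivoting are essentially inverse operations. A quick sketch of the round trip on the ebola data used above:
# long -> wide -> back: the pivoted table, once its index is reset, can be melted again to recover a long dataset
ebola_wide = ebola_parsed.pivot_table(index=['Date', 'Day', 'country'],
                                      columns='status', values='value').reset_index()
ebola_back = ebola_wide.melt(id_vars=['Date', 'Day', 'country'],
                             value_vars=['Cases', 'Deaths'],
                             var_name='status', value_name='value')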
Exercise: Load the file weather.csv into Python and work on making it a tidy dataset. It requires both melting and pivoting. The dataset comprises the maximum and minimum temperatures recorded each day in 2010, with lots of missing values. Ultimately we want columns for the day of the month, the maximum temperature, and the minimum temperature, along with the location ID, the year, and the month.
Source: Abhijit Dasgupta, https://www.araastat.com/BIOF085/pandas.html#data-transformation
This work is licensed under a Creative Commons Attribution 4.0 License.
Data Aggregation and Split-Apply-Combine
We'll use the Gapminder dataset for this section
df = pd.read_csv('data/gapminder.tsv', sep = '\t') # data is tab-separated, so we use '\t' to specify that
The paradigm we will be exploring is often called split-apply-combine, MapReduce, or grouped aggregation. The basic idea is that you split a data set up by some feature, apply a recipe to each piece, compute the result, and then put the results back together into a dataset. Schematically: split the data into pieces, apply a computation to each piece, then combine the results.
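As a tiny preview of what this looks like in pandas (a made-up example using the groupby function introduced below, not part of the original text):
toy = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                    'value': [1, 2, 10, 20]})
# split by 'group', apply a mean to each piece, combine the results into one Series
toy.groupby('group')['value'].mean()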
pandas is set up for this. It features the groupby function, which performs the "split" part of the operation. We can then apply a function to each part and put the results back together. Let's see how.
df.head()
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
f"This dataset has {len(df['country'].unique())} countries in it"
'This dataset has 142 countries in it'
One of the variables in this dataset is life expectancy at birth, lifeExp. Suppose we want to find the average life expectancy of each country over the period of study.
df.groupby('country')['lifeExp'].mean()
country
Afghanistan           37.478833
Albania               68.432917
Algeria               59.030167
Angola                37.883500
Argentina             69.060417
                        ...
Vietnam               57.479500
West Bank and Gaza    60.328667
Yemen, Rep.           46.780417
Zambia                45.996333
Zimbabwe              52.663167
Name: lifeExp, Length: 142, dtype: float64
So what's going on here? First, we use the groupby function, telling pandas to split the dataset up by values of the column country.
df.groupby('country')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11cfb1b50>
pandas won't show you the actual data but will tell you that it is a grouped DataFrame object. This means that each element of this object is a DataFrame with data from one country.
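You can see this directly by iterating over the grouped object, which yields (group name, DataFrame) pairs. A small sketch:
grouped = df.groupby('country')
for name, group in grouped:
    print(name, group.shape)  # each group is a DataFrame with one row per year for that country
    break                     # just show the first group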
df.groupby('country').ngroups
142
df.groupby('country').get_group('United Kingdom')
             country continent  year  lifeExp       pop     gdpPercap
1596  United Kingdom    Europe  1952   69.180  50430000   9979.508487
1597  United Kingdom    Europe  1957   70.420  51430000  11283.177950
1598  United Kingdom    Europe  1962   70.760  53292000  12477.177070
1599  United Kingdom    Europe  1967   71.360  54959000  14142.850890
1600  United Kingdom    Europe  1972   72.010  56079000  15895.116410
1601  United Kingdom    Europe  1977   72.760  56179000  17428.748460
1602  United Kingdom    Europe  1982   74.040  56339704  18232.424520
1603  United Kingdom    Europe  1987   75.007  56981620  21664.787670
1604  United Kingdom    Europe  1992   76.420  57866349  22705.092540
1605  United Kingdom    Europe  1997   77.218  58808266  26074.531360
1606  United Kingdom    Europe  2002   78.471  59912431  29478.999190
1607  United Kingdom    Europe  2007   79.425  60776238  33203.261280
type(df.groupby('country').get_group('United Kingdom'))
<class 'pandas.core.frame.DataFrame'>
avg_lifeexp_country = df.groupby('country').lifeExp.mean()
avg_lifeexp_country['United Kingdom']
73.92258333333332
df.groupby('country').get_group('United Kingdom').lifeExp.mean()
73.92258333333332
Let's look at whether life expectancy has gone up over time, by continent
df.groupby(['continent','year']).lifeExp.mean()
continent  year
Africa     1952    39.135500
           1957    41.266346
           1962    43.319442
           1967    45.334538
           1972    47.450942
           1977    49.580423
           1982    51.592865
           1987    53.344788
           1992    53.629577
           1997    53.598269
           2002    53.325231
           2007    54.806038
Americas   1952    53.279840
           1957    55.960280
           1962    58.398760
           1967    60.410920
           1972    62.394920
           1977    64.391560
           1982    66.228840
           1987    68.090720
           1992    69.568360
           1997    71.150480
           2002    72.422040
           2007    73.608120
Asia       1952    46.314394
           1957    49.318544
           1962    51.563223
           1967    54.663640
           1972    57.319269
           1977    59.610556
           1982    62.617939
           1987    64.851182
           1992    66.537212
           1997    68.020515
           2002    69.233879
           2007    70.728485
Europe     1952    64.408500
           1957    66.703067
           1962    68.539233
           1967    69.737600
           1972    70.775033
           1977    71.937767
           1982    72.806400
           1987    73.642167
           1992    74.440100
           1997    75.505167
           2002    76.700600
           2007    77.648600
Oceania    1952    69.255000
           1957    70.295000
           1962    71.085000
           1967    71.310000
           1972    71.910000
           1977    72.855000
           1982    74.290000
           1987    75.320000
           1992    76.945000
           1997    78.190000
           2002    79.740000
           2007    80.719500
Name: lifeExp, dtype: float64
avg_lifeexp_continent_yr = df.groupby(['continent', 'year']).lifeExp.mean().reset_index()
avg_lifeexp_continent_yr
   continent  year    lifeExp
0     Africa  1952  39.135500
1     Africa  1957  41.266346
2     Africa  1962  43.319442
3     Africa  1967  45.334538
4     Africa  1972  47.450942
5     Africa  1977  49.580423
6     Africa  1982  51.592865
7     Africa  1987  53.344788
8     Africa  1992  53.629577
9     Africa  1997  53.598269
10    Africa  2002  53.325231
11    Africa  2007  54.806038
12  Americas  1952  53.279840
13  Americas  1957  55.960280
14  Americas  1962  58.398760
15  Americas  1967  60.410920
16  Americas  1972  62.394920
17  Americas  1977  64.391560
18  Americas  1982  66.228840
19  Americas  1987  68.090720
20  Americas  1992  69.568360
21  Americas  1997  71.150480
22  Americas  2002  72.422040
23  Americas  2007  73.608120
24      Asia  1952  46.314394
25      Asia  1957  49.318544
26      Asia  1962  51.563223
27      Asia  1967  54.663640
28      Asia  1972  57.319269
29      Asia  1977  59.610556
30      Asia  1982  62.617939
31      Asia  1987  64.851182
32      Asia  1992  66.537212
33      Asia  1997  68.020515
34      Asia  2002  69.233879
35      Asia  2007  70.728485
36    Europe  1952  64.408500
37    Europe  1957  66.703067
38    Europe  1962  68.539233
39    Europe  1967  69.737600
40    Europe  1972  70.775033
41    Europe  1977  71.937767
42    Europe  1982  72.806400
43    Europe  1987  73.642167
44    Europe  1992  74.440100
45    Europe  1997  75.505167
46    Europe  2002  76.700600
47    Europe  2007  77.648600
48   Oceania  1952  69.255000
49   Oceania  1957  70.295000
50   Oceania  1962  71.085000
51   Oceania  1967  71.310000
52   Oceania  1972  71.910000
53   Oceania  1977  72.855000
54   Oceania  1982  74.290000
55   Oceania  1987  75.320000
56   Oceania  1992  76.945000
57   Oceania  1997  78.190000
58   Oceania  2002  79.740000
59   Oceania  2007  80.719500
type(avg_lifeexp_continent_yr)
<class 'pandas.core.frame.DataFrame'>
The aggregation function, in this case mean, does both the "apply" and "combine" parts of the process.
We can do quick aggregations with pandas
df.groupby('continent').lifeExp.describe()
           count       mean        std  ...      50%       75%     max
continent                               ...
Africa     624.0  48.865330   9.150210  ...  47.7920  54.41150  76.442
Americas   300.0  64.658737   9.345088  ...  67.0480  71.69950  80.653
Asia       396.0  60.064903  11.864532  ...  61.7915  69.50525  82.603
Europe     360.0  71.903686   5.433178  ...  72.2410  75.45050  81.757
Oceania     24.0  74.326208   3.795611  ...  73.6650  77.55250  81.235

[5 rows x 8 columns]
df.groupby('continent').nth(10) # Tenth observation in each group
               country  year  lifeExp       pop     gdpPercap
continent
Africa         Algeria  2002   70.994  31287142   5288.040382
Americas     Argentina  2002   74.340  38331121   8797.640716
Asia       Afghanistan  2002   42.129  25268405    726.734055
Europe         Albania  2002   75.651   3508512   4604.211737
Oceania      Australia  2002   80.370  19546792  30687.754730
You can also use functions from other modules, or your own functions in this aggregation work.
df.groupby('continent').lifeExp.agg(np.mean)
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
def my_mean(values):
    n = len(values)
    sum = 0
    for value in values:
        sum += value
    return sum / n

df.groupby('continent').lifeExp.agg(my_mean)
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
You can do many functions at once
df.groupby('year').lifeExp.agg([np.count_nonzero, np.mean, np.std])
      count_nonzero       mean        std
year
1952          142.0  49.057620  12.225956
1957          142.0  51.507401  12.231286
1962          142.0  53.609249  12.097245
1967          142.0  55.678290  11.718858
1972          142.0  57.647386  11.381953
1977          142.0  59.570157  11.227229
1982          142.0  61.533197  10.770618
1987          142.0  63.212613  10.556285
1992          142.0  64.160338  11.227380
1997          142.0  65.014676  11.559439
2002          142.0  65.694923  12.279823
2007          142.0  67.007423  12.073021
You can also aggregate different columns at the same time by passing a dict to the agg function.
df.groupby('year').agg({'lifeExp': np.mean,'pop': np.median,'gdpPercap': np.median}).reset_index()
    year    lifeExp         pop    gdpPercap
0   1952  49.057620   3943953.0  1968.528344
1   1957  51.507401   4282942.0  2173.220291
2   1962  53.609249   4686039.5  2335.439533
3   1967  55.678290   5170175.5  2678.334741
4   1972  57.647386   5877996.5  3339.129407
5   1977  59.570157   6404036.5  3798.609244
6   1982  61.533197   7007320.0  4216.228428
7   1987  63.212613   7774861.5  4280.300366
8   1992  64.160338   8688686.5  4386.085502
9   1997  65.014676   9735063.5  4781.825478
10  2002  65.694923  10372918.5  5319.804524
11  2007  67.007423  10517531.0  6124.371109
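Recent versions of pandas (0.25 and later) also support "named aggregation", which lets you name the output columns as you aggregate. A hedged sketch of the same computation:
df.groupby('year').agg(mean_lifeExp=('lifeExp', 'mean'),
                       median_pop=('pop', 'median'),
                       median_gdp=('gdpPercap', 'median')).reset_index()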
Transformation
You can do grouped transformations using this same method. We will compute the z-score for each year, i.e. we will subtract the average life expectancy and divide by the standard deviation.
def my_zscore(values):
    m = np.mean(values)
    s = np.std(values)
    return (values - m) / s
df.groupby('year').lifeExp.transform(my_zscore)
0      -1.662719
1      -1.737377
2      -1.792867
3      -1.854699
4      -1.900878
          ...
1699   -0.081910
1700   -0.338167
1701   -1.580537
1702   -2.100756
1703   -1.955077
Name: lifeExp, Length: 1704, dtype: float64
df['lifeExp_z'] = df.groupby('year').lifeExp.transform(my_zscore)
df.groupby('year').lifeExp_z.mean()
year
1952   -5.165078e-16
1957    2.902608e-17
1962    2.404180e-16
1967   -6.108181e-16
1972    1.784566e-16
1977   -9.456442e-16
1982   -1.623310e-16
1987    6.687725e-16
1992    5.457293e-16
1997    8.787963e-16
2002    5.254013e-16
2007    4.925637e-16
Name: lifeExp_z, dtype: float64
Filter
We can split the dataset by values of one variable and filter out those splits that fail some criterion. The following code only keeps countries with a population of more than 10 million at some point during the study period
df.groupby('country').filter(lambda d: d['pop'].max() > 10000000)
          country continent  year  lifeExp       pop   gdpPercap  lifeExp_z
0     Afghanistan      Asia  1952   28.801   8425333  779.445314  -1.662719
1     Afghanistan      Asia  1957   30.332   9240934  820.853030  -1.737377
2     Afghanistan      Asia  1962   31.997  10267083  853.100710  -1.792867
3     Afghanistan      Asia  1967   34.020  11537966  836.197138  -1.854699
4     Afghanistan      Asia  1972   36.088  13079460  739.981106  -1.900878
...           ...       ...   ...      ...       ...         ...        ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306  -0.081910
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786  -0.338167
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960  -1.580537
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623  -2.100756
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298  -1.955077

[924 rows x 7 columns]