With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If the data are numeric, you can think of a dataframe or series much like a numpy array, and arithmetic operations obey rules similar to numpy's. In addition, pandas provides database-style operations such as merge, join, and concatenate. Study these sections to practice and get an introduction to some basic pandas operations.
Data Transformation
Arithmetic operations
If you have a Series or DataFrame that is all numeric, you can add or multiply a single number to all the elements.
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randn(4,5))
print(A)
          0         1         2         3         4
0  0.243690  1.076332  0.950349  1.373858  0.868005
1  2.712568  1.046159 -2.044472 -0.630508  0.945787
2  0.413522  1.938513 -0.195251  0.350383  1.381913
3 -0.927886 -1.882394 -2.163616 -0.142430  1.091522
print(A + 6)
          0         1         2         3         4
0  6.243690  7.076332  6.950349  7.373858  6.868005
1  8.712568  7.046159  3.955528  5.369492  6.945787
2  6.413522  7.938513  5.804749  6.350383  7.381913
3  5.072114  4.117606  3.836384  5.857570  7.091522
print(A * -10)
           0          1          2          3          4
0  -2.436896 -10.763319  -9.503492 -13.738582  -8.680051
1 -27.125677 -10.461590  20.444719   6.305081  -9.457871
2  -4.135220 -19.385126   1.952507  -3.503830 -13.819127
3   9.278864  18.823936  21.636155   1.424302 -10.915223
If you have two compatible (same dimension) numeric DataFrames, you can add, subtract, multiply and divide elementwise.
B = pd.DataFrame(np.random.randn(4,5) + 4)
print(A + B)
          0         1         2         3         4
0  4.018707  5.687454  3.580436  6.782426  5.120555
1  5.253674  6.124992  1.169963  2.695430  5.307602
2  4.217585  5.948985  4.620141  4.040466  4.649786
3  3.890115  2.221573  1.245353  5.396740  5.127738
print(A * B)
          0         1         2         3         4
0  0.919932  4.963098  2.499501  7.430605  3.691235
1  6.892923  5.313267 -6.571821 -2.097031  4.125348
2  1.573064  7.774351 -0.940208  1.292943  4.515915
3 -4.470558 -7.725281 -7.375697 -0.788945  4.405619
If you have a Series with the same number of elements as the number of columns of a DataFrame, you can do arithmetic operations, with each element of the Series acting upon each column of the DataFrame.
c = pd.Series([1,2,3,4,5])
print(A + c)
          0         1         2         3         4
0  1.243690  3.076332  3.950349  5.373858  5.868005
1  3.712568  3.046159  0.955528  3.369492  5.945787
2  1.413522  3.938513  2.804749  4.350383  6.381913
3  0.072114  0.117606  0.836384  3.857570  6.091522
print(A * c)
          0         1         2         3         4
0  0.243690  2.152664  2.851048  5.495433  4.340025
1  2.712568  2.092318 -6.133416 -2.522032  4.728935
2  0.413522  3.877025 -0.585752  1.401532  6.909563
3 -0.927886 -3.764787 -6.490847 -0.569721  5.457611
This idea can be used to standardize a dataset, i.e. make each column have mean 0 and standard deviation 1.
means = A.mean(axis=0)
stds = A.std(axis = 0)
(A - means)/stds
          0         1         2         3         4
0 -0.240828  0.318354  1.202702  1.326047 -0.899601
1  1.380224  0.300287 -0.783339 -1.013572 -0.556263
2 -0.129317  0.834602  0.442988  0.131384  1.368838
3 -1.010079 -1.453244 -0.862351 -0.443858  0.087026
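As a quick sanity check (not part of the original output, and using only the quantities computed above), the standardized columns should now have mean approximately 0 and standard deviation 1.
A_std = (A - means)/stds
print(A_std.mean(axis=0))  # each column mean should be (numerically) 0
print(A_std.std(axis=0))   # each column standard deviation should be 1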
Concatenation of data sets
Let's create some example data sets
df1 = pd.DataFrame({'A': ['a'+str(i) for i in range(4)],
                    'B': ['b'+str(i) for i in range(4)],
                    'C': ['c'+str(i) for i in range(4)],
                    'D': ['d'+str(i) for i in range(4)]})
df2 = pd.DataFrame({'A': ['a'+str(i) for i in range(4,8)],
                    'B': ['b'+str(i) for i in range(4,8)],
                    'C': ['c'+str(i) for i in range(4,8)],
                    'D': ['d'+str(i) for i in range(4,8)]})
df3 = pd.DataFrame({'A': ['a'+str(i) for i in range(8,12)],
                    'B': ['b'+str(i) for i in range(8,12)],
                    'C': ['c'+str(i) for i in range(8,12)],
                    'D': ['d'+str(i) for i in range(8,12)]})
We can concatenate these DataFrame objects by row.
row_concatenate = pd.concat([df1, df2, df3])
print(row_concatenate)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
This stacks the dataframes together. They are literally stacked, as is evidenced by the index values being repeated.
This same exercise can be done with the append function.
df1.append(df2).append(df3)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
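If the repeated index values bother you, concat can relabel the rows with a fresh index running from 0 to 11. This is a small aside, not part of the original walkthrough, using the ignore_index argument of concat.
row_concat_reindexed = pd.concat([df1, df2, df3], ignore_index=True)
print(row_concat_reindexed.index)  # a RangeIndex from 0 to 11 instead of repeated 0-3 blocks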
Suppose we want to append a new row to df1. Let's create a new row.
new_row = pd.Series(['n1','n2','n3','n4'])
pd.concat([df1, new_row])
     A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0  NaN  NaN  NaN  NaN   n1
1  NaN  NaN  NaN  NaN   n2
2  NaN  NaN  NaN  NaN   n3
3  NaN  NaN  NaN  NaN   n4
That's a lot of missing values. The issue is that new_row has no column names, so pandas doesn't know which of its values belongs to which column of df1; it therefore treats the whole Series as a single new column (labeled 0 here) and fills everything else with missing values. The solution is to make it a DataFrame with matching column names.
new_row = pd.DataFrame([['n1','n2','n3','n4']], columns = ['A','B','C','D'])
print(new_row)
    A   B   C   D
0  n1  n2  n3  n4
pd.concat([df1, new_row])
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4
or
df1.append(new_row)
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4
Adding columns
pd.concat([df1,df2,df3], axis = 1)
    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11
The option axis=1 ensures that concatenation happens by columns. The default value axis = 0 concatenates by rows.
Let's play a little game. Let's change the column names of df2 and df3 so they are not the same as df1's.
df2.columns = ['E','F','G','H']
df3.columns = ['A','D','F','H']
pd.concat([df1,df2,df3])
     A    B    C    D    E    F    G    H
0   a0   b0   c0   d0  NaN  NaN  NaN  NaN
1   a1   b1   c1   d1  NaN  NaN  NaN  NaN
2   a2   b2   c2   d2  NaN  NaN  NaN  NaN
3   a3   b3   c3   d3  NaN  NaN  NaN  NaN
0  NaN  NaN  NaN  NaN   a4   b4   c4   d4
1  NaN  NaN  NaN  NaN   a5   b5   c5   d5
2  NaN  NaN  NaN  NaN   a6   b6   c6   d6
3  NaN  NaN  NaN  NaN   a7   b7   c7   d7
0   a8  NaN  NaN   b8  NaN   c8  NaN   d8
1   a9  NaN  NaN   b9  NaN   c9  NaN   d9
2  a10  NaN  NaN  b10  NaN  c10  NaN  d10
3  a11  NaN  NaN  b11  NaN  c11  NaN  d11
Now pandas ensures that all column names are represented in the new data frame, but with missing values where the row indices and column indices are mismatched. Some of this can be avoided by only joining on common columns. This is done using the join option in concat. The default value is 'outer', which is what you see above.
pd.concat([df1, df3], join = 'inner')
     A    D
0   a0   d0
1   a1   d1
2   a2   d2
3   a3   d3
0   a8   b8
1   a9   b9
2  a10  b10
3  a11  b11
You can do the same thing when concatenating by columns, using axis = 1 and join="inner" to keep only the rows with matching indices. Reminder that the indices are just labels and happen to be the row numbers by default.
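As a small illustrative sketch (not from the original notes): if we keep only the first two rows of df3, so its row labels are 0 and 1, then an inner concatenation along columns keeps just those shared labels.
pd.concat([df1, df3.iloc[:2]], axis = 1, join = 'inner')  # only row labels 0 and 1 appear in both, so only they are kept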
Merging data sets
For this section we'll use a set of data from a survey, also used by Daniel Chen in "Pandas for Everyone".
person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth
print(site)
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
print(survey)
    taken person quant  reading
0     619   dyer   rad     9.82
1     619   dyer   sal     0.13
2     622   dyer   rad     7.80
3     622   dyer   sal     0.09
4     734     pb   rad     8.41
5     734   lake   sal     0.05
6     734     pb  temp   -21.50
7     735     pb   rad     7.22
8     735    NaN   sal     0.06
9     735    NaN  temp   -26.00
10    751     pb   rad     4.35
11    751     pb  temp   -18.50
12    751   lake   sal     0.10
13    752   lake   rad     2.19
14    752   lake   sal     0.09
15    752   lake  temp   -16.00
16    752    roe   sal    41.60
17    837   lake   rad     1.46
18    837   lake   sal     0.21
19    837    roe   sal    22.50
20    844    roe   rad    11.25
print(visited)
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
There are basically four kinds of joins:
| pandas | R          | SQL         | Description                     |
|--------|------------|-------------|---------------------------------|
| left   | left_join  | left outer  | keep all rows on the left       |
| right  | right_join | right outer | keep all rows on the right      |
| outer  | outer_join | full outer  | keep all rows from both         |
| inner  | inner_join | inner       | keep only rows with common keys |
The terms left and right refer to which data set you call first and second, respectively.
We start with a left join.
s2v_merge = survey.merge(visited, left_on = 'taken', right_on = 'ident', how = 'left')
print(s2v_merge)
    taken person quant  reading  ident   site       dated
0     619   dyer   rad     9.82    619   DR-1  1927-02-08
1     619   dyer   sal     0.13    619   DR-1  1927-02-08
2     622   dyer   rad     7.80    622   DR-1  1927-02-10
3     622   dyer   sal     0.09    622   DR-1  1927-02-10
4     734     pb   rad     8.41    734   DR-3  1939-01-07
5     734   lake   sal     0.05    734   DR-3  1939-01-07
6     734     pb  temp   -21.50    734   DR-3  1939-01-07
7     735     pb   rad     7.22    735   DR-3  1930-01-12
8     735    NaN   sal     0.06    735   DR-3  1930-01-12
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12
10    751     pb   rad     4.35    751   DR-3  1930-02-26
11    751     pb  temp   -18.50    751   DR-3  1930-02-26
12    751   lake   sal     0.10    751   DR-3  1930-02-26
13    752   lake   rad     2.19    752   DR-3         NaN
14    752   lake   sal     0.09    752   DR-3         NaN
15    752   lake  temp   -16.00    752   DR-3         NaN
16    752    roe   sal    41.60    752   DR-3         NaN
17    837   lake   rad     1.46    837  MSK-4  1932-01-14
18    837   lake   sal     0.21    837  MSK-4  1932-01-14
19    837    roe   sal    22.50    837  MSK-4  1932-01-14
20    844    roe   rad    11.25    844   DR-1  1932-03-22
Here, the left dataset is survey and the right one is visited. Since we're doing a left join, we keep all the rows from survey and add columns from visited, matching on the common key, called "taken" in one dataset and "ident" in the other. Note that the rows of visited are repeated as needed to line up with all the rows with common "taken" values.
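As an aside (not part of the original walkthrough), merge can also tag each row with where it came from via the indicator argument, which is handy for checking whether a left join failed to find matches for some rows.
check = survey.merge(visited, left_on = 'taken', right_on = 'ident', how = 'left', indicator = True)
print(check['_merge'].value_counts())  # counts of 'both' vs 'left_only' show how many rows found a match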
We can now add location information, where the common key is the site code.
s2v2loc_merge = s2v_merge.merge(site, how = 'left', left_on = 'site', right_on = 'name')
print(s2v2loc_merge)
    taken person quant  reading  ident   site       dated   name    lat    long
0     619   dyer   rad     9.82    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
1     619   dyer   sal     0.13    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
2     622   dyer   rad     7.80    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
3     622   dyer   sal     0.09    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
4     734     pb   rad     8.41    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
5     734   lake   sal     0.05    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
6     734     pb  temp   -21.50    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
7     735     pb   rad     7.22    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
8     735    NaN   sal     0.06    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
10    751     pb   rad     4.35    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
11    751     pb  temp   -18.50    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
12    751   lake   sal     0.10    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
13    752   lake   rad     2.19    752   DR-3         NaN   DR-3 -47.15 -126.72
14    752   lake   sal     0.09    752   DR-3         NaN   DR-3 -47.15 -126.72
15    752   lake  temp   -16.00    752   DR-3         NaN   DR-3 -47.15 -126.72
16    752    roe   sal    41.60    752   DR-3         NaN   DR-3 -47.15 -126.72
17    837   lake   rad     1.46    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
18    837   lake   sal     0.21    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
19    837    roe   sal    22.50    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
20    844    roe   rad    11.25    844   DR-1  1932-03-22   DR-1 -49.85 -128.57
Lastly, we add the person information to this dataset.
merged = s2v2loc_merge.merge(person, how = 'left', left_on = 'person', right_on = 'ident')
print(merged.head())
   taken person quant  reading  ...    long ident_y personal   family
0    619   dyer   rad     9.82  ... -128.57    dyer  William     Dyer
1    619   dyer   sal     0.13  ... -128.57    dyer  William     Dyer
2    622   dyer   rad     7.80  ... -128.57    dyer  William     Dyer
3    622   dyer   sal     0.09  ... -128.57    dyer  William     Dyer
4    734     pb   rad     8.41  ... -126.72      pb    Frank  Pabodie

[5 rows x 13 columns]
You can merge based on multiple columns as long as they match up.
ps = person.merge(survey, left_on = 'ident', right_on = 'person')
vs = visited.merge(survey, left_on = 'ident', right_on = 'taken')
print(ps)
   ident   personal   family  taken person quant  reading
0   dyer    William     Dyer    619   dyer   rad     9.82
1   dyer    William     Dyer    619   dyer   sal     0.13
2   dyer    William     Dyer    622   dyer   rad     7.80
3   dyer    William     Dyer    622   dyer   sal     0.09
4     pb      Frank  Pabodie    734     pb   rad     8.41
5     pb      Frank  Pabodie    734     pb  temp   -21.50
6     pb      Frank  Pabodie    735     pb   rad     7.22
7     pb      Frank  Pabodie    751     pb   rad     4.35
8     pb      Frank  Pabodie    751     pb  temp   -18.50
9   lake   Anderson     Lake    734   lake   sal     0.05
10  lake   Anderson     Lake    751   lake   sal     0.10
11  lake   Anderson     Lake    752   lake   rad     2.19
12  lake   Anderson     Lake    752   lake   sal     0.09
13  lake   Anderson     Lake    752   lake  temp   -16.00
14  lake   Anderson     Lake    837   lake   rad     1.46
15  lake   Anderson     Lake    837   lake   sal     0.21
16   roe  Valentina  Roerich    752    roe   sal    41.60
17   roe  Valentina  Roerich    837    roe   sal    22.50
18   roe  Valentina  Roerich    844    roe   rad    11.25
print(vs)
    ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25
ps_vs = ps.merge(vs, left_on = ['ident','taken', 'quant','reading'],
                 right_on = ['person','ident','quant','reading'])  # The keys need to correspond
ps_vs.head()
  ident_x personal   family  taken_x  ...  site       dated  taken_y person_y
0    dyer  William     Dyer      619  ...  DR-1  1927-02-08      619     dyer
1    dyer  William     Dyer      619  ...  DR-1  1927-02-08      619     dyer
2    dyer  William     Dyer      622  ...  DR-1  1927-02-10      622     dyer
3    dyer  William     Dyer      622  ...  DR-1  1927-02-10      622     dyer
4      pb    Frank  Pabodie      734  ...  DR-3  1939-01-07      734       pb

[5 rows x 12 columns]
Note that since there are common column names, the merge appends _x and _y to them to denote which column came from the left and right datasets, respectively.
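If you prefer more descriptive names than _x and _y, merge accepts a suffixes argument. A small sketch (the suffix strings here are just illustrative choices):
merged2 = s2v2loc_merge.merge(person, how = 'left', left_on = 'person', right_on = 'ident',
                              suffixes = ('_survey', '_person'))
print(merged2.columns)  # overlapping names now end in _survey or _person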
Tidy data principles and reshaping datasets
Tidy data is a set of principles espoused by Dr. Hadley Wickham, one of the foremost R developers. It is a way of structuring datasets so that they are more easily analyzed on computers. The basic principles are
- Each row is an observation
- Each column is a variable
- Each type of observational unit forms a table
Tidy data is tidy in one way. Untidy data can be untidy in many ways.
Let's look at some examples.
from glob import glob

filenames = sorted(glob('data/table*.csv'))  # find files matching the pattern; I know there are 6 of them
table1, table2, table3, table4a, table4b, table5 = [pd.read_csv(f) for f in filenames]  # use a list comprehension
This code imports data from 6 files matching a pattern. Python allows multiple assignments on the left of the =, and as each dataset is imported, it gets assigned in order to the variables on the left. In the second line, I sort the file names so that they match the order in which I'm storing them in the third line. The function glob does pattern-matching of file names.
The following tables refer to the number of TB cases and population in Afghanistan, Brazil, and China in 1999 and 2000.
print(table1)
       country  year   cases  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583
print(table2)
        country  year        type       count
0   Afghanistan  1999       cases         745
1   Afghanistan  1999  population    19987071
2   Afghanistan  2000       cases        2666
3   Afghanistan  2000  population    20595360
4        Brazil  1999       cases       37737
5        Brazil  1999  population   172006362
6        Brazil  2000       cases       80488
7        Brazil  2000  population   174504898
8         China  1999       cases      212258
9         China  1999  population  1272915272
10        China  2000       cases      213766
11        China  2000  population  1280428583
print(table3)
       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583
print(table4a) # cases
       country    1999    2000
0  Afghanistan     745    2666
1       Brazil   37737   80488
2        China  212258  213766
print(table4b) # population
       country        1999        2000
0  Afghanistan    19987071    20595360
1       Brazil   172006362   174504898
2        China  1272915272  1280428583
print(table5)
       country  century  year               rate
0  Afghanistan       19    99       745/19987071
1  Afghanistan       20     0      2666/20595360
2       Brazil       19    99    37737/172006362
3       Brazil       20     0    80488/174504898
4        China       19    99  212258/1272915272
5        China       20     0  213766/1280428583
Exercise: For each of these datasets, describe whether it is tidy, and why or why not.
Melting (unpivoting) data
Melting is the operation of collapsing multiple columns into 2 columns, where one column is formed by the old column names and the other by the corresponding values. Some columns may be kept fixed, and their data are repeated to maintain the interrelationships between the variables.
We'll start with loading some data on income and religion in the US from the Pew Research Center.
pew = pd.read_csv('data/pew.csv')
print(pew.head())
             religion  <$10k  $10-20k  ...  $100-150k  >150k  Don't know/refused
0            Agnostic     27       34  ...        109     84                  96
1             Atheist     12       27  ...         59     74                  76
2            Buddhist     27       21  ...         39     53                  54
3            Catholic    418      617  ...        792    633                1489
4  Don't know/refused     15       14  ...         17     18                 116

[5 rows x 11 columns]
This dataset is considered to be in "wide" format. There are several issues with it, including the fact that the column headers contain data: they are income groups, which by tidy principles should be the values of a column. Our job is to turn this dataset into "long" format with a column for income group.
We will use the function melt to achieve this. It takes a few parameters:
- id_vars is a list of variables that will remain as is
- value_vars is a list of column names that we will melt (or unpivot); by default, it will melt all columns not mentioned in id_vars
- var_name is a string giving the name of the new column created from the headers (default: variable)
- value_name is a string giving the name of the new column created from the values (default: value)
pew_long = pew.melt(id_vars = ['religion'], var_name = 'income_group', value_name = 'count')
print(pew_long.head())
             religion income_group  count
0            Agnostic        <$10k     27
1             Atheist        <$10k     12
2            Buddhist        <$10k     27
3            Catholic        <$10k    418
4  Don't know/refused        <$10k     15
Separating columns containing multiple variables
We will use an Ebola dataset to illustrate this principle.
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.head())
         Date  Day  ...  Deaths_Spain  Deaths_Mali
0    1/5/2015  289  ...           NaN          NaN
1    1/4/2015  288  ...           NaN          NaN
2    1/3/2015  287  ...           NaN          NaN
3    1/2/2015  286  ...           NaN          NaN
4  12/31/2014  284  ...           NaN          NaN

[5 rows x 18 columns]
Note that for each country, we have two columns – one for cases (number infected) and one for deaths. Ideally, we want one column for country, one for cases, and one for deaths.
The first step will be to melt this data set so that the column headers in question form a column and the corresponding data forms a second column.
ebola_long = ebola.melt(id_vars = ['Date','Day'])
print(ebola_long.head())
         Date  Day      variable   value
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
We now need to split the data in the variable column to make two columns: one will contain the country name and the other either Cases or Deaths. We will use some string manipulation functions, which we will see later, to achieve this.
variable_split = ebola_long['variable'].str.split('_', expand=True)  # split on the '_' character
print(variable_split[:5])
       0       1
0  Cases  Guinea
1  Cases  Guinea
2  Cases  Guinea
3  Cases  Guinea
4  Cases  Guinea
The expand=True option forces the creation of a DataFrame rather than a Series of lists.
type(variable_split)
<class 'pandas.core.frame.DataFrame'>
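For comparison (a small aside, not in the original), without expand=True the result is a Series whose entries are Python lists:
splits_as_lists = ebola_long['variable'].str.split('_')  # no expand
print(splits_as_lists.head())  # each entry is a list like ['Cases', 'Guinea']
print(type(splits_as_lists))   # <class 'pandas.core.series.Series'>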
We can now concatenate this to the original data
variable_split.columns = ['status','country']
ebola_parsed = pd.concat([ebola_long, variable_split], axis = 1)
ebola_parsed.drop('variable', axis = 1, inplace=True)  # remove the column named "variable", modifying ebola_parsed in place
print(ebola_parsed.head())
         Date  Day   value status country
0    1/5/2015  289  2776.0  Cases  Guinea
1    1/4/2015  288  2775.0  Cases  Guinea
2    1/3/2015  287  2769.0  Cases  Guinea
3    1/2/2015  286     NaN  Cases  Guinea
4  12/31/2014  284  2730.0  Cases  Guinea
Pivot/spread datasets
If we wanted to, we could also make two columns based on cases and
deaths, so for each country and date, you could easily read off the cases
and deaths. This is achieved using the pivot_table
function.
In the pivot_table syntax, index refers to the columns we don't want to change, columns refers to the column whose values will form the column names of the new columns, and values is the name of the column that will form the values in the pivoted dataset.
ebola_parsed.pivot_table(index = ['Date','Day', 'country'], columns = 'status', values = 'value')
status                    Cases  Deaths
Date     Day country
1/2/2015 286 Liberia     8157.0  3496.0
1/3/2015 287 Guinea      2769.0  1767.0
             Liberia     8166.0  3496.0
             SierraLeone 9722.0  2915.0
1/4/2015 288 Guinea      2775.0  1781.0
...                         ...     ...
9/7/2014 169 Liberia     2081.0  1137.0
             Nigeria       21.0     8.0
             Senegal        3.0     0.0
             SierraLeone 1424.0   524.0
9/9/2014 171 Liberia     2407.0     NaN

[375 rows x 2 columns]
This creates something called a MultiIndex in the pandas DataFrame. This is useful in some advanced cases, but here we just want a normal DataFrame back. We can achieve that by using the reset_index function.
ebola_parsed.pivot_table(index = ['Date','Day','country'], columns = 'status', values = 'value').reset_index()
status      Date  Day      country   Cases  Deaths
0       1/2/2015  286      Liberia  8157.0  3496.0
1       1/3/2015  287       Guinea  2769.0  1767.0
2       1/3/2015  287      Liberia  8166.0  3496.0
3       1/3/2015  287  SierraLeone  9722.0  2915.0
4       1/4/2015  288       Guinea  2775.0  1781.0
..           ...  ...          ...     ...     ...
370     9/7/2014  169      Liberia  2081.0  1137.0
371     9/7/2014  169      Nigeria    21.0     8.0
372     9/7/2014  169      Senegal     3.0     0.0
373     9/7/2014  169  SierraLeone  1424.0   524.0
374     9/9/2014  171      Liberia  2407.0     NaN

[375 rows x 5 columns]
Pivoting is a 2-column to many-column operation, with the number of columns formed depending on the number of unique values present in the column of the original data that is entered into the columns argument of pivot_table.
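Concretely, the number of new value columns here equals the number of distinct entries in the status column, which we can check directly from the data built above:
print(ebola_parsed['status'].unique())   # ['Cases' 'Deaths']
print(ebola_parsed['status'].nunique())  # 2, so the pivoted table has 2 value columns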
Exercise: Load the file weather.csv into Python and work on making it a tidy dataset. It requires melting and pivoting. The dataset comprises the maximum and minimum temperatures recorded each day in 2010. There are lots of missing values. Ultimately we want columns for days of the month, maximum temperature, and minimum temperature, along with the location ID, the year, and the month.
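As a hint (not a full solution), one possible skeleton is sketched below. It assumes the file follows the usual layout of this dataset, with identifier columns such as id, year, month, and element and one column per day of the month; check the actual column names after loading and adjust accordingly.
weather = pd.read_csv('data/weather.csv')
print(weather.columns)  # inspect the real column names first

# Step 1: melt the day-of-month columns into a single 'day' column
# (the id_vars below are assumptions about the layout)
weather_long = weather.melt(id_vars = ['id', 'year', 'month', 'element'],
                            var_name = 'day', value_name = 'temp')

# Step 2: pivot the element column (e.g. tmax/tmin) back out into separate value columns
weather_tidy = weather_long.pivot_table(index = ['id', 'year', 'month', 'day'],
                                        columns = 'element', values = 'temp').reset_index()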
Source: Abhijit Dasgupta, https://www.araastat.com/BIOF085/pandas.html#data-transformation
This work is licensed under a Creative Commons Attribution 4.0 License.