pandas Data Structures

With the basics of pandas DataFrames and Series in hand, we can now begin to operate on these data structures. If the data are numerical, you can treat a DataFrame or Series much like an array, with arithmetic operations obeying rules similar to those of numpy. In addition, pandas provides database-style operations such as merge, join, and concatenate. Study these sections to practice and get an introduction to some basic pandas operations.

Data Transformation

Arithmetic operations

If you have a Series or DataFrame that is all numeric, you can add or multiply a single number across all the elements at once.

import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randn(4,5))
print(A)
          0         1         2         3         4
0  0.243690  1.076332  0.950349  1.373858  0.868005
1  2.712568  1.046159 -2.044472 -0.630508  0.945787
2  0.413522  1.938513 -0.195251  0.350383  1.381913
3 -0.927886 -1.882394 -2.163616 -0.142430  1.091522
print(A + 6)
          0         1         2         3         4
0  6.243690  7.076332  6.950349  7.373858  6.868005
1  8.712568  7.046159  3.955528  5.369492  6.945787
2  6.413522  7.938513  5.804749  6.350383  7.381913
3  5.072114  4.117606  3.836384  5.857570  7.091522
print(A * -10)
           0          1          2          3          4
0  -2.436896 -10.763319  -9.503492 -13.738582  -8.680051
1 -27.125677 -10.461590  20.444719   6.305081  -9.457871
2  -4.135220 -19.385126   1.952507  -3.503830 -13.819127
3   9.278864  18.823936  21.636155   1.424302 -10.915223

If you have two compatible (same shape) numeric DataFrames, you can add, subtract, multiply, and divide elementwise.

B = pd.DataFrame(np.random.randn(4,5) + 4)
print(A + B)
           0         1         2         3         4
0  4.018707  5.687454  3.580436  6.782426  5.120555
1  5.253674  6.124992  1.169963  2.695430  5.307602
2  4.217585  5.948985  4.620141  4.040466  4.649786
3  3.890115  2.221573  1.245353  5.396740  5.127738
print(A * B)
          0         1         2         3         4
0  0.919932  4.963098  2.499501  7.430605  3.691235
1  6.892923  5.313267 -6.571821 -2.097031  4.125348
2  1.573064  7.774351 -0.940208  1.292943  4.515915
3 -4.470558 -7.725281 -7.375697 -0.788945  4.405619

If you have a Series with the same number of elements as the number of columns of a DataFrame, you can do arithmetic operations, with each element of the Series acting on the corresponding column of the DataFrame.

c = pd.Series([1,2,3,4,5])
print(A + c)
          0         1         2         3         4
0  1.243690  3.076332  3.950349  5.373858  5.868005
1  3.712568  3.046159  0.955528  3.369492  5.945787
2  1.413522  3.938513  2.804749  4.350383  6.381913
3  0.072114  0.117606  0.836384  3.857570  6.091522
print(A * c)
          0         1         2         3         4
0  0.243690  2.152664  2.851048  5.495433  4.340025
1  2.712568  2.092318 -6.133416 -2.522032  4.728935
2  0.413522  3.877025 -0.585752  1.401532  6.909563
3 -0.927886 -3.764787 -6.490847 -0.569721  5.457611
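
The alignment above matches the Series index to the DataFrame's column labels. If you instead want a Series to act down the rows, one value per row, the named arithmetic methods add, sub, mul, and div take an axis argument. A minimal sketch, with illustrative values:

row_vals = pd.Series([10, 20, 30, 40])  # one value per row of A (illustrative)
print(A.add(row_vals, axis = 0))        # adds 10 to row 0, 20 to row 1, and so on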

This idea can be used to standardize a dataset, i.e. make each column have mean 0 and standard deviation 1.

means = A.mean(axis=0)
stds = A.std(axis = 0)

(A - means)/stds
         0         1         2         3         4
0 -0.240828  0.318354  1.202702  1.326047 -0.899601
1  1.380224  0.300287 -0.783339 -1.013572 -0.556263
2 -0.129317  0.834602  0.442988  0.131384  1.368838
3 -1.010079 -1.453244 -0.862351 -0.443858  0.087026
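
We can verify that the transformation did what we claimed:

standardized = (A - means)/stds
print(standardized.mean(axis = 0))  # essentially 0 for every column, up to floating-point error
print(standardized.std(axis = 0))   # 1.0 for every column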


Concatenation of data sets

Let's create some example data sets.

df1 = pd.DataFrame({'A': ['a'+str(i) for i in range(4)],
    'B': ['b'+str(i) for i in range(4)],
    'C': ['c'+str(i) for i in range(4)],
    'D': ['d'+str(i) for i in range(4)]})

df2 = pd.DataFrame({'A': ['a'+str(i) for i in range(4,8)],
    'B': ['b'+str(i) for i in range(4,8)],
    'C': ['c'+str(i) for i in range(4,8)],
    'D': ['d'+str(i) for i in range(4,8)]})

df3 = pd.DataFrame({'A': ['a'+str(i) for i in range(8,12)],
    'B': ['b'+str(i) for i in range(8,12)],
    'C': ['c'+str(i) for i in range(8,12)],
    'D': ['d'+str(i) for i in range(8,12)]})

We can concatenate these DataFrame objects by row.

row_concatenate = pd.concat([df1, df2, df3])
print(row_concatenate)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

This stacks the dataframes together. They are literally stacked, as is evidenced by the index values being repeated.
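
If you would rather have a fresh 0-to-11 index in the result, pd.concat takes an ignore_index option:

pd.concat([df1, df2, df3], ignore_index = True)  # relabels the rows 0 through 11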

This same exercise can be done with the append method. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the recommended approach going forward.)

df1.append(df2).append(df3)
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

Suppose we want to append a new row to df1. Let's create a new row.

new_row = pd.Series(['n1','n2','n3','n4'])
pd.concat([df1, new_row])
     A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0  NaN  NaN  NaN  NaN   n1
1  NaN  NaN  NaN  NaN   n2
2  NaN  NaN  NaN  NaN   n3
3  NaN  NaN  NaN  NaN   n4

That's a lot of missing values. The issue is that new_row has no entries matching df1's column names: its values are indexed 0 through 3, and the unnamed Series itself becomes a column labeled 0, so pandas lines it up as a new column rather than a new row. The solution is to make it a DataFrame.

new_row = pd.DataFrame([['n1','n2','n3','n4']], columns = ['A','B','C','D'])
print(new_row)
    A   B   C   D
0  n1  n2  n3  n4
pd.concat([df1, new_row])
     A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4

or

df1.append(new_row)
     A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  n1  n2  n3  n4


Adding columns

pd.concat([df1,df2,df3], axis = 1)
     A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

The option axis=1 ensures that concatenation happens by columns. The default value axis = 0 concatenates by rows.

Let's play a little game. Let's change the column names of df2 and df3 so they are not the same as df1.

df2.columns = ['E','F','G','H']
df3.columns = ['A','D','F','H']
pd.concat([df1,df2,df3])
     A    B    C    D    E    F    G    H
0   a0   b0   c0   d0  NaN  NaN  NaN  NaN
1   a1   b1   c1   d1  NaN  NaN  NaN  NaN
2   a2   b2   c2   d2  NaN  NaN  NaN  NaN
3   a3   b3   c3   d3  NaN  NaN  NaN  NaN
0  NaN  NaN  NaN  NaN   a4   b4   c4   d4
1  NaN  NaN  NaN  NaN   a5   b5   c5   d5
2  NaN  NaN  NaN  NaN   a6   b6   c6   d6
3  NaN  NaN  NaN  NaN   a7   b7   c7   d7
0   a8  NaN  NaN   b8  NaN   c8  NaN   d8
1   a9  NaN  NaN   b9  NaN   c9  NaN   d9
2  a10  NaN  NaN  b10  NaN  c10  NaN  d10
3  a11  NaN  NaN  b11  NaN  c11  NaN  d11

Now pandas ensures that all column names are represented in the new data frame, with missing values wherever the row and column labels are mismatched. Some of this can be avoided by only joining on common columns. This is done using the join option of concat. The default value is 'outer', which is what you see above.

pd.concat([df1, df3], join = 'inner')
     A    D
0   a0   d0
1   a1   d1
2   a2   d2
3   a3   d3
0   a8   b8
1   a9   b9
2  a10  b10
3  a11  b11

You can do the same kind of inner join when concatenating by columns, using axis = 1 and join = 'inner' to keep only rows with matching indices. Remember that the indices are just labels and happen to be the row numbers by default.
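
A sketch of this, assuming we first give one frame a shifted index so that the overlap is visible:

df4 = df2.copy()
df4.index = [2, 3, 4, 5]                                # overlaps df1 only at labels 2 and 3
print(pd.concat([df1, df4], axis = 1, join = 'inner'))  # keeps just the rows labeled 2 and 3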


Merging data sets

For this section we'll use a set of data from a survey, also used by Daniel Chen in "Pandas for Everyone".

person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')
print(person)
      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth
print(site)
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
print(survey)
     taken person quant  reading
0     619   dyer   rad     9.82
1     619   dyer   sal     0.13
2     622   dyer   rad     7.80
3     622   dyer   sal     0.09
4     734     pb   rad     8.41
5     734   lake   sal     0.05
6     734     pb  temp   -21.50
7     735     pb   rad     7.22
8     735    NaN   sal     0.06
9     735    NaN  temp   -26.00
10    751     pb   rad     4.35
11    751     pb  temp   -18.50
12    751   lake   sal     0.10
13    752   lake   rad     2.19
14    752   lake   sal     0.09
15    752   lake  temp   -16.00
16    752    roe   sal    41.60
17    837   lake   rad     1.46
18    837   lake   sal     0.21
19    837    roe   sal    22.50
20    844    roe   rad    11.25
print(visited)
    ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

There are basically four kinds of joins:

pandas   R            SQL           Description
left     left_join    left outer    keep all rows on the left
right    right_join   right outer   keep all rows on the right
outer    full_join    full outer    keep all rows from both
inner    inner_join   inner         keep only rows with common keys



The terms left and right refer to which data set you call first and second, respectively.

We start with a left join.

s2v_merge = survey.merge(visited, left_on = 'taken', right_on = 'ident', how = 'left')
print(s2v_merge)
     taken person quant  reading  ident   site       dated
0     619   dyer   rad     9.82    619   DR-1  1927-02-08
1     619   dyer   sal     0.13    619   DR-1  1927-02-08
2     622   dyer   rad     7.80    622   DR-1  1927-02-10
3     622   dyer   sal     0.09    622   DR-1  1927-02-10
4     734     pb   rad     8.41    734   DR-3  1939-01-07
5     734   lake   sal     0.05    734   DR-3  1939-01-07
6     734     pb  temp   -21.50    734   DR-3  1939-01-07
7     735     pb   rad     7.22    735   DR-3  1930-01-12
8     735    NaN   sal     0.06    735   DR-3  1930-01-12
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12
10    751     pb   rad     4.35    751   DR-3  1930-02-26
11    751     pb  temp   -18.50    751   DR-3  1930-02-26
12    751   lake   sal     0.10    751   DR-3  1930-02-26
13    752   lake   rad     2.19    752   DR-3         NaN
14    752   lake   sal     0.09    752   DR-3         NaN
15    752   lake  temp   -16.00    752   DR-3         NaN
16    752    roe   sal    41.60    752   DR-3         NaN
17    837   lake   rad     1.46    837  MSK-4  1932-01-14
18    837   lake   sal     0.21    837  MSK-4  1932-01-14
19    837    roe   sal    22.50    837  MSK-4  1932-01-14
20    844    roe   rad    11.25    844   DR-1  1932-03-22

Here, the left dataset is survey, and the right one is visited. Since we're doing a left join, we keep all the rows from survey and add columns from visited, matching on the common key, called "taken" in one dataset and "ident" in the other. Note that the rows of visited are repeated as needed to line up with all the rows with common "taken" values.
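
Since each value of "ident" in visited is unique, a quick sanity check is that the left join preserves the number of rows of survey:

print(len(survey), len(s2v_merge))  # both are 21, matching the printouts above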

We can now add location information, where the common key is the site code.

s2v2loc_merge = s2v_merge.merge(site, how = 'left', left_on = 'site', right_on = 'name')
print(s2v2loc_merge)
     taken person quant  reading  ident   site       dated   name    lat    long
0     619   dyer   rad     9.82    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
1     619   dyer   sal     0.13    619   DR-1  1927-02-08   DR-1 -49.85 -128.57
2     622   dyer   rad     7.80    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
3     622   dyer   sal     0.09    622   DR-1  1927-02-10   DR-1 -49.85 -128.57
4     734     pb   rad     8.41    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
5     734   lake   sal     0.05    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
6     734     pb  temp   -21.50    734   DR-3  1939-01-07   DR-3 -47.15 -126.72
7     735     pb   rad     7.22    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
8     735    NaN   sal     0.06    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
9     735    NaN  temp   -26.00    735   DR-3  1930-01-12   DR-3 -47.15 -126.72
10    751     pb   rad     4.35    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
11    751     pb  temp   -18.50    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
12    751   lake   sal     0.10    751   DR-3  1930-02-26   DR-3 -47.15 -126.72
13    752   lake   rad     2.19    752   DR-3         NaN   DR-3 -47.15 -126.72
14    752   lake   sal     0.09    752   DR-3         NaN   DR-3 -47.15 -126.72
15    752   lake  temp   -16.00    752   DR-3         NaN   DR-3 -47.15 -126.72
16    752    roe   sal    41.60    752   DR-3         NaN   DR-3 -47.15 -126.72
17    837   lake   rad     1.46    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
18    837   lake   sal     0.21    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
19    837    roe   sal    22.50    837  MSK-4  1932-01-14  MSK-4 -48.87 -123.40
20    844    roe   rad    11.25    844   DR-1  1932-03-22   DR-1 -49.85 -128.57

Lastly, we add the person information to this dataset.

merged = s2v2loc_merge.merge(person, how = 'left', left_on = 'person', right_on = 'ident')
print(merged.head())
    taken person quant  reading  ...    long ident_y personal   family
0    619   dyer   rad     9.82  ... -128.57    dyer  William     Dyer
1    619   dyer   sal     0.13  ... -128.57    dyer  William     Dyer
2    622   dyer   rad     7.80  ... -128.57    dyer  William     Dyer
3    622   dyer   sal     0.09  ... -128.57    dyer  William     Dyer
4    734     pb   rad     8.41  ... -126.72      pb    Frank  Pabodie

[5 rows x 13 columns]

You can merge based on multiple columns as long as they match up.

ps = person.merge(survey, left_on = 'ident', right_on = 'person')
vs = visited.merge(survey, left_on = 'ident', right_on = 'taken')
print(ps)
   ident   personal   family  taken person quant  reading
0   dyer    William     Dyer    619   dyer   rad     9.82
1   dyer    William     Dyer    619   dyer   sal     0.13
2   dyer    William     Dyer    622   dyer   rad     7.80
3   dyer    William     Dyer    622   dyer   sal     0.09
4     pb      Frank  Pabodie    734     pb   rad     8.41
5     pb      Frank  Pabodie    734     pb  temp   -21.50
6     pb      Frank  Pabodie    735     pb   rad     7.22
7     pb      Frank  Pabodie    751     pb   rad     4.35
8     pb      Frank  Pabodie    751     pb  temp   -18.50
9   lake   Anderson     Lake    734   lake   sal     0.05
10  lake   Anderson     Lake    751   lake   sal     0.10
11  lake   Anderson     Lake    752   lake   rad     2.19
12  lake   Anderson     Lake    752   lake   sal     0.09
13  lake   Anderson     Lake    752   lake  temp   -16.00
14  lake   Anderson     Lake    837   lake   rad     1.46
15  lake   Anderson     Lake    837   lake   sal     0.21
16   roe  Valentina  Roerich    752    roe   sal    41.60
17   roe  Valentina  Roerich    837    roe   sal    22.50
18   roe  Valentina  Roerich    844    roe   rad    11.25
print(vs)
     ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25
ps_vs = ps.merge(vs, 
                left_on = ['ident','taken', 'quant','reading'],
                right_on = ['person','ident','quant','reading']) # The keys need to correspond
ps_vs.head()
  ident_x personal   family  taken_x  ...  site       dated  taken_y  person_y
0    dyer  William     Dyer      619  ...  DR-1  1927-02-08      619      dyer
1    dyer  William     Dyer      619  ...  DR-1  1927-02-08      619      dyer
2    dyer  William     Dyer      622  ...  DR-1  1927-02-10      622      dyer
3    dyer  William     Dyer      622  ...  DR-1  1927-02-10      622      dyer
4      pb    Frank  Pabodie      734  ...  DR-3  1939-01-07      734        pb

[5 rows x 12 columns]

Note that since there are common column names, merge appends _x and _y to them to denote which column came from the left and right dataset, respectively.
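
These suffixes can be customized via the suffixes argument of merge; for example:

ps_vs = ps.merge(vs,
                left_on = ['ident','taken', 'quant','reading'],
                right_on = ['person','ident','quant','reading'],
                suffixes = ('_ps', '_vs'))  # instead of the default ('_x', '_y')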


Tidy data principles and reshaping datasets

The tidy data principles were espoused by Dr. Hadley Wickham, one of the foremost R developers. Tidy data is a structure for datasets that makes them easier to analyze computationally. The basic principles are

  • Each row is an observation
  • Each column is a variable
  • Each type of observational unit forms a table

Tidy data is tidy in one way. Untidy data can be untidy in many ways.

Let's look at some examples.

from glob import glob
filenames = sorted(glob('data/table*.csv')) # find files matching pattern. I know there are 6 of them
table1, table2, table3, table4a, table4b, table5 = [pd.read_csv(f) for f in filenames] # Use a list comprehension

This code imports data from the 6 files matching the pattern. Python allows multiple assignment on the left of the =, and as each dataset is imported, it is assigned in order to the variables on the left. In the second line, I sort the file names so that they match the order in which I store them in the third line. The function glob does pattern-matching of file names.
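
Printing filenames shows the sorted matches (assuming the files are named as the assignment order suggests):

print(filenames)
# ['data/table1.csv', 'data/table2.csv', 'data/table3.csv',
#  'data/table4a.csv', 'data/table4b.csv', 'data/table5.csv']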

The following tables refer to the number of TB cases and population in Afghanistan, Brazil, and China in 1999 and 2000.

print(table1)
       country  year   cases  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583
print(table2)
        country  year        type       count
0   Afghanistan  1999       cases         745
1   Afghanistan  1999  population    19987071
2   Afghanistan  2000       cases        2666
3   Afghanistan  2000  population    20595360
4        Brazil  1999       cases       37737
5        Brazil  1999  population   172006362
6        Brazil  2000       cases       80488
7        Brazil  2000  population   174504898
8         China  1999       cases      212258
9         China  1999  population  1272915272
10        China  2000       cases      213766
11        China  2000  population  1280428583
print(table3)
       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583
print(table4a) # cases
       country    1999    2000
0  Afghanistan     745    2666
1       Brazil   37737   80488
2        China  212258  213766
print(table4b) # population
       country        1999        2000
0  Afghanistan    19987071    20595360
1       Brazil   172006362   174504898
2        China  1272915272  1280428583
print(table5)
       country  century  year               rate
0  Afghanistan       19    99       745/19987071
1  Afghanistan       20     0      2666/20595360
2       Brazil       19    99    37737/172006362
3       Brazil       20     0    80488/174504898
4        China       19    99  212258/1272915272
5        China       20     0  213766/1280428583

Exercise: For each of these datasets, describe whether or not it is tidy, and why.


Melting (unpivoting) data

Melting is the operation of collapsing multiple columns into two columns, where one column is formed from the old column names and the other from the corresponding values. Some columns may be kept fixed, and their data are repeated as needed to maintain the interrelationships between the variables.

We'll start with loading some data on income and religion in the US from the Pew Research Center.

pew = pd.read_csv('data/pew.csv')
print(pew.head())
              religion  <$10k  $10-20k  ...  $100-150k  >150k  Don't know/refused
0            Agnostic     27       34  ...        109     84                  96
1             Atheist     12       27  ...         59     74                  76
2            Buddhist     27       21  ...         39     53                  54
3            Catholic    418      617  ...        792    633                1489
4  Don't know/refused     15       14  ...         17     18                 116

[5 rows x 11 columns]

This dataset is considered to be in "wide" format. There are several issues with it, including the fact that the column headers contain data: those headers are income groups, which by tidy principles should be values in a column. Our job is to turn this dataset into "long" format with a column for income group.

We will use the function melt to achieve this. This takes a few parameters:

  • id_vars is a list of variables that will remain as is
  • value_vars is a list of column names that we will melt (or unpivot). By default, it will melt all columns not mentioned in id_vars
  • var_name is a string giving the name of the new column created by the headers (default: variable)
  • value_name is a string giving the name of the new column created by the values (default: value)

pew_long = pew.melt(id_vars = ['religion'], var_name = 'income_group', value_name = 'count')
print(pew_long.head())
             religion income_group  count
0            Agnostic        <$10k     27
1             Atheist        <$10k     12
2            Buddhist        <$10k     27
3            Catholic        <$10k    418
4  Don't know/refused        <$10k     15
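
The value_vars argument, which we did not need above (melt defaulted to melting every column except religion), restricts the melt to a subset of columns. For instance, using two of the column names from the printout above:

pew.melt(id_vars = ['religion'],
         value_vars = ['<$10k', '$10-20k'],
         var_name = 'income_group', value_name = 'count')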


Separating columns containing multiple variables

We will use an Ebola dataset to illustrate this principle.

ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.head())
          Date  Day  ...  Deaths_Spain  Deaths_Mali
0    1/5/2015  289  ...           NaN          NaN
1    1/4/2015  288  ...           NaN          NaN
2    1/3/2015  287  ...           NaN          NaN
3    1/2/2015  286  ...           NaN          NaN
4  12/31/2014  284  ...           NaN          NaN

[5 rows x 18 columns]

Note that for each country, we have two columns – one for cases (number infected) and one for deaths. Ideally, we want one column for country, one for cases, and one for deaths.

The first step will be to melt this data set so that the column headers in question form a column and the corresponding data forms a second column.

ebola_long = ebola.melt(id_vars = ['Date','Day'])
print(ebola_long.head())
         Date  Day      variable   value
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0

We now need to split the data in the variable column to make two columns. One will contain the country name and the other either Cases or Deaths. We will use some string manipulation functions that we will see later to achieve this.

variable_split = ebola_long['variable'].str.split('_', expand=True) # split on the '_' character
print(variable_split[:5])
     0       1
0  Cases  Guinea
1  Cases  Guinea
2  Cases  Guinea
3  Cases  Guinea
4  Cases  Guinea

The expand=True option forces the creation of a DataFrame rather than a Series of lists.

type(variable_split)
<class 'pandas.core.frame.DataFrame'>

We can now concatenate this to the original data.

variable_split.columns = ['status','country']

ebola_parsed = pd.concat([ebola_long, variable_split], axis = 1)

ebola_parsed.drop('variable', axis = 1, inplace=True) # Remove the column named "variable"; inplace=True modifies ebola_parsed directly rather than returning a copy

print(ebola_parsed.head())
         Date  Day   value status country
0    1/5/2015  289  2776.0  Cases  Guinea
1    1/4/2015  288  2775.0  Cases  Guinea
2    1/3/2015  287  2769.0  Cases  Guinea
3    1/2/2015  286     NaN  Cases  Guinea
4  12/31/2014  284  2730.0  Cases  Guinea
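
As an aside, the split-and-concatenate steps can be collapsed into a single assignment, since the two-column DataFrame produced by expand=True can be assigned to two new columns at once:

ebola_long[['status','country']] = ebola_long['variable'].str.split('_', expand = True)
ebola_parsed = ebola_long.drop('variable', axis = 1)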


Pivot/spread datasets

If we wanted to, we could also make two columns based on cases and deaths, so for each country and date, you could easily read off the cases and deaths. This is achieved using the pivot_table function.

In the pivot_table syntax, index refers to the columns we don't want to change, columns refers to the column whose values will form the column names of the new columns, and values is the name of the column that will form the values in the pivoted dataset.

ebola_parsed.pivot_table(index = ['Date','Day', 'country'], columns = 'status', values = 'value')
status                     Cases  Deaths
Date     Day country                    
1/2/2015 286 Liberia      8157.0  3496.0
1/3/2015 287 Guinea       2769.0  1767.0
             Liberia      8166.0  3496.0
             SierraLeone  9722.0  2915.0
1/4/2015 288 Guinea       2775.0  1781.0
...                          ...     ...
9/7/2014 169 Liberia      2081.0  1137.0
             Nigeria        21.0     8.0
             Senegal         3.0     0.0
             SierraLeone  1424.0   524.0
9/9/2014 171 Liberia      2407.0     NaN

[375 rows x 2 columns]

This creates something called a MultiIndex in the pandas DataFrame. This is useful in some advanced cases, but here we just want a normal DataFrame back. We can achieve that by using the reset_index function.

ebola_parsed.pivot_table(index = ['Date','Day','country'], columns = 'status', values = 'value').reset_index()
status      Date  Day      country   Cases  Deaths
0       1/2/2015  286      Liberia  8157.0  3496.0
1       1/3/2015  287       Guinea  2769.0  1767.0
2       1/3/2015  287      Liberia  8166.0  3496.0
3       1/3/2015  287  SierraLeone  9722.0  2915.0
4       1/4/2015  288       Guinea  2775.0  1781.0
..           ...  ...          ...     ...     ...
370     9/7/2014  169      Liberia  2081.0  1137.0
371     9/7/2014  169      Nigeria    21.0     8.0
372     9/7/2014  169      Senegal     3.0     0.0
373     9/7/2014  169  SierraLeone  1424.0   524.0
374     9/9/2014  171      Liberia  2407.0     NaN

[375 rows x 5 columns]
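
One cosmetic detail: the word status in the top-left corner of the printout is the name of the columns axis, left over from the pivot. If you want a completely plain DataFrame, you can remove it (ebola_tidy is just an illustrative name):

ebola_tidy = ebola_parsed.pivot_table(index = ['Date','Day','country'],
                                      columns = 'status', values = 'value').reset_index()
ebola_tidy.columns.name = None  # drop the lingering "status" label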

Pivoting is a 2-column-to-many-column operation, with the number of new columns depending on the number of unique values in the column of the original data that is passed to the columns argument of pivot_table.

Exercise: Load the file weather.csv into Python and work on making it a tidy dataset. It requires melting and pivoting. The dataset comprises the maximum and minimum temperatures recorded each day in 2010. There are lots of missing values. Ultimately we want columns for days of the month, maximum temperature, and minimum temperature, along with the location ID, the year, and the month.


Source: Abhijit Dasgupta, https://www.araastat.com/BIOF085/pandas.html#data-transformation
This work is licensed under a Creative Commons Attribution 4.0 License.