Data Aggregation and Split-Apply-Combine

We'll use the Gapminder dataset for this section.

import pandas as pd
import numpy as np

df = pd.read_csv('data/gapminder.tsv', sep='\t')  # the data is tab-separated, so we pass sep='\t'

The paradigm we will be exploring is often called split-apply-combine, MapReduce, or grouped aggregation. The basic idea is that you split a dataset up by some feature, apply a recipe to each piece to compute a result, and then put the results back together into a new dataset. Schematically: split → apply → combine.
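
To make the idea concrete, here is a minimal sketch of the same pattern done "by hand", without groupby (the variable names are just for illustration):

# Split-apply-combine without groupby, purely to illustrate the idea.
results = {}
for country in df['country'].unique():           # split: one piece per country
    piece = df[df['country'] == country]
    results[country] = piece['lifeExp'].mean()   # apply: compute the mean of each piece
by_hand = pd.Series(results, name='lifeExp')     # combine: gather the results into one Series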


pandas is set up for this. It provides the groupby method, which performs the "split" part of the operation; we can then apply a function to each piece and put the results back together. Let's see how.

df.head()
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
f"This dataset has {len(df['country'].unique())} countries in it"
'This dataset has 142 countries in it'

One of the variables in this dataset is life expectancy at birth, lifeExp. Suppose we want to find the average life expectancy of each country over the period of study.

df.groupby('country')['lifeExp'].mean()
country
Afghanistan           37.478833
Albania               68.432917
Algeria               59.030167
Angola                37.883500
Argentina             69.060417
                        ...    
Vietnam               57.479500
West Bank and Gaza    60.328667
Yemen, Rep.           46.780417
Zambia                45.996333
Zimbabwe              52.663167
Name: lifeExp, Length: 142, dtype: float64

So what's going on here? First, we use the groupby method, telling pandas to split the dataset up by the values of the country column.

df.groupby('country')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11cfb1b50>

pandas won't show you the actual data; it just tells you that this is a grouped DataFrame (DataFrameGroupBy) object. Each group inside it is a DataFrame holding the data for one country.

df.groupby('country').ngroups
142
df.groupby('country').get_group('United Kingdom')
             country continent  year  lifeExp       pop     gdpPercap
1596  United Kingdom    Europe  1952   69.180  50430000   9979.508487
1597  United Kingdom    Europe  1957   70.420  51430000  11283.177950
1598  United Kingdom    Europe  1962   70.760  53292000  12477.177070
1599  United Kingdom    Europe  1967   71.360  54959000  14142.850890
1600  United Kingdom    Europe  1972   72.010  56079000  15895.116410
1601  United Kingdom    Europe  1977   72.760  56179000  17428.748460
1602  United Kingdom    Europe  1982   74.040  56339704  18232.424520
1603  United Kingdom    Europe  1987   75.007  56981620  21664.787670
1604  United Kingdom    Europe  1992   76.420  57866349  22705.092540
1605  United Kingdom    Europe  1997   77.218  58808266  26074.531360
1606  United Kingdom    Europe  2002   78.471  59912431  29478.999190
1607  United Kingdom    Europe  2007   79.425  60776238  33203.261280
type(df.groupby('country').get_group('United Kingdom'))
<class 'pandas.core.frame.DataFrame'>
avg_lifeexp_country = df.groupby('country').lifeExp.mean()
avg_lifeexp_country['United Kingdom']
73.92258333333332
df.groupby('country').get_group('United Kingdom').lifeExp.mean()
73.92258333333332
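
As an aside (not shown in the output above), you can also iterate over a grouped object; each iteration yields the group key together with that group's DataFrame:

# Each iteration yields a (key, DataFrame) pair; here we just record group sizes.
group_sizes = {}
for country, group in df.groupby('country'):
    group_sizes[country] = len(group)   # `group` is an ordinary DataFrame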

Let's look at whether life expectancy has gone up over time, by continent.

df.groupby(['continent','year']).lifeExp.mean()
continent  year
Africa     1952    39.135500
           1957    41.266346
           1962    43.319442
           1967    45.334538
           1972    47.450942
           1977    49.580423
           1982    51.592865
           1987    53.344788
           1992    53.629577
           1997    53.598269
           2002    53.325231
           2007    54.806038
Americas   1952    53.279840
           1957    55.960280
           1962    58.398760
           1967    60.410920
           1972    62.394920
           1977    64.391560
           1982    66.228840
           1987    68.090720
           1992    69.568360
           1997    71.150480
           2002    72.422040
           2007    73.608120
Asia       1952    46.314394
           1957    49.318544
           1962    51.563223
           1967    54.663640
           1972    57.319269
           1977    59.610556
           1982    62.617939
           1987    64.851182
           1992    66.537212
           1997    68.020515
           2002    69.233879
           2007    70.728485
Europe     1952    64.408500
           1957    66.703067
           1962    68.539233
           1967    69.737600
           1972    70.775033
           1977    71.937767
           1982    72.806400
           1987    73.642167
           1992    74.440100
           1997    75.505167
           2002    76.700600
           2007    77.648600
Oceania    1952    69.255000
           1957    70.295000
           1962    71.085000
           1967    71.310000
           1972    71.910000
           1977    72.855000
           1982    74.290000
           1987    75.320000
           1992    76.945000
           1997    78.190000
           2002    79.740000
           2007    80.719500
Name: lifeExp, dtype: float64
avg_lifeexp_continent_yr = df.groupby(['continent','year']).lifeExp.mean().reset_index()
avg_lifeexp_continent_yr
   continent  year    lifeExp
0     Africa  1952  39.135500
1     Africa  1957  41.266346
2     Africa  1962  43.319442
3     Africa  1967  45.334538
4     Africa  1972  47.450942
5     Africa  1977  49.580423
6     Africa  1982  51.592865
7     Africa  1987  53.344788
8     Africa  1992  53.629577
9     Africa  1997  53.598269
10    Africa  2002  53.325231
11    Africa  2007  54.806038
12  Americas  1952  53.279840
13  Americas  1957  55.960280
14  Americas  1962  58.398760
15  Americas  1967  60.410920
16  Americas  1972  62.394920
17  Americas  1977  64.391560
18  Americas  1982  66.228840
19  Americas  1987  68.090720
20  Americas  1992  69.568360
21  Americas  1997  71.150480
22  Americas  2002  72.422040
23  Americas  2007  73.608120
24      Asia  1952  46.314394
25      Asia  1957  49.318544
26      Asia  1962  51.563223
27      Asia  1967  54.663640
28      Asia  1972  57.319269
29      Asia  1977  59.610556
30      Asia  1982  62.617939
31      Asia  1987  64.851182
32      Asia  1992  66.537212
33      Asia  1997  68.020515
34      Asia  2002  69.233879
35      Asia  2007  70.728485
36    Europe  1952  64.408500
37    Europe  1957  66.703067
38    Europe  1962  68.539233
39    Europe  1967  69.737600
40    Europe  1972  70.775033
41    Europe  1977  71.937767
42    Europe  1982  72.806400
43    Europe  1987  73.642167
44    Europe  1992  74.440100
45    Europe  1997  75.505167
46    Europe  2002  76.700600
47    Europe  2007  77.648600
48   Oceania  1952  69.255000
49   Oceania  1957  70.295000
50   Oceania  1962  71.085000
51   Oceania  1967  71.310000
52   Oceania  1972  71.910000
53   Oceania  1977  72.855000
54   Oceania  1982  74.290000
55   Oceania  1987  75.320000
56   Oceania  1992  76.945000
57   Oceania  1997  78.190000
58   Oceania  2002  79.740000
59   Oceania  2007  80.719500
type(avg_lifeexp_continent_yr)
<class 'pandas.core.frame.DataFrame'>
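
As an aside, an alternative to reset_index here is unstack, which pivots the inner index level (year) into columns; a minimal sketch (output not shown):

# unstack() turns the inner index level (year) into columns,
# giving one row per continent and one column per year.
wide = df.groupby(['continent', 'year']).lifeExp.mean().unstack()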

The aggregation function, in this case mean, does both the "apply" and "combine" parts of the process.

We can do quick built-in aggregations with pandas.

df.groupby('continent').lifeExp.describe()
           count       mean        std  ...      50%       75%     max
continent                               ...                           
Africa     624.0  48.865330   9.150210  ...  47.7920  54.41150  76.442
Americas   300.0  64.658737   9.345088  ...  67.0480  71.69950  80.653
Asia       396.0  60.064903  11.864532  ...  61.7915  69.50525  82.603
Europe     360.0  71.903686   5.433178  ...  72.2410  75.45050  81.757
Oceania     24.0  74.326208   3.795611  ...  73.6650  77.55250  81.235

[5 rows x 8 columns]
df.groupby('continent').nth(10) # the row at position 10 (zero-indexed) in each group
               country  year  lifeExp       pop     gdpPercap
continent                                                    
Africa         Algeria  2002   70.994  31287142   5288.040382
Americas     Argentina  2002   74.340  38331121   8797.640716
Asia       Afghanistan  2002   42.129  25268405    726.734055
Europe         Albania  2002   75.651   3508512   4604.211737
Oceania      Australia  2002   80.370  19546792  30687.754730
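
Other built-in grouped aggregations work the same way; a few examples, sketched without output:

df.groupby('continent').size()           # number of rows in each group
df.groupby('continent').lifeExp.min()    # minimum life expectancy per continent
df.groupby('continent').lifeExp.max()    # maximum life expectancy per continent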

You can also use functions from other modules, or your own functions, in these aggregations.

df.groupby('continent').lifeExp.agg(np.mean)
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
def my_mean(values):
    n = len(values)
    total = 0                 # use `total` rather than shadowing the built-in sum
    for value in values:
        total += value
    return total / n

df.groupby('continent').lifeExp.agg(my_mean)
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

You can apply several functions at once.

df.groupby('year').lifeExp.agg([np.count_nonzero, np.mean, np.std])
      count_nonzero       mean        std
year                                     
1952          142.0  49.057620  12.225956
1957          142.0  51.507401  12.231286
1962          142.0  53.609249  12.097245
1967          142.0  55.678290  11.718858
1972          142.0  57.647386  11.381953
1977          142.0  59.570157  11.227229
1982          142.0  61.533197  10.770618
1987          142.0  63.212613  10.556285
1992          142.0  64.160338  11.227380
1997          142.0  65.014676  11.559439
2002          142.0  65.694923  12.279823
2007          142.0  67.007423  12.073021

You can also aggregate different columns at the same time by passing a dict to the agg function.

df.groupby('year').agg({'lifeExp': np.mean,'pop': np.median,'gdpPercap': np.median}).reset_index()
     year    lifeExp         pop    gdpPercap
0   1952  49.057620   3943953.0  1968.528344
1   1957  51.507401   4282942.0  2173.220291
2   1962  53.609249   4686039.5  2335.439533
3   1967  55.678290   5170175.5  2678.334741
4   1972  57.647386   5877996.5  3339.129407
5   1977  59.570157   6404036.5  3798.609244
6   1982  61.533197   7007320.0  4216.228428
7   1987  63.212613   7774861.5  4280.300366
8   1992  64.160338   8688686.5  4386.085502
9   1997  65.014676   9735063.5  4781.825478
10  2002  65.694923  10372918.5  5319.804524
11  2007  67.007423  10517531.0  6124.371109
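
As an aside, newer versions of pandas (0.25+) also support "named aggregation", where keyword arguments to agg name the output columns; a minimal sketch of the same summary (the variable name summary is just for illustration):

# Each keyword argument is output_name=(column, aggregation)
summary = df.groupby('year').agg(
    lifeExp=('lifeExp', 'mean'),
    pop=('pop', 'median'),
    gdpPercap=('gdpPercap', 'median'),
).reset_index()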


Transformation

You can do grouped transformations using this same method. We will compute the z-score of life expectancy within each year, i.e. for each observation we subtract that year's average life expectancy and divide by that year's standard deviation: z = (x - mean) / std.

def my_zscore(values):
    m = np.mean(values)   # group mean
    s = np.std(values)    # group standard deviation (population, ddof=0)
    return (values - m) / s
df.groupby('year').lifeExp.transform(my_zscore)
0      -1.662719
1      -1.737377
2      -1.792867
3      -1.854699
4      -1.900878
          ...   
1699   -0.081910
1700   -0.338167
1701   -1.580537
1702   -2.100756
1703   -1.955077
Name: lifeExp, Length: 1704, dtype: float64
df['lifeExp_z'] = df.groupby('year').lifeExp.transform(my_zscore)
df.groupby('year').lifeExp_z.mean()
year
1952   -5.165078e-16
1957    2.902608e-17
1962    2.404180e-16
1967   -6.108181e-16
1972    1.784566e-16
1977   -9.456442e-16
1982   -1.623310e-16
1987    6.687725e-16
1992    5.457293e-16
1997    8.787963e-16
2002    5.254013e-16
2007    4.925637e-16
Name: lifeExp_z, dtype: float64
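
The same transformation can be written inline with a lambda. One caveat: my_zscore uses np.std, whose default is the population standard deviation (ddof=0), while the pandas Series.std method defaults to the sample standard deviation (ddof=1), so pass ddof=0 explicitly if you want the two to match exactly:

# Inline equivalent of my_zscore; ddof=0 matches np.std's default
df.groupby('year').lifeExp.transform(lambda x: (x - x.mean()) / x.std(ddof=0))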


Filter

We can split the dataset by the values of one variable and filter out the groups that fail some criterion. The following code keeps only those countries whose population exceeded 10 million at some point during the study period.

df.groupby('country').filter(lambda d: d['pop'].max() > 10000000)
          country continent  year  lifeExp       pop   gdpPercap  lifeExp_z
0     Afghanistan      Asia  1952   28.801   8425333  779.445314  -1.662719
1     Afghanistan      Asia  1957   30.332   9240934  820.853030  -1.737377
2     Afghanistan      Asia  1962   31.997  10267083  853.100710  -1.792867
3     Afghanistan      Asia  1967   34.020  11537966  836.197138  -1.854699
4     Afghanistan      Asia  1972   36.088  13079460  739.981106  -1.900878
...           ...       ...   ...      ...       ...         ...        ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306  -0.081910
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786  -0.338167
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960  -1.580537
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623  -2.100756
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298  -1.955077

[924 rows x 7 columns]
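
Because filter keeps or drops entire groups, every country that remains keeps all of its rows. A quick way to check how many countries survive the cut (output not shown; the variable name big_countries is just for illustration):

# filter() acts on whole groups, so we can count the remaining countries
big_countries = df.groupby('country').filter(lambda d: d['pop'].max() > 10000000)
big_countries['country'].nunique()   # number of countries that pass the population criterion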