With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If data is numerical, it is acceptable to think of a dataframe or series as an array, and arithmetic operations obey a set of rules similar to that of numpy. In addition, pandas has been designed with database programming features such as the merge, join, and concatenate operations. Study these sections to practice and get an introduction to some basic pandas operations.
Data Aggregation and Split-Apply-Combine
We'll use the Gapminder dataset for this section
df = pd.read_csv('data/gapminder.tsv', sep = '\t') # data is tab-separated, so we use '\t' to specify that
The paradigm we will be exploring is often called split-apply-combine or MapReduce or grouped aggregation. The basic idea is that you split a data set up by some feature, apply a recipe to each piece, compute the result, and then put the results back together into a dataset. This can be described in teh following schematic.
pandas
is set up for this. It features the groupby
function that allows the "split" part of the operation. We can then
apply a function to each part and put it back together. Let's see how.
df.head()
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106
f"This dataset has {len(df['country'].unique())} countries in it"
'This dataset has 142 countries in it'
One of the variables in this dataset is life expectancy at birth, lifeExp
. Suppose we want to find the average life expectancy of each country over the period of study.
df.groupby('country')['lifeExp'].mean()
country Afghanistan 37.478833 Albania 68.432917 Algeria 59.030167 Angola 37.883500 Argentina 69.060417 ... Vietnam 57.479500 West Bank and Gaza 60.328667 Yemen, Rep. 46.780417 Zambia 45.996333 Zimbabwe 52.663167 Name: lifeExp, Length: 142, dtype: float64
So what's going on here? First, we use the groupby
function, telling pandas
to split the dataset up by values of the column country
.
df.groupby('country')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11cfb1b50>
pandas
won't show you the actual data, but will tell you
that it is a grouped dataframe object. This means that each element of
this object is a DataFrame
with data from one country.
df.groupby('country').ngroups
142
df.groupby('country').get_group('United Kingdom')
country continent year lifeExp pop gdpPercap 1596 United Kingdom Europe 1952 69.180 50430000 9979.508487 1597 United Kingdom Europe 1957 70.420 51430000 11283.177950 1598 United Kingdom Europe 1962 70.760 53292000 12477.177070 1599 United Kingdom Europe 1967 71.360 54959000 14142.850890 1600 United Kingdom Europe 1972 72.010 56079000 15895.116410 1601 United Kingdom Europe 1977 72.760 56179000 17428.748460 1602 United Kingdom Europe 1982 74.040 56339704 18232.424520 1603 United Kingdom Europe 1987 75.007 56981620 21664.787670 1604 United Kingdom Europe 1992 76.420 57866349 22705.092540 1605 United Kingdom Europe 1997 77.218 58808266 26074.531360 1606 United Kingdom Europe 2002 78.471 59912431 29478.999190 1607 United Kingdom Europe 2007 79.425 60776238 33203.261280
type(df.groupby('country').get_group('United Kingdom'))
<class 'pandas.core.frame.DataFrame'>
avg_lifeexp_country = df.groupby('country').lifeExp.mean() avg_lifeexp_country['United Kingdom']
73.92258333333332
df.groupby('country').get_group('United Kingdom').lifeExp.mean()
73.92258333333332
Let's look at if life expectancy has gone up over time, by continent
df.groupby(['continent','year']).lifeExp.mean()
continent year Africa 1952 39.135500 1957 41.266346 1962 43.319442 1967 45.334538 1972 47.450942 1977 49.580423 1982 51.592865 1987 53.344788 1992 53.629577 1997 53.598269 2002 53.325231 2007 54.806038 Americas 1952 53.279840 1957 55.960280 1962 58.398760 1967 60.410920 1972 62.394920 1977 64.391560 1982 66.228840 1987 68.090720 1992 69.568360 1997 71.150480 2002 72.422040 2007 73.608120 Asia 1952 46.314394 1957 49.318544 1962 51.563223 1967 54.663640 1972 57.319269 1977 59.610556 1982 62.617939 1987 64.851182 1992 66.537212 1997 68.020515 2002 69.233879 2007 70.728485 Europe 1952 64.408500 1957 66.703067 1962 68.539233 1967 69.737600 1972 70.775033 1977 71.937767 1982 72.806400 1987 73.642167 1992 74.440100 1997 75.505167 2002 76.700600 2007 77.648600 Oceania 1952 69.255000 1957 70.295000 1962 71.085000 1967 71.310000 1972 71.910000 1977 72.855000 1982 74.290000 1987 75.320000 1992 76.945000 1997 78.190000 2002 79.740000 2007 80.719500 Name: lifeExp, dtype: float64
avg_lifeexp_continent_yr = df.groupby(['continent','year']).lifeExp.mean().reset_index() avg_lifeexp_continent_yr
continent year lifeExp 0 Africa 1952 39.135500 1 Africa 1957 41.266346 2 Africa 1962 43.319442 3 Africa 1967 45.334538 4 Africa 1972 47.450942 5 Africa 1977 49.580423 6 Africa 1982 51.592865 7 Africa 1987 53.344788 8 Africa 1992 53.629577 9 Africa 1997 53.598269 10 Africa 2002 53.325231 11 Africa 2007 54.806038 12 Americas 1952 53.279840 13 Americas 1957 55.960280 14 Americas 1962 58.398760 15 Americas 1967 60.410920 16 Americas 1972 62.394920 17 Americas 1977 64.391560 18 Americas 1982 66.228840 19 Americas 1987 68.090720 20 Americas 1992 69.568360 21 Americas 1997 71.150480 22 Americas 2002 72.422040 23 Americas 2007 73.608120 24 Asia 1952 46.314394 25 Asia 1957 49.318544 26 Asia 1962 51.563223 27 Asia 1967 54.663640 28 Asia 1972 57.319269 29 Asia 1977 59.610556 30 Asia 1982 62.617939 31 Asia 1987 64.851182 32 Asia 1992 66.537212 33 Asia 1997 68.020515 34 Asia 2002 69.233879 35 Asia 2007 70.728485 36 Europe 1952 64.408500 37 Europe 1957 66.703067 38 Europe 1962 68.539233 39 Europe 1967 69.737600 40 Europe 1972 70.775033 41 Europe 1977 71.937767 42 Europe 1982 72.806400 43 Europe 1987 73.642167 44 Europe 1992 74.440100 45 Europe 1997 75.505167 46 Europe 2002 76.700600 47 Europe 2007 77.648600 48 Oceania 1952 69.255000 49 Oceania 1957 70.295000 50 Oceania 1962 71.085000 51 Oceania 1967 71.310000 52 Oceania 1972 71.910000 53 Oceania 1977 72.855000 54 Oceania 1982 74.290000 55 Oceania 1987 75.320000 56 Oceania 1992 76.945000 57 Oceania 1997 78.190000 58 Oceania 2002 79.740000 59 Oceania 2007 80.719500
type(avg_lifeexp_continent_yr)
<class 'pandas.core.frame.DataFrame'>
The aggregation function, in this case mean
, does both the "apply" and "combine" parts of the process.
We can do quick aggregations with pandas
df.groupby('continent').lifeExp.describe()
count mean std ... 50% 75% max continent ... Africa 624.0 48.865330 9.150210 ... 47.7920 54.41150 76.442 Americas 300.0 64.658737 9.345088 ... 67.0480 71.69950 80.653 Asia 396.0 60.064903 11.864532 ... 61.7915 69.50525 82.603 Europe 360.0 71.903686 5.433178 ... 72.2410 75.45050 81.757 Oceania 24.0 74.326208 3.795611 ... 73.6650 77.55250 81.235 [5 rows x 8 columns]
df.groupby('continent').nth(10) # Tenth observation in each group
country year lifeExp pop gdpPercap continent Africa Algeria 2002 70.994 31287142 5288.040382 Americas Argentina 2002 74.340 38331121 8797.640716 Asia Afghanistan 2002 42.129 25268405 726.734055 Europe Albania 2002 75.651 3508512 4604.211737 Oceania Australia 2002 80.370 19546792 30687.754730
You can also use functions from other modules, or your own functions in this aggregation work.
df.groupby('continent').lifeExp.agg(np.mean)
continent Africa 48.865330 Americas 64.658737 Asia 60.064903 Europe 71.903686 Oceania 74.326208 Name: lifeExp, dtype: float64
def my_mean(values): n = len(values) sum = 0 for value in values: sum += value return(sum/n) df.groupby('continent').lifeExp.agg(my_mean)
continent Africa 48.865330 Americas 64.658737 Asia 60.064903 Europe 71.903686 Oceania 74.326208 Name: lifeExp, dtype: float64
You can do many functions at once
df.groupby('year').lifeExp.agg([np.count_nonzero, np.mean, np.std])
count_nonzero mean std year 1952 142.0 49.057620 12.225956 1957 142.0 51.507401 12.231286 1962 142.0 53.609249 12.097245 1967 142.0 55.678290 11.718858 1972 142.0 57.647386 11.381953 1977 142.0 59.570157 11.227229 1982 142.0 61.533197 10.770618 1987 142.0 63.212613 10.556285 1992 142.0 64.160338 11.227380 1997 142.0 65.014676 11.559439 2002 142.0 65.694923 12.279823 2007 142.0 67.007423 12.073021
You can also aggregate on different columns at the same time by passing a dict
to the agg
function
df.groupby('year').agg({'lifeExp': np.mean,'pop': np.median,'gdpPercap': np.median}).reset_index()
year lifeExp pop gdpPercap 0 1952 49.057620 3943953.0 1968.528344 1 1957 51.507401 4282942.0 2173.220291 2 1962 53.609249 4686039.5 2335.439533 3 1967 55.678290 5170175.5 2678.334741 4 1972 57.647386 5877996.5 3339.129407 5 1977 59.570157 6404036.5 3798.609244 6 1982 61.533197 7007320.0 4216.228428 7 1987 63.212613 7774861.5 4280.300366 8 1992 64.160338 8688686.5 4386.085502 9 1997 65.014676 9735063.5 4781.825478 10 2002 65.694923 10372918.5 5319.804524 11 2007 67.007423 10517531.0 6124.371109
Transformation
You can do grouped transformations using this same method. We will compute the z-score for each year, i.e. we will substract the average life expectancy and divide by the standard deviation
def my_zscore(values): m = np.mean(values) s = np.std(values) return((values - m)/s)
df.groupby('year').lifeExp.transform(my_zscore)
0 -1.662719 1 -1.737377 2 -1.792867 3 -1.854699 4 -1.900878 ... 1699 -0.081910 1700 -0.338167 1701 -1.580537 1702 -2.100756 1703 -1.955077 Name: lifeExp, Length: 1704, dtype: float64
df['lifeExp_z'] = df.groupby('year').lifeExp.transform(my_zscore)
df.groupby('year').lifeExp_z.mean()
year 1952 -5.165078e-16 1957 2.902608e-17 1962 2.404180e-16 1967 -6.108181e-16 1972 1.784566e-16 1977 -9.456442e-16 1982 -1.623310e-16 1987 6.687725e-16 1992 5.457293e-16 1997 8.787963e-16 2002 5.254013e-16 2007 4.925637e-16 Name: lifeExp_z, dtype: float64
Filter
We can split the dataset by values of one variable, and filter out those splits that fail some criterion. The following code only keeps countries with a population of at least 10 million at some point during the study period
df.groupby('country').filter(lambda d: d['pop'].max() > 10000000)
country continent year lifeExp pop gdpPercap lifeExp_z 0 Afghanistan Asia 1952 28.801 8425333 779.445314 -1.662719 1 Afghanistan Asia 1957 30.332 9240934 820.853030 -1.737377 2 Afghanistan Asia 1962 31.997 10267083 853.100710 -1.792867 3 Afghanistan Asia 1967 34.020 11537966 836.197138 -1.854699 4 Afghanistan Asia 1972 36.088 13079460 739.981106 -1.900878 ... ... ... ... ... ... ... ... 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306 -0.081910 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786 -0.338167 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960 -1.580537 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623 -2.100756 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298 -1.955077 [924 rows x 7 columns]