Pandas DataFrame Operations

Use these materials to practice slicing, indexing, and the syntax for merging and filtering DataFrames. Notice how much of the syntax is consistent with familiar Python constructs such as dictionary keys and array indices, and how the pandas module builds a much larger set of capabilities on top of them.

DataFrame is an essential data structure in Pandas, and there are many ways to operate on it. Arithmetic, logical and bit-wise operations can be done across one or more frames.

Operations specific to data analysis include:

  • Subsetting: Access a specific row/column, range of rows/columns, or a specific item.
  • Slicing: A form of subsetting in which Python slicing syntax [:] is used. Rows/columns are numbered from zero. Negative integers imply traversal from the last row/column.
  • Filtering: Obtain rows that fulfill one or more conditions.
  • Reshaping: Reorganize data such that the number of rows and columns change.
  • Merging: A DataFrame is merged with another. This can be a simple concatenation of frames or database-style joins.

Indexing is a general term for subsetting, slicing, and filtering.


Discussion

What are some ways to access values in a Pandas DataFrame?

Many ways to access rows and columns of a DataFrame. Source: PyData 2020b.



Consider a simple DataFrame with index values 10-12 and columns A-C. A row or column can be accessed by its index value or column label, respectively. Column label is used directly in []: df['A'] . Index values are used via .loc[]: df.loc[10] . We can also access a row by its position using .iloc[]: df.iloc[0] .
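
A minimal sketch of such a DataFrame (the values are invented for illustration) and the three access styles:

    import pandas as pd

    # A 3x3 frame with index values 10-12 and columns A-C; the values are arbitrary
    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': [4, 5, 6],
                       'C': [7, 8, 9]},
                      index=[10, 11, 12])

    col_a = df['A']      # column by label -> Series
    row_10 = df.loc[10]  # row by index value (label) -> Series
    row_0 = df.iloc[0]   # row by position -> the same row as above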

In our example, there are many ways to get the last item, a scalar value:

  • Row first: df.iloc[2]['C'], df.loc[12]['C'], df.loc[12, 'C']
  • Column first: df['C'][12], df['C'].loc[12], df['C'].iloc[2]


To access multiple rows, use df.loc[[10,11]] or df.iloc[0:2] . For multiple columns, use df[['A','B']] . To get the last two columns of the last two rows, we can write df.loc[[11,12], ['B','C']], df.iloc[1:][['B','C']], df[['B','C']].iloc[1:] or df[['B','C']].loc[[11,12]] . Integer lists (df.iloc[[1,2],[1,2]]) or slices (df.iloc[1:,1:] or df.iloc[-2:,-2:]) can also be used.

Positional and label-based indexing can be combined: df.loc[df.index[[0, 2]], 'A'] .

A callable can be used for indexing. The function takes a Series or DataFrame and must return a valid output for indexing. For example, df.loc[:, lambda df: ['A','B']] or df[lambda df: df.columns[0]] .
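
Continuing with the same invented 3x3 frame, a sketch of multi-row/column selection, mixed label/position indexing, and callable indexing:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                      index=[10, 11, 12])

    last_block = df.loc[[11, 12], ['B', 'C']]      # label-based block selection
    same_by_pos = df.iloc[-2:, -2:]                # position-based, same block

    mixed = df.loc[df.index[[0, 2]], 'A']          # positions converted to labels first

    by_callable = df.loc[:, lambda d: ['A', 'B']]  # callable returning column labels
    first_col = df[lambda d: d.columns[0]]         # callable returning a single label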

What is multi-indexing on a Pandas DataFrame?

An example of MultiIndex on both rows and columns. Source: Stringham 2019.



Hierarchical indexing or MultiIndex allows us to work with higher dimensional datasets using a 2-D DataFrame. A MultiIndex can be seen as an array of unique tuples. In fact, it can be created from a list of arrays, an array of tuples, a DataFrame, or from a product of iterables.

Referring to the figure, rows can be accessed using df.loc['a'] or df.loc['a',:] . For multi-level row indexing, use df.loc[('a','one')] , df.loc[('a','one'),:] or df.loc[(slice(None),'one'), :] . Likewise, for columns, we have df['A'], df.loc[:,'A'] or df.loc[:, (slice(None), 'x')] . Multi-level column access can be done with df['A']['x'], df['A','x'] or df.loc[:,('A','x')] . To get specific items, use df.loc[('a','one'),('B','x')] or df.loc[('a','one'),'B'] .

Partial slicing is done using df.loc[('a', 'one'):('b', 'two')] (result has four rows). Reindexing is done with df.loc[[('a', 'one'), ('b', 'two')]] (result has two rows).

With two levels on the index, df.swaplevel() will change the order of the levels.

Note that df.loc[('a','one'),:] is valid but df.loc[['a','one'],:] is not. In this context, lists and tuples have different semantics:

For indexing, a tuple is a multi-level key, whereas a list specifies several keys.
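
A sketch of a MultiIndex frame loosely resembling the figure (level names and values are assumed) and the access patterns discussed above:

    import numpy as np
    import pandas as pd

    # Hypothetical frame: 2-level row index and 2-level columns
    index = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two']],
                                       names=['first', 'second'])
    columns = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y']])
    df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

    df.loc['a']                           # all rows under first-level key 'a'
    df.loc[('a', 'one'), :]               # a tuple is one multi-level key
    df.loc[(slice(None), 'one'), :]       # all rows whose second level is 'one'
    df.loc[:, ('A', 'x')]                 # one column
    df.loc[('a', 'one'), ('B', 'x')]      # a single scalar value

    df.loc[('a', 'one'):('b', 'two')]     # partial slicing: four rows
    df.loc[[('a', 'one'), ('b', 'two')]]  # a list of keys reindexes: two rows
    df.swaplevel()                        # exchange the two index levels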

How can I filter a Pandas DataFrame?

Boolean arrays of the same size as the index can be used for filtering, that is, select rows indexed with True. For example, df.loc[~df.index.isin([11])] ignores a specific index; df[df['A']>0] selects only positive values of column A; df[(df['A'] > 2) & (df['B'] < 3)] shows a complex expression where the use of parentheses is important.
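
A sketch of boolean filtering on a small invented frame:

    import pandas as pd

    df = pd.DataFrame({'A': [1, -2, 3], 'B': [2, 5, 1]}, index=[10, 11, 12])

    df.loc[~df.index.isin([11])]       # ignore the row with index 11
    df[df['A'] > 0]                    # rows where column A is positive
    df[(df['A'] > 2) & (df['B'] < 3)]  # each condition needs its own parentheses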

The method query() can simplify the syntax for complex expressions. Thus, for columns a-d, we can write df.query('a not in b') instead of df[~df['a'].isin(df['b'])] ; df.query('a in b and c < d') instead of df[df['a'].isin(df['b']) & (df['c'] < df['d'])] .
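
A sketch comparing query() with the equivalent boolean expressions (column values are invented):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5],
                       'c': [4, 0, 6], 'd': [5, 1, 7]})

    df.query('a not in b')        # same as df[~df['a'].isin(df['b'])]
    df.query('a in b and c < d')  # same as df[df['a'].isin(df['b']) & (df['c'] < df['d'])]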

Whereas df[df.A>0] will return only the rows that match the condition, df.where(df.A>0) will return a DataFrame of the same shape as the original. Where the condition is False, NaN will be returned. The optional second argument to where() is a replacement value. The inverse boolean operation of where() is mask() : values are replaced where the condition is True.
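
A sketch contrasting boolean selection with where() and mask(), using invented data:

    import pandas as pd

    df = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, -6]})

    df[df.A > 0]           # only the rows that match the condition
    df.where(df.A > 0)     # same shape as df; non-matching rows become NaN
    df.where(df.A > 0, 0)  # non-matching rows replaced with 0 instead of NaN
    df.mask(df.A > 0)      # the inverse: matching rows become NaN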

To select only rows with null values, use df[df.isnull().any(axis=1)] . Use df[~df.isnull().any(axis=1)] to do the reverse.

For MultiIndex DataFrame, the isin() method can be used. For example, df.loc[df.index.isin([('a','one'), ('b','two')])] or df.loc[df.index.isin(['a','c'], level=0)] .
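
A sketch of null-value filtering and MultiIndex filtering with isin(), on invented data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
    df[df.isnull().any(axis=1)]   # rows containing at least one null
    df[~df.isnull().any(axis=1)]  # rows with no nulls

    idx = pd.MultiIndex.from_product([['a', 'b', 'c'], ['one', 'two']])
    dfm = pd.DataFrame({'val': range(6)}, index=idx)
    dfm.loc[dfm.index.isin([('a', 'one'), ('b', 'two')])]  # match complete keys
    dfm.loc[dfm.index.isin(['a', 'c'], level=0)]           # match on a single level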


How do I do GroupBy operations on a Pandas DataFrame?


Visualizing groupby().agg() on a DataFrame. Source: Bosler 2019.

GroupBy uses a split-apply-combine workflow. Data is split using index or column values. A function is applied to each group independently. The results are combined into a single data structure. In the apply step, data can be aggregated (sum, mean, min, etc), transformed (normalize data, fill missing values, etc), or filtered (discard small groups, filter data based on group mean, etc).

To group by a single column 'A' or index level 'a', use df.groupby('A') or df.groupby('a') . Index grouping can also be done by position: df.groupby(level=0) . To group by multiple columns or index levels, use df.groupby(['a', 'A']) . To group by all index levels except 'b', use df.groupby(level=df.index.names.difference(['b'])) . If a column and an index level share the same name 'b', use df.groupby([pd.Grouper(level='b'), 'A']) (index 'b') or df.groupby([pd.Grouper('b'), 'A']) (column 'b').

To aggregate, simply call the function after grouping: df.groupby('A').sum() ; or df.groupby('A').agg([np.sum, np.mean]) to aggregate in many ways. Each column can be aggregated differently: df.groupby('A').agg({'C': np.sum, 'D': lambda x: np.std(x, ddof=1)}) .
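
A sketch of grouping and aggregation on an invented frame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                       'C': [1.0, 2.0, 3.0, 4.0],
                       'D': [10.0, 20.0, 30.0, 40.0]})

    df.groupby('A').sum()                   # one aggregate per group
    df.groupby('A').agg([np.sum, np.mean])  # several aggregates for every column
    df.groupby('A').agg({'C': np.sum,       # a different aggregate per column
                         'D': lambda x: np.std(x, ddof=1)})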

Methods transform() , filter() , resample() , expanding() , rolling() and ewm() can be used after df.groupby() . For grouping datetime types, pd.Grouper(key, level, freq, axis, sort) is useful.
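
A sketch of transform(), filter() and pd.Grouper with a frequency, using invented data:

    import pandas as pd

    df = pd.DataFrame({'Date': pd.date_range('2021-01-01', periods=6, freq='D'),
                       'A': ['x', 'y'] * 3,
                       'val': [1, 2, 3, 4, 5, 6]})

    # transform(): each value replaced by its group mean, original shape retained
    df['group_mean'] = df.groupby('A')['val'].transform('mean')

    # filter(): keep only groups whose mean exceeds a threshold
    df.groupby('A').filter(lambda g: g['val'].mean() > 3)

    # pd.Grouper: group datetime values into 3-day bins
    df.groupby(pd.Grouper(key='Date', freq='3D'))['val'].sum()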


How can I reshape a Pandas DataFrame?


Some methods that reshape a Pandas DataFrame. Source: PyData 2020d.

When data is in "record" form, pivot() spreads rows into columns. The method melt() does the opposite. Methods stack() and unstack() are designed for MultiIndex objects, and they're closely related to pivot. They can be applied on Series or DataFrame. Here are a few examples:

  • df.pivot(index='foo', columns='bar', values='baz') : Column 'foo' becomes the index, 'bar' values become new columns, and values of 'baz' become the values of the new DataFrame. A more generalized API is df.pivot_table(), which allows for duplicate values of an index/column pair.
  • df.melt(id_vars=['A','B']) : Two columns are retained, and other columns are spread into rows. Two new columns named 'variable' (with old column names as values) and 'value' are introduced. The original index can be retained, but values will be duplicated.
  • df.stack() : Columns become part of a new inner-most index level. If DataFrame has hierarchical column labels, the level can be specified as an argument.
  • df.unstack('second') or df.unstack(1): Values of index 'second' are spread into new columns. If no argument is supplied, the inner-most index is spread.

These methods, when combined with GroupBy and aggregations, can be expressive. Examples: df.stack().mean(1).unstack() , df.groupby(level=1, axis=1).mean() , df.stack().groupby(level=1).mean() and df.mean().unstack(0) .
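
A sketch of pivot(), melt(), stack() and unstack() on a small invented "record"-form frame:

    import pandas as pd

    df = pd.DataFrame({'foo': ['one', 'one', 'two', 'two'],
                       'bar': ['A', 'B', 'A', 'B'],
                       'baz': [1, 2, 3, 4]})

    wide = df.pivot(index='foo', columns='bar', values='baz')  # rows spread into columns
    wide.reset_index().melt(id_vars=['foo'])                   # back towards "record" form

    stacked = wide.stack()  # columns move into a new inner-most index level
    stacked.unstack()       # spread the inner-most index level back into columns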


What are the ways to combine two datasets in Pandas?

Database-style joins of two Pandas DataFrame structures. Source: Jain 2020.


Perhaps the simplest to understand is concatenating two or more frames that share the same column labels. Rows of one are concatenated to the other: pd.concat([df1, df2]) . To concatenate column-wise, use pd.concat([df1, df2], axis=1) . If the index values don't match, we re-index before concatenating: pd.concat([df1, df2.reindex(df1.index)], axis=1) . Argument join takes values outer (default) and inner. An outermost level of multi-index can be created with pd.concat([df1, df2], keys=['A','B']) .
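
A sketch of concatenation (frames and values are invented):

    import pandas as pd

    df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
    df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

    pd.concat([df1, df2])                   # rows of df2 appended to df1
    pd.concat([df1, df2], axis=1)           # frames placed side by side
    pd.concat([df1, df2], join='inner')     # keep only columns common to both
    pd.concat([df1, df2], keys=['X', 'Y'])  # outermost index level marks the source frame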

Database-style joins are possible with df.join(df1) in which the argument how takes four values: left , right , outer , inner . By default, joins are based on the index. With the  on argument, we can choose to join column or index level of df with index of df1 . Joins based on MultiIndex are supported.

Method df.merge() is more flexible than join since index levels or columns can be used. If merging only columns, indices are ignored. Unlike join, cross merge (a cartesian product of both frames) is possible. Methods pd.merge() , pd.merge_ordered() and pd.merge_asof() are related.
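
A sketch of join() and merge(), including a cross merge; keys and values are invented:

    import pandas as pd

    left = pd.DataFrame({'key': ['a', 'b', 'c'], 'L': [1, 2, 3]})
    right = pd.DataFrame({'key': ['b', 'c', 'd'], 'R': [4, 5, 6]})

    # Index-based join; how can be 'left', 'right', 'outer' or 'inner'
    left.set_index('key').join(right.set_index('key'), how='inner')

    # merge() works on columns or index levels
    pd.merge(left, right, on='key', how='outer')

    # Cross merge: cartesian product of both frames (Pandas >= 1.2)
    left.merge(right, how='cross')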


Could you share some best practices or tips for using Pandas DataFrame?

Here are some tips and pointers relevant to DataFrame operations:

  • While df['A'] is used to access a column, df.A is a handy short form. This is called attribute access. The column name must be a valid Python identifier and it shouldn't conflict with an existing method or attribute. Thus, df.1 (not a valid identifier) and df.first name (contains whitespace) are not allowed, while df.min and df.index conflict with existing methods/attributes.
  • Syntax df['A'] results in a Series data structure whereas df[['A']] returns another DataFrame.
  • On a MultiIndex DataFrame, consider dfmi['one']['second'] versus dfmi.loc[:, ('one','second')] . Both give the same result, but the latter is faster. The former form is actually two operations: first dfmi['one'] returns a DataFrame, then ['second'] is chained to it.
  • To create an explicit copy, write df.copy() . The statements df1 = df; df1[0:3] = 0 don't create a copy and modify the original DataFrame.
  • Remember that loc[] is for label-based indexing and iloc[] is for integer-based indexing. Integers may be used for the former, but they're treated as labels.
  • When filtering by datetime values, their attributes can be used, such as, df.loc[(df['Date'].dt.month==3) | (df['Date'].dt.month==11)] .
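
A brief sketch, with invented values, of the copy-versus-reference tip, the loc/iloc distinction, and datetime-attribute filtering from the list above:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3],
                       'Date': pd.to_datetime(['2021-03-01', '2021-07-15', '2021-11-30'])},
                      index=[10, 11, 12])

    alias = df               # another name for the same object, not a copy
    copy = df.copy()         # an independent copy
    alias.loc[10, 'A'] = 0   # modifies df as well
    copy.loc[10, 'A'] = 99   # leaves df untouched

    df.loc[10]               # label-based: the row labelled 10
    df.iloc[0]               # position-based: the first row (the same row here)

    # Filter on datetime attributes
    df.loc[(df['Date'].dt.month == 3) | (df['Date'].dt.month == 11)]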


Milestones

Jul 2019

Anatomy of a Pandas groupby() with named aggregation. Source: Lynn 2015.


In Pandas 0.25.0, Named Aggregation is introduced. This uses a named tuple pd.NamedAgg() with two fields column and aggfunc . The latter can be set to a callable or a string alias. For a more compact code, plain tuples can be used instead. This is now the recommended replacement for using "dict-of-dicts", which was deprecated in version 0.20.0 (May 2017).
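
A sketch of named aggregation with pd.NamedAgg and the plain-tuple form, on invented data:

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3]})

    # Output column names are the keyword argument names
    df.groupby('A').agg(b_sum=pd.NamedAgg(column='B', aggfunc='sum'),
                        b_max=pd.NamedAgg(column='B', aggfunc='max'))

    # Plain tuples are a more compact equivalent
    df.groupby('A').agg(b_sum=('B', 'sum'), b_max=('B', 'max'))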


Jul 2020

In Pandas 1.1.0, the  ignore_index argument is added to melt() with default value True. If set to False, the index is retained, although duplicate values will be present in the result. For GroupBy, it's now possible to retain NA values in group keys. For example, df.groupby(by=["b"], dropna=False).sum() will retain NA values in column 'b'.
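
A sketch of both additions, assuming Pandas 1.1.0 or later and invented data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', np.nan, 'x'], 'c': [10, 20, 30]})

    df.groupby(by=['b']).sum()                  # the NA group key is dropped by default
    df.groupby(by=['b'], dropna=False).sum()    # the NA key is kept as its own group

    df.melt(id_vars=['a'], ignore_index=False)  # keep the original index in the result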


Dec 2020

In Pandas 1.2.0, exponentially weighted window operations are supported on GroupBy objects. Thus, we can write, for example,  df.groupby('A').ewm(com=1.0).mean() .
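
A sketch, assuming Pandas 1.2.0 or later and invented data:

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'val': [1.0, 2.0, 3.0, 4.0]})

    # Exponentially weighted mean computed within each group
    df.groupby('A').ewm(com=1.0).mean()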


Source: Devopedia, https://devopedia.org/pandas-dataframe-operations
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

Last modified: Tuesday, September 27, 2022, 1:14 PM