pandas DataFrames
Data Structures and Types
Categorical Data
pandas provides a Categorical constructor and a category dtype. This type is analogous to the factor data type in R. It is meant to represent categorical or discrete variables that we want to use in analyses. Categorical variables typically take on a small number of unique values, such as gender, blood type, country of origin, or race.
You can create a categorical Series in a couple of ways:
s = pd.Series(['a','b','c'], dtype='category')
df = pd.DataFrame({
    'A': 3.,
    'B': rng.random_sample(5),
    'C': pd.Timestamp('20200512'),
    'D': np.array([6] * 5),
    'E': pd.Categorical(['yes', 'no', 'no', 'yes', 'no']),
    'F': 'NIH',
})
df['F'].astype('category')

0    NIH
1    NIH
2    NIH
3    NIH
4    NIH
Name: F, dtype: category
Categories (1, object): [NIH]
You can also create a DataFrame in which every column is categorical:
df = pd.DataFrame({'A': list('abcd'), 'B': list('bdca')})
df_cat = df.astype('category')
df_cat.dtypes

A    category
B    category
dtype: object
You can explore categorical data in a variety of ways:
df_cat['A'].describe()
count     4
unique    4
top       d
freq      1
Name: A, dtype: object
df['A'].value_counts()
d    1
b    1
a    1
c    1
Name: A, dtype: int64
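Beyond describe and value_counts, the .cat accessor exposes the pieces a categorical is built from. The following is a small sketch using a made-up Series s (not one of the objects defined above) to show the stored levels and the integer codes that point into them:

```python
import pandas as pd

# Illustrative data only; not part of the earlier examples
s = pd.Series(list('abca'), dtype='category')

# The distinct levels, in their stored order
levels = list(s.cat.categories)

# Integer codes: each value's position within the levels
codes = s.cat.codes.tolist()

print(levels)  # ['a', 'b', 'c']
print(codes)   # [0, 1, 2, 0]
```

The codes are what pandas actually stores for each row, which is why a categorical column with few levels can take less memory than the equivalent column of strings.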
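One issue with categorical data is that a value which is not among a variable's existing categories cannot be assigned to it; attempting to do so raises an error. To avoid this, you can pre-specify the categories you expect, including ones that do not yet appear in the data: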
df_cat['B'] = pd.Categorical(list('aabb'), categories=['a', 'b', 'c', 'd'])
df_cat['B'].value_counts()

b    2
a    2
d    0
c    0
Name: B, dtype: int64
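To make the failure mode concrete, here is a minimal sketch that works directly on a Categorical object. The variable names and the level 'z' are illustrative assumptions. The exception type has differed across pandas versions, so the sketch catches both TypeError and ValueError rather than asserting one of them.

```python
import pandas as pd

# Declare 'c' up front even though it does not appear in the data yet
cat = pd.Categorical(list('aabb'), categories=['a', 'b', 'c'])

# 'c' is a declared category, so this assignment is accepted
cat[0] = 'c'

# 'z' was never declared, so pandas refuses the assignment
try:
    cat[1] = 'z'
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Declaring the level first makes the same assignment valid
cat = cat.add_categories(['z'])
cat[1] = 'z'

print(rejected)               # True
print(list(cat))              # ['c', 'z', 'b', 'b']
print(list(cat.categories))   # ['a', 'b', 'c', 'z']
```

Pre-specifying categories at construction time and calling add_categories later lead to the same place; the former is simply more convenient when the full set of levels is known in advance.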
Re-organizing categories
In categorical data, there is often the concept of a "first" or "reference" category, as well as an ordering of the categories. Both tend to be important in visualization and in regression modeling, and both can be set using the reorder_categories method, which is available through the .cat accessor of a categorical Series.
In our earlier example, we can see that the A variable has 4 categories, with the "first" category being "a".
df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [a, b, c, d]
Suppose we want to reverse this ordering, so that "d" is the "first" category, followed by "c", "b" and "a".
df_cat['A'] = df_cat.A.cat.reorder_categories(['d','c','b','a'])
df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [d, c, b, a]
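Note that the values themselves are unchanged; only the order of the levels differs. As a sketch of what that level order affects, the example below rebuilds a small Series (named s and s_rev here for illustration) and additionally passes ordered=True, which the example above does not do, so that comparisons such as min and max are defined:

```python
import pandas as pd

# Illustrative data mirroring column A from the example above
s = pd.Series(list('abcd'), dtype='category')

# Reverse the level order and mark the categorical as ordered
s_rev = s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

values = s_rev.tolist()                     # data is untouched
levels = list(s_rev.cat.categories)         # new level order
sorted_values = s_rev.sort_values().tolist()  # sorting follows level order

print(values)         # ['a', 'b', 'c', 'd']
print(levels)         # ['d', 'c', 'b', 'a']
print(sorted_values)  # ['d', 'c', 'b', 'a']

# min and max are defined for ordered categoricals
print(s_rev.min(), s_rev.max())  # d a
```

reorder_categories requires the new list to contain exactly the existing categories; to add or drop levels at the same time, set_categories is the method to reach for instead.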