Data Structures and Types

Categorical Data

pandas provides the Categorical class and a category dtype. This dtype is analogous to the factor data type in R. It is meant for categorical or discrete variables that we want to use in analyses. Categorical variables typically take on a small number of unique values, such as gender, blood type, country of origin, or race.

You can create categorical Series in a couple of ways:

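import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # any seed works; fixed here only for reproducibility

s = pd.Series(['a','b','c'], dtype='category')
df = pd.DataFrame({
     'A':3.,
     'B':rng.random_sample(5),
     'C': pd.Timestamp('20200512'),
     'D': np.array([6] * 5),
     'E': pd.Categorical(['yes','no','no','yes','no']),
     'F': 'NIH'})
df['F'].astype('category')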
0    NIH
1    NIH
2    NIH
3    NIH
4    NIH
Name: F, dtype: category
Categories (1, object): [NIH]

You can also create DataFrames in which every column is categorical:

df = pd.DataFrame({'A': list('abcd'), 'B': list('bdca')}) 
df_cat = df.astype('category') 
df_cat.dtypes 
A    category
B    category
dtype: object

You can explore categorical data in a variety of ways:

df_cat['A'].describe() 
count     4
unique    4
top       d
freq      1
Name: A, dtype: object

df_cat['A'].value_counts()
d    1
b    1
a    1
c    1
Name: A, dtype: int64
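
Beyond describe and value_counts, a categorical Series exposes its category labels and the integer codes stored for each observation through the .cat accessor. The following is a minimal sketch using a small hypothetical Series named blood, which is not part of the DataFrames above:

```python
import pandas as pd

# Hypothetical example data, not taken from the DataFrames above
blood = pd.Series(['A', 'O', 'B', 'O', 'AB', 'O'], dtype='category')

# Category labels; when inferred from the data they are sorted lexically
labels = list(blood.cat.categories)

# Integer code of each observation: its position within the categories
codes = blood.cat.codes.tolist()

# Frequency table indexed by category
counts = blood.value_counts()

print(labels)       # ['A', 'AB', 'B', 'O']
print(codes)        # [0, 3, 2, 3, 1, 3]
print(counts['O'])  # 3
```

The codes are what pandas actually stores for each row, which is why a categorical column with few distinct values is usually more compact than the equivalent column of strings.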

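One issue with categories is that a level not seen when the categorical was created is not among its categories, and assigning such a value later raises an error. You can therefore pre-specify all the categories you expect, including ones that do not yet occur in the data: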

df_cat['B'] = pd.Categorical(list('aabb'), categories = ['a','b','c','d']) 
df_cat['B'].value_counts() 
b    2
a    2
d    0
c    0
Name: B, dtype: int64
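
To make the failure mode concrete, the sketch below uses a hypothetical Categorical named sizes to show two behaviours of a fixed category set: values outside the declared categories are stored as missing at construction, and assigning an undeclared value afterwards raises an error. The exact exception type has differed between pandas versions, so catching both TypeError and ValueError here is an assumption made for portability.

```python
import pandas as pd

# Hypothetical data: 'xl' is not among the declared categories
sizes = pd.Categorical(['s', 'm', 'xl'], categories=['s', 'm', 'l'])

# Values outside the declared categories are stored as missing (NaN)
n_missing = int(pd.isna(sizes).sum())

# Assigning an undeclared value to an existing Categorical raises an error
try:
    sizes[0] = 'xxl'
    raised = False
except (TypeError, ValueError):
    raised = True

# Declaring the category first makes the same assignment valid
sizes = sizes.add_categories(['xxl'])
sizes[0] = 'xxl'

print(n_missing, raised, list(sizes.categories))
```

Declaring the full set of categories up front, as in the example above with categories = ['a','b','c','d'], avoids having to call add_categories later.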


Re-organizing categories

In categorical data, there is often the concept of a "first" or "reference" category and an ordering of categories. This tends to be important in both visualization and regression modeling. Both aspects can be addressed using the reorder_categories method, which is available through the .cat accessor of a categorical Series.

In our earlier example, we can see that the A variable has 4 categories, with the "first" category being "a".

df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [a, b, c, d]

Suppose we want to reverse this ordering, so that "d" is the "first" category, followed by "c", "b" and "a".

df_cat['A'] = df_cat.A.cat.reorder_categories(['d','c','b','a']) 
df_cat.A
0    a
1    b
2    c
3    d
Name: A, dtype: category
Categories (4, object): [d, c, b, a]
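
Note that reorder_categories changes only the order of the category labels; the data values stay in place, as the output above shows. The sketch below, which uses a hypothetical Series named grade, illustrates two related points: passing ordered=True also marks the categories as ordered, so that min and max are defined by the category order, and the new list must contain exactly the existing categories.

```python
import pandas as pd

# Hypothetical example data mirroring column A above
grade = pd.Series(list('abcd'), dtype='category')

# Reverse the category order and mark the categorical as ordered
grade = grade.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

# Sorting follows the category order rather than alphabetical order
sorted_values = grade.sort_values().tolist()

# For an ordered categorical, min and max follow the category order
lowest, highest = grade.min(), grade.max()

# The new list must hold exactly the existing categories
try:
    grade.cat.reorder_categories(['d', 'c', 'b'])
    rejected = False
except ValueError:
    rejected = True

print(sorted_values)    # ['d', 'c', 'b', 'a']
print(lowest, highest)  # d a
print(rejected)         # True
```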