Factors

Factors are the way categorical variables are stored in R. For example, treatment levels in ANOVA (analysis of variance) are considered factors, and months or quarters of the year can be represented as factors for modeling seasonality. You should learn how to create factors and how to rename and reorder their levels, both for convenience and for correct analysis. For example, the control treatment should usually be the first level of a factor because, by default, linear models compare every other level to the first one.
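As a rough illustration of the points above, here is a minimal base R sketch. The treatment labels and the month values are made up for the example; only factor(), levels(), relevel() and the built-in constant month.abb are used.

```r
# Hypothetical treatment labels, invented for illustration
treatment <- c("drug_a", "placebo", "drug_b", "placebo", "drug_a")

# With no levels given, factor() sorts the levels alphabetically,
# so "drug_a" would become the reference level in a linear model
f_default <- factor(treatment)
levels(f_default)
#> [1] "drug_a"  "drug_b"  "placebo"

# relevel() moves the control group to the first position
f_control <- relevel(f_default, ref = "placebo")
levels(f_control)
#> [1] "placebo" "drug_a"  "drug_b"

# Months as a factor: supplying the levels fixes the calendar order
# instead of the alphabetical one
months <- factor(c("Mar", "Jan", "Feb", "Jan"), levels = month.abb)
levels(months)[1:3]
#> [1] "Jan" "Feb" "Mar"
```

Setting the reference level once on the factor, rather than inside each model call, keeps every later model and plot consistent.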

General Social Survey

For the rest of this chapter, we're going to focus on forcats::gss_cat. It's a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.

gss_cat
#> # A tibble: 21,483 x 9
#>    year marital      age race  rincome    partyid     relig     denom    tvhours
#>   <int> <fct>      <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#> 1  2000 Never mar…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
#> 2  2000 Divorced      48 White $8000 to … Not str re… Protesta… Baptist…      NA
#> 3  2000 Widowed       67 White Not appli… Independent Protesta… No deno…       2
#> 4  2000 Never mar…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
#> 5  2000 Divorced      25 White Not appli… Not str de… None      Not app…       1
#> 6  2000 Married       25 White $20000 - … Strong dem… Protesta… Souther…      NA
#> # … with 21,477 more rows

(Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)

When factors are stored in a tibble, you can't see their levels so easily. One way to see them is with count():

gss_cat %>%
  count(race)
#> # A tibble: 3 x 2
#>   race      n
#>   <fct> <int>
#> 1 Other  1959
#> 2 Black  3129
#> 3 White 16395

Or with a bar chart:

ggplot(gss_cat, aes(race)) +
  geom_bar()

By default, ggplot2 will drop levels that don't have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

These levels represent valid values that simply did not occur in this dataset. In dplyr, count() and group_by() also drop empty levels by default; since dplyr 0.8.0 you can keep them by setting .drop = FALSE.
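Empty levels can also be inspected without a plot. The sketch below uses a small made-up factor that mimics the level set of race (it is not the gss_cat data itself) and relies only on base R: levels() and table() report every level, used or not, while droplevels() removes the unused ones.

```r
# A toy factor, invented for illustration: four declared levels,
# but "Not applicable" never occurs in the values
race <- factor(
  c("White", "Black", "White", "Other"),
  levels = c("Other", "Black", "White", "Not applicable")
)

# levels() lists all four declared levels, including the unused one
levels(race)

# table() keeps the empty level and reports it with a count of 0
table(race)

# droplevels() discards levels that have no observations
race_used <- droplevels(race)
nlevels(race_used)
#> [1] 3
```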

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
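To make the two operations concrete before the detailed sections, here is a small base R sketch. The party labels are invented for the example and are not taken from gss_cat; forcats provides dedicated helpers for both tasks, but this version needs no packages.

```r
# Made-up party labels, for illustration only
party <- factor(c("Dem", "Rep", "Ind", "Dem", "Dem", "Rep"))
levels(party)
#> [1] "Dem" "Ind" "Rep"

# Changing the order: put the most frequent level first
by_freq <- names(sort(table(party), decreasing = TRUE))
party_ordered <- factor(party, levels = by_freq)
levels(party_ordered)
#> [1] "Dem" "Rep" "Ind"

# Changing the values: map each old label to a new one.
# Assigning a named list to levels() renames by name, not by position.
party_renamed <- party_ordered
levels(party_renamed) <- list(
  Democrat    = "Dem",
  Republican  = "Rep",
  Independent = "Ind"
)
levels(party_renamed)
#> [1] "Democrat"    "Republican"  "Independent"
```

Reordering changes only how the levels are arranged, while renaming changes the labels themselves; in both cases each observation stays in the same category.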


Exercises

  1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

  2. What is the most common relig in this survey? What's the most common partyid?

  3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?