Factors

Factors are how R stores categorical variables. For example, treatment levels in ANOVA (analysis of variance) are factors, and months or quarters of the year can be represented as factors to model seasonality. You should learn how to create factors and how to rename and reorder their levels, both for convenience and for correct analysis. For example, the control treatment should usually be the first level of a factor because, by default, linear models compare every other level to the first one.
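As a minimal sketch of the point about the control treatment, the snippet below uses base R's relevel() to move a reference level to the front. The treatment labels ("drugA", "drugB", "placebo") are invented for illustration and do not come from any real data set.

```r
# Hypothetical treatment labels, invented for illustration
treatment <- factor(c("drugA", "placebo", "drugB", "placebo"))
levels(treatment)  # default order is alphabetical, so "placebo" is last
#> [1] "drugA"   "drugB"   "placebo"

# relevel() moves the chosen reference level to the first position
treatment <- relevel(treatment, ref = "placebo")
levels(treatment)
#> [1] "placebo" "drugA"   "drugB"
```

With "placebo" as the first level, a model fitted with the default treatment contrasts would compare each drug against the placebo.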

Creating factors

Imagine that you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there's nothing saving you from typos:

    x2 <- c("Dec", "Apr", "Jam", "Mar")

  2. It doesn't sort in a useful way:

    sort(x1)
    #> [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor, start by creating a character vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create a factor:

y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
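The sort order follows the levels because a factor stores each value as an integer position in its level vector. A short sketch, repeating the definitions above so that it runs on its own:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)

# Each value is stored as its position in month_levels
as.integer(y1)
#> [1] 12  4  1  3
```

sort() orders these integer codes, which is why Jan comes before Mar, Apr, and Dec.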

Any values not in the set of levels are silently converted to NA:

y2 <- factor(x2, levels = month_levels)
y2
#> [1] Dec  Apr  <NA> Mar 
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
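Because the conversion is silent, it can help to check the raw values against the level set before building the factor. One possible base R check, using setdiff() (the variable name bad is only illustrative):

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that appear in the data but are not valid levels
bad <- setdiff(x2, month_levels)
bad
#> [1] "Jam"
```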

If you want a warning, you can use readr::parse_factor():

y2 <- readr::parse_factor(x2, levels = month_levels)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set    Jam

If you omit the levels, they'll be taken from the data in alphabetical order:

factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar

Sometimes you'd prefer that the order of the levels match the order of first appearance in the data. You can do that when creating the factor by setting levels to unique(x1), or after the fact with fct_inorder() from the forcats package (the %>% pipe used below is available once the tidyverse is loaded):

f1 <- factor(x1, levels = unique(x1))
f1
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar

f2 <- x1 %>% factor() %>% fct_inorder()
f2
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar

If you ever need to access the set of valid levels directly, you can do so with levels():

levels(f2)
#> [1] "Dec" "Apr" "Jan" "Mar"
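levels() also has an assignment form, which is one base R way to rename a level. A small sketch, assuming you want to spell out "Jan" (the new label "January" is just an example):

```r
f <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = c("Dec", "Apr", "Jan", "Mar"))

# Rename a single level; every value with that level is updated too
levels(f)[levels(f) == "Jan"] <- "January"
f
#> [1] Dec     Apr     January Mar
#> Levels: Dec Apr January Mar

nlevels(f)  # the number of levels is unchanged
#> [1] 4
```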