Strings
Although not all of us are linguists or text analysts, R functions for
operating with text strings are still useful. They will come in handy
when you need to match records in the data or select a portion of a
textual record (for example, only the first name and not the surname).
This section covers the basics of these operations.
String basics
You can create strings with either single quotes or double quotes. Unlike in some other languages, there is no difference in behaviour. I recommend always using ", unless you want to create a string that contains multiple ".
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
If you forget to close a quote, you'll see +, the continuation character:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
If this happens to you, press Escape and try again!
To include a literal single or double quote in a string you can use \ to "escape" it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
That means if you want to include a literal backslash, you'll need to double it up: "\\".
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():
x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
There are a handful of other special characters. The most common are "\n", newline, and "\t", tab, but you can see the complete list by requesting help on ": ?'"', or ?"'". You'll also sometimes see strings like "\u00b5"; this is a way of writing non-English characters that works on all platforms:
x <- "\u00b5"
x
#> [1] "µ"
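To make the difference between the printed form and the stored contents concrete, here is a small sketch using only base R. The example strings are my own; nchar() counts the characters that are actually stored, which is a quick way to check what an escape really produced.

```r
# Four strings written with escapes: a quote, a backslash, a tab, a newline.
x <- c("\"", "\\", "a\tb", "line1\nline2")

# nchar() counts stored characters, not the escapes you typed.
nchar(x)
#> [1]  1  1  3 11

# print() shows the escaped form; writeLines() shows the raw contents.
print(x)
writeLines(x)

# "\u00b5" is a single character (the micro sign), not six.
micro <- "\u00b5"
nchar(micro)
#> [1] 1
```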
Multiple strings are often stored in a character vector, which you can create with c():
c("one", "two", "three")
#> [1] "one" "two" "three"
String length
Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember. Instead we'll use functions from stringr. These have more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:
str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA
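As one small illustration of the kind of inconsistency meant here, the sketch below compares base R's nchar() with str_length() on a bare NA, and then uses str_length() to filter a vector. It assumes stringr is installed; the words vector is just example data.

```r
library(stringr)

# A bare (logical) NA: base R coerces it to the two-character string "NA".
nchar(NA)
#> [1] 2

# stringr keeps the value missing.
str_length(NA)
#> [1] NA

# str_length() is vectorised, so it can be used to filter by length.
words <- c("a", "R for data science", NA)
long_words <- words[!is.na(words) & str_length(words) > 1]
long_words
#> [1] "R for data science"
```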
The common str_ prefix is particularly useful if you use RStudio, because typing str_ will trigger autocomplete, allowing you to see all stringr functions.
Combining strings
To combine two or more strings, use str_c():
str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"
Use the sep argument to control how they're separated:
str_c("x", "y", sep = ", ")
#> [1] "x, y"
As with most other functions in R, missing values are contagious. If you want them to print as "NA", use str_replace_na():
x <- c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"
As shown above, str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
NULL arguments, which have length 0, are silently dropped. This is particularly useful in conjunction with if, because an if whose condition is FALSE and that has no else branch returns NULL:
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
#> [1] "Good morning Hadley."
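The same trick works nicely inside small helper functions with optional pieces. The sketch below uses a made-up greet() function (it is not part of stringr, and its argument names are my own) in which each optional fragment is produced by an if that returns NULL when it doesn't apply.

```r
library(stringr)

# greet() is a hypothetical helper written for this example.
greet <- function(name, title = NULL, birthday = FALSE) {
  str_c(
    "Hello ",
    if (!is.null(title)) str_c(title, " "),  # NULL when no title is given
    name,
    if (birthday) ", happy birthday",        # NULL when birthday is FALSE
    "!"
  )
}

greet("Hadley")
#> [1] "Hello Hadley!"
greet("Hadley", title = "Dr.", birthday = TRUE)
#> [1] "Hello Dr. Hadley, happy birthday!"
```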
To collapse a vector of strings into a single string, use collapse:
str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
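It can help to see sep and collapse used together, since they act at different stages. In the sketch below (the names are just example data), sep joins the arguments element by element, and collapse then joins the resulting elements into one string.

```r
library(stringr)

first <- c("Ada", "Grace")
last <- c("Lovelace", "Hopper")

# sep joins the arguments element by element...
full <- str_c(first, last, sep = " ")
full
#> [1] "Ada Lovelace" "Grace Hopper"

# ...and collapse then joins those elements into a single string.
one_line <- str_c(first, last, sep = " ", collapse = "; ")
one_line
#> [1] "Ada Lovelace; Grace Hopper"
```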
Subsetting strings
You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments which give the (inclusive) position of the substring:
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"
Note that str_sub() won't fail if the string is too short; it will just return as much as possible:
str_sub("a", 1, 5)
#> [1] "a"
You can also use the assignment form of str_sub() to modify strings:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
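Here are two more uses of str_sub() and its assignment form, as a sketch on made-up data: capitalising the first letter of each word, and masking everything except the last four characters of an identifier. The fruit and ids vectors are invented for illustration.

```r
library(stringr)

# Capitalise only the first letter of each string.
fruit <- c("apple", "banana", "pear")
str_sub(fruit, 1, 1) <- str_to_upper(str_sub(fruit, 1, 1))
fruit
#> [1] "Apple"  "Banana" "Pear"

# Replace everything before the last four characters with a fixed mask.
# An end of -5 means "up to the fifth character from the end".
ids <- c("1234567890", "987654")
str_sub(ids, 1, -5) <- "****"
ids
#> [1] "****7890" "****7654"
```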
Locales
Above I used str_to_lower() to change the text to lower case. You can also use str_to_upper() or str_to_title().
However, changing case is more complicated than it might at first
appear because different languages have different rules for changing
case. You can pick which set of rules to use by specifying a locale:
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"
The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you don't already know the code for your language, Wikipedia has a good list. Note that recent versions of stringr default to the English locale ("en") when you don't specify one, rather than the locale provided by your operating system, so it is safest to state the locale explicitly whenever it matters.
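To see the effect on a whole word rather than a single letter, here is a short sketch that upper-cases the same string under English and Turkish rules. The word is just example data, and both locales are passed explicitly so the result doesn't depend on any default.

```r
library(stringr)

city <- "istanbul"

# English rules: i becomes a dotless capital I.
upper_en <- str_to_upper(city, locale = "en")
upper_en
#> [1] "ISTANBUL"

# Turkish rules: i becomes a dotted capital I ("\u0130").
upper_tr <- str_to_upper(city, locale = "tr")
upper_tr
#> [1] "İSTANBUL"

# The two results differ only in their first character.
upper_en == upper_tr
#> [1] FALSE
```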
Another important operation that's affected by the locale is sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use str_sort() and str_order() which take an additional locale argument:
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") # English
#> [1] "apple" "banana" "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple" "eggplant" "banana"
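str_sort() returns the sorted strings themselves, whereas str_order() returns the positions that would sort them, which is what you need to reorder a second vector in step with the first. A minimal sketch, with made-up prices purely for illustration:

```r
library(stringr)

fruit <- c("banana", "apple", "eggplant")
price <- c(0.25, 0.40, 1.10)  # invented values, one per fruit

# Positions that would put fruit in English alphabetical order.
idx <- str_order(fruit, locale = "en")
idx
#> [1] 2 1 3

# Apply the same ordering to both vectors so they stay aligned.
fruit[idx]
#> [1] "apple"    "banana"   "eggplant"
price[idx]
#> [1] 0.40 0.25 1.10
```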