Base R: Reading Plain-Text Files

Loading plain-text files is a simple task. Plain-text files are well suited to sharing (for example, as a supplement to a publication) and to long-term archiving of information. Pay attention to the base-R options for skipping lines, reading only a certain number of lines, and controlling how strings are read; many other packages use the same options.

read.table

To load a plain-text file, use read.table. The first argument of read.table should be the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory). If the file path does not begin with your root directory, R will append it to the end of the file path that leads to your working directory. You can give read.table other arguments as well. The two most important are sep and header.
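As a minimal sketch of how R resolves a relative path, the lines below build the full path by hand. The variable names relative_path and full_path are made up for illustration, and the sketch does not assume that poker.csv actually exists:

```r
# "poker.csv" is a relative path: it does not start at the root directory.
relative_path <- "poker.csv"

# R resolves a relative path against the working directory, which is
# equivalent to building the full path yourself.
full_path <- file.path(getwd(), relative_path)
full_path

# Check that a file is where you think it is before trying to read it.
file.exists(relative_path)
```

If file.exists returns FALSE, read.table will fail with the same path, so this is a cheap check to run first.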

If the royal flush data set was saved as a file named poker.csv in your working directory, you could load it with:

poker <- read.table("poker.csv", sep = ",", header = TRUE)

sep

Use sep to tell read.table what character your file uses to separate data entries. To find this out, you might have to open your file in a text editor and look at it. If you don't specify a sep argument, read.table will try to separate cells whenever it comes to white space, such as a tab or space. R won't be able to tell you if read.table does this correctly or not, so rely on it at your own risk.
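To see why sep matters, here is a small sketch that writes a comma-separated file to a temporary location and reads it twice, once with sep = "," and once with the default. The temporary path and the sample rows are stand-ins invented for illustration, not the real poker.csv:

```r
# Write a tiny comma-separated file to a temporary path (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13"), path)

# With sep = ",", each line is split into three cells.
with_sep <- read.table(path, sep = ",", header = TRUE)
dim(with_sep)     # 2 rows, 3 columns

# With the default sep, R splits on white space only. These lines contain
# none, so each line is read as a single cell, and R raises no error.
default_sep <- read.table(path, header = TRUE)
dim(default_sep)  # 2 rows, 1 column
```

Checking dim (or str) right after loading is one way to catch a wrong separator early.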

na.strings

Oftentimes data sets will use special symbols to represent missing information. If you know that your data uses a certain symbol to represent missing entries, you can tell read.table (and the related read functions, such as read.csv) what the symbol is with the na.strings argument. read.table will convert all instances of the missing information symbol to NA, which is R's missing information symbol.

For example, suppose your poker data set contained missing values stored as a ., like this:

## "card","suit","value"
## "ace"," spades"," 14"
## "king"," spades"," 13"
## "queen",".","."
## "jack",".","."
## "ten",".","."

You could read the data set into R and convert the missing values into NAs as you go with the command:

poker <- read.table("poker.csv", sep = ",", header = TRUE, na.strings = ".")

R would save a version of poker that looks like this:

##  card    suit value
##   ace  spades    14
##  king  spades    13
## queen    <NA>    NA
##  jack    <NA>    NA
##   ten    <NA>    NA
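The same idea can be tried end to end with a throwaway file. This is only a sketch: the temporary path and the three sample rows are invented for illustration, and the fields are written without quotes to keep the example simple:

```r
# Write a small file in which "." marks a missing entry (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13",
             "queen,.,."), path)

# Treat "." as missing while reading.
poker <- read.table(path, sep = ",", header = TRUE, na.strings = ".")

is.na(poker$suit)        # FALSE FALSE TRUE
is.numeric(poker$value)  # TRUE: with "." read as NA, the column stays numeric
```

Without na.strings, the "." in the value column would force R to read the whole column as text rather than as numbers.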

skip and nrows

Sometimes a plain-text file will come with introductory text that is not part of the data set. Or, you may decide that you only wish to read in part of a data set. You can do these things with the skip and nrows arguments. (R also accepts the abbreviation nrow, because it matches partial argument names.) Use skip to tell R to skip a specific number of lines before it starts reading in values from the file. Use nrows to tell R to stop reading in values after it has read in a certain number of rows.

For example, imagine that the complete royal flush file looks like this:

This data was collected by the National Poker Institute. 
We accidentally repeated the last row of data.

"card", "suit", "value"
"ace", "spades", 14
"king", "spades", 13
"queen", "spades", 12
"jack", "spades", 11
"ten", "spades", 10
"ten", "spades", 10

You can read just the six lines that you want (five rows plus a header) with:

read.table("poker.csv", sep = ",", header = TRUE, skip = 3, nrows = 5)
##    card    suit value
## 1   ace  spades    14
## 2  king  spades    13
## 3 queen  spades    12
## 4  jack  spades    11
## 5   ten  spades    10

Notice that the header row doesn't count towards the total rows allowed by nrows.
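A self-contained sketch of the same call follows. It recreates the file in a temporary location (the path is a stand-in, and the fields are written without quotes for simplicity) and uses readLines to peek at the top of the file, which is one way to decide how many lines to skip:

```r
# Recreate the file with two lines of notes and a blank line before the data.
path <- tempfile(fileext = ".csv")
writeLines(c("This data was collected by the National Poker Institute.",
             "We accidentally repeated the last row of data.",
             "",
             "card,suit,value",
             "ace,spades,14",
             "king,spades,13",
             "queen,spades,12",
             "jack,spades,11",
             "ten,spades,10",
             "ten,spades,10"), path)

# Peek at the first few lines to see where the header row starts.
readLines(path, n = 4)

# Skip the three leading lines, then read the header plus five rows.
poker <- read.table(path, sep = ",", header = TRUE, skip = 3, nrows = 5)
nrow(poker)  # 5
```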

stringsAsFactors

R reads in numbers just as you'd expect, but in versions of R before 4.0.0, R begins to act strangely when it comes across character strings (e.g., letters and words): it wants to convert every character string into a factor. This was R's default behavior, but I think it was a mistake. Sometimes factors are useful. At other times, they're clearly the wrong data type for the job. Factors can also cause weird behavior, especially when you want to display data, and this can be surprising if you didn't realize that R converted your data to factors. In general, you'll have a smoother R experience if you don't let R make factors until you ask for them. Thankfully, it is easy to do this. (Since R 4.0.0, read.table leaves character strings alone by default, so the steps below matter mainly for older versions of R and for code written for them.)

Setting the argument stringsAsFactors to FALSE will ensure that R saves any character strings in your data set as character strings, not factors. To use stringsAsFactors, you'd write:

read.table("poker.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

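If you will be loading more than one data file in a version of R before 4.0.0, you can change the default factoring behavior at the global level with:

options(stringsAsFactors = FALSE)

This will ensure that all strings will be read as strings, not as factors, until you end your R session or restore the old global default by running:

options(stringsAsFactors = TRUE)

Recent versions of R deprecate this global option, so passing stringsAsFactors directly to the read function is the more reliable choice.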
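The effect of the argument is easy to check on a small file. In this sketch the temporary path and the sample rows are invented for illustration, and the argument is passed explicitly in both calls so that the result does not depend on the R version:

```r
# Write a small comma-separated file (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13"), path)

as_strings <- read.table(path, sep = ",", header = TRUE,
                         stringsAsFactors = FALSE)
as_factors <- read.table(path, sep = ",", header = TRUE,
                         stringsAsFactors = TRUE)

class(as_strings$card)  # "character"
class(as_factors$card)  # "factor"

# Convert a column to a factor only when you actually want one.
as_strings$suit <- factor(as_strings$suit)
levels(as_strings$suit)  # "spades"
```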


The read Family

R also comes with some prepackaged shortcuts for read.table, shown in Table D.1.

Table D.1: R's read functions. You can override any of the default arguments as necessary.
Function     Defaults                                Use
read.table   sep = "", header = FALSE                General-purpose read function
read.csv     sep = ",", header = TRUE                Comma-separated values (CSV) files
read.delim   sep = "\t", header = TRUE               Tab-delimited files
read.csv2    sep = ";", header = TRUE, dec = ","     CSV files with European decimal format
read.delim2  sep = "\t", header = TRUE, dec = ","    Tab-delimited files with European decimal format

The first shortcut, read.csv, behaves just like read.table but automatically sets sep = "," and header = TRUE, which can save you some typing:

poker <- read.csv("poker.csv")

read.delim automatically sets sep to the tab character ("\t"), which is very handy for reading tab-delimited files. These are files where each cell is separated by a tab. read.delim also sets header = TRUE by default.

read.delim2 and read.csv2 exist for European R users. These functions tell R that the data uses a comma instead of a period to denote decimal places. (If you're wondering how this works with CSV files, CSV2 files usually separate cells with a semicolon, not a comma.)
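As a rough illustration of the CSV2 convention, the sketch below writes a semicolon-separated file with decimal commas and reads it two ways. The temporary path and the odds column are hypothetical and are not part of the royal flush data:

```r
# Semicolons separate cells; commas mark decimals (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("card;suit;odds",
             "ace;spades;0,25",
             "king;spades;0,5"), path)

# read.csv2 supplies sep = ";", dec = "," and header = TRUE for you.
eu <- read.csv2(path)
eu$odds  # 0.25 0.50, read as numbers

# The same result with read.table and the arguments spelled out.
same <- read.table(path, sep = ";", dec = ",", header = TRUE)
all.equal(eu, same)  # TRUE
```

The shortcut functions also differ from read.table in a few other defaults (such as quote and fill), which does not matter for a file this simple.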

Import Dataset

You can also load plain-text files with RStudio's Import Dataset button, as described in Loading Data. Import Dataset provides a GUI version of read.table.


read.fwf

One type of plain-text file defies the pattern by using its layout to separate data cells. Each row is placed in its own line (as with other plain-text files), and then each column begins at a specific number of characters from the left-hand side of the document. To achieve this, an arbitrary number of character spaces is added to the end of each entry to correctly position the next entry. These documents are known as fixed-width files and usually end with the extension .fwf.

Here's one way the royal flush data set could look as a fixed-width file. In each row, the suit entry begins exactly 10 characters from the start of the line. It doesn't matter how many characters appeared in the first cell of each row:

card      suit       value
ace       spades     14
king      spades     13  
queen     spades     12  
jack      spades     11  
10        spades     10

Fixed-width files look nice to human eyes (but no better than a tab-delimited file); however, they can be difficult to work with. Perhaps because of this, R comes with a function for reading fixed-width files, but no function for saving them. Unfortunately, US government agencies seem to like fixed-width files, and you'll likely encounter one or more during your career.

You can read fixed-width files into R with the function read.fwf. The function takes the same arguments as read.table but requires an additional argument, widths, which should be a vector of numbers. The i-th entry of the widths vector should state the width (in characters) of the i-th column of the data set.

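If the aforementioned fixed-width royal flush data was saved as poker.fwf in your working directory, you could read it with the command below. In the file shown above, the three columns are 10, 11, and 5 characters wide. read.fwf expects any header line to be delimited by its sep argument (a tab by default) rather than laid out in fixed widths, so this call skips the header line and supplies the column names directly:

poker <- read.fwf("poker.fwf", widths = c(10, 11, 5), skip = 1, col.names = c("card", "suit", "value"))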
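Because base R has no function for writing fixed-width files, this sketch builds one with sprintf, padding the first column to 10 characters and the second to 11, and then reads it back. The temporary path, the padding widths, and the use of strip.white = TRUE (an argument passed on to read.table to trim the padding from text columns) are choices made for this illustration:

```r
# Build fixed-width lines: column 1 padded to 10 characters, column 2 to 11.
lines <- sprintf("%-10s%-11s%s",
                 c("card", "ace", "king", "queen"),
                 c("suit", "spades", "spades", "spades"),
                 c("value", "14", "13", "12"))
path <- tempfile(fileext = ".fwf")
writeLines(lines, path)

# The header line is laid out in fixed widths rather than tab-delimited,
# so skip it and supply the column names directly.
poker <- read.fwf(path,
                  widths = c(10, 11, 5),
                  skip = 1,
                  col.names = c("card", "suit", "value"),
                  strip.white = TRUE)

poker$value  # 14 13 12
```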


Source: G. Grolemund, https://rstudio-education.github.io/hopr/dataio.html
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

Last modified: Thursday, December 15, 2022, 4:48 PM