Loading plain-text files is a simple task. Plain-text files are the best format for sharing (for example, as a supplement to a publication) and for long-term archiving of information. Pay attention to the base-R options for skipping lines, reading only a certain number of lines, and handling strings; these options are often used by other packages too.
read.table
To load a plain-text file, use read.table. The first argument of read.table should be the name of your file (if it is in your working directory), or the file path to your file (if it is not in your working directory). If the file path does not begin with your root directory, R will append it to the end of the file path that leads to your working directory. You can give read.table other arguments as well. The two most important are sep and header.
If the royal flush data set was saved as a file named poker.csv in your working directory, you could load it with:
poker <- read.table("poker.csv", sep = ",", header = TRUE)
sep
Use sep to tell read.table what character your file uses to separate data entries. To find this out, you might have to open your file in a text editor and look at it. If you don't specify a sep argument, read.table will try to separate cells whenever it comes to white space, such as a tab or space. R won't be able to tell you if read.table does this correctly or not, so rely on it at your own risk.
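To see how a missing sep can go wrong without any warning, here is a minimal sketch. It writes a small comma-separated file to a temporary location (the file contents and the object names guess and poker are made up for illustration) and reads it both ways:

```r
# Write a tiny comma-separated file to a temporary location (illustrative data)
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13"), path)

# Without sep, read.table splits only on white space, so every line
# ends up in a single cell and no error is raised
guess <- read.table(path, header = TRUE)
ncol(guess)   # 1

# With sep = ",", the three columns are separated as intended
poker <- read.table(path, sep = ",", header = TRUE)
ncol(poker)   # 3
```

Checking ncol() or str() right after loading is a cheap way to catch this kind of silent mistake.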
header
Use header to tell read.table whether the first line of the file contains variable names instead of values. If the first line of the file is a set of variable names, you should set header = TRUE.
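The effect of header is easiest to see side by side. The following sketch uses a made-up temporary file; the names no_header and with_header are only for illustration:

```r
# Write a tiny comma-separated file whose first line holds variable names
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13"), path)

# header = FALSE: the names are read as a row of data, and R supplies
# its own column names
no_header <- read.table(path, sep = ",", header = FALSE)
names(no_header)   # "V1" "V2" "V3"
nrow(no_header)    # 3

# header = TRUE: the first line becomes the column names
with_header <- read.table(path, sep = ",", header = TRUE)
names(with_header) # "card" "suit" "value"
nrow(with_header)  # 2
```

Note that when the names are read as data, the value column can no longer be numeric, because its first entry is the word "value".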
na.strings
Oftentimes data sets will use special symbols to represent missing information. If you know that your data uses a certain symbol to represent missing entries, you can tell read.table (and the other functions in the read family) what the symbol is with the na.strings argument. read.table will convert all instances of the missing information symbol to NA, which is R's missing information symbol.
For example, imagine that your poker data set contained missing values stored as a ., like this:
## "card","suit","value"
## "ace"," spades"," 14"
## "king"," spades"," 13"
## "queen",".","."
## "jack",".","."
## "ten",".","."
You could read the data set into R and convert the missing values into NAs as you go with the command:
poker <- read.table("poker.csv", sep = ",", header = TRUE, na.strings = ".")
R would save a version of poker that looks like this:
##    card   suit value
##     ace spades    14
##    king spades    13
##   queen   <NA>    NA
##    jack   <NA>    NA
##     ten   <NA>    NA
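Here is a self-contained sketch of the same idea that you can run without having poker.csv on disk. The temporary file and its two data rows are invented for illustration:

```r
# Write a small file in which a period marks a missing entry
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "queen,.,."), path)

# Tell read.table that "." means missing
poker <- read.table(path, sep = ",", header = TRUE, na.strings = ".")

is.na(poker$suit)    # FALSE  TRUE
is.na(poker$value)   # FALSE  TRUE

# Because "." became NA, the value column can still be read as numbers
is.numeric(poker$value)   # TRUE
```

Without na.strings, the period would be kept as text and the value column would not be numeric.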
skip and nrows
Sometimes a plain-text file will come with introductory text that is not part of the data set. Or, you may decide that you only wish to read in part of a data set. You can do these things with the skip and nrows arguments. Use skip to tell R to skip a specific number of lines before it starts reading in values from the file. Use nrows to tell R to stop reading in values after it has read in a certain number of lines. (R matches argument names partially, so the shorter spelling nrow also works.)
For example, imagine that the complete royal flush file looks like this:
This data was collected by the National Poker Institute.
We accidentally repeated the last row of data.

"card", "suit", "value"
"ace", "spades", 14
"king", "spades", 13
"queen", "spades", 12
"jack", "spades", 11
"ten", "spades", 10
"ten", "spades", 10
You can read just the six lines that you want (five rows plus a header) with:
read.table("poker.csv", sep = ",", header = TRUE, skip = 3, nrows = 5)
##    card   suit value
## 1   ace spades    14
## 2  king spades    13
## 3 queen spades    12
## 4  jack spades    11
## 5   ten spades    10
Notice that the header row doesn't count towards the total rows allowed by nrows.
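The following sketch recreates a file like this one in a temporary location so you can try the two arguments yourself. It uses the argument's full name, nrows, and assumes the file has two lines of notes followed by one blank line before the header:

```r
# Recreate the file: two lines of notes, a blank line, a header,
# five rows of data, and one accidentally repeated row
path <- tempfile(fileext = ".csv")
writeLines(c("This data was collected by the National Poker Institute.",
             "We accidentally repeated the last row of data.",
             "",
             "card,suit,value",
             "ace,spades,14",
             "king,spades,13",
             "queen,spades,12",
             "jack,spades,11",
             "ten,spades,10",
             "ten,spades,10"), path)

# Skip the three introductory lines, then read the header plus five rows
poker <- read.table(path, sep = ",", header = TRUE, skip = 3, nrows = 5)

nrow(poker)             # 5
any(duplicated(poker))  # FALSE, the repeated row was left out
```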
stringsAsFactors
R reads in numbers just as you'd expect, but in versions of R before 4.0.0, R begins to act strangely when it comes across character strings (e.g., letters and words): it wants to convert every character string into a factor. This was R's default behavior for many years, but I think it was a mistake. Sometimes factors are useful. At other times, they're clearly the wrong data type for the job. Factors can also cause weird behavior, especially when you want to display data. This behavior can be surprising if you didn't realize that R converted your data to factors. In general, you'll have a smoother R experience if you don't let R make factors until you ask for them. Since R 4.0.0, read.table leaves character strings as they are by default, but you will still run into older installations and older code, so it is worth knowing how to control this yourself. Thankfully, it is easy to do.
Setting the argument stringsAsFactors to FALSE will ensure that R saves any character strings in your data set as character strings, not factors. To use stringsAsFactors, you'd write:
read.table("poker.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
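If you will be loading more than one data file in a version of R older than 4.0.0, you can change the default factoring behavior at the global level with:
options(stringsAsFactors = FALSE)
This will ensure that all strings will be read as strings, not as factors, until you end your R session, or change the global default back by running:
options(stringsAsFactors = TRUE)
Newer versions of R have been moving away from this global option, so passing stringsAsFactors directly to the read function is the more dependable choice.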
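If you are unsure what your version of R does, you can compare both settings directly. This sketch uses a made-up temporary file and passes the argument explicitly each time, so it does not depend on any global option:

```r
# Write a tiny comma-separated file (illustrative data)
path <- tempfile(fileext = ".csv")
writeLines(c("card,suit,value",
             "ace,spades,14",
             "king,spades,13"), path)

# Same file, read with each setting
as_strings <- read.table(path, sep = ",", header = TRUE,
                         stringsAsFactors = FALSE)
as_factors <- read.table(path, sep = ",", header = TRUE,
                         stringsAsFactors = TRUE)

class(as_strings$card)   # "character"
class(as_factors$card)   # "factor"

# Numbers are unaffected by the setting
class(as_strings$value) == class(as_factors$value)   # TRUE
```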
The read Family
R also comes with some prepackaged shortcuts for read.table, shown in Table D.1.
Function | Defaults | Use |
---|---|---|
read.table | sep = "", header = FALSE | General-purpose read function |
read.csv | sep = ",", header = TRUE | Comma-separated values (CSV) files |
read.delim | sep = "\t", header = TRUE | Tab-delimited files |
read.csv2 | sep = ";", header = TRUE, dec = "," | CSV files with European decimal format |
read.delim2 | sep = "\t", header = TRUE, dec = "," | Tab-delimited files with European decimal format |
The first shortcut, read.csv, behaves just like read.table but automatically sets sep = "," and header = TRUE, which can save you some typing:
poker <- read.csv("poker.csv")
read.delim automatically sets sep to the tab character, which is very handy for reading tab-delimited files. These are files where each cell is separated by a tab. read.delim also sets header = TRUE by default.
read.delim2 and read.csv2 exist for European R users. These functions tell R that the data uses a comma instead of a period to denote decimal places. (If you're wondering how this works with CSV files, CSV2 files usually separate cells with a semicolon, not a comma.)
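As a quick check that the shortcuts really are read.table with different defaults, the sketch below reads a semicolon-separated file with decimal commas and a tab-delimited file. Both files, and the weight column, are invented for illustration. The shortcuts also change a few other defaults (quote, fill, and comment.char), which makes no difference for simple files like these:

```r
# A CSV2-style file: semicolons between cells, commas as decimal marks
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("card;weight",
             "ace;1,5",
             "king;2,25"), csv2_path)

euro <- read.csv2(csv2_path)
euro$weight   # 1.50 2.25, read as numbers

# The same result spelled out with read.table
same <- read.table(csv2_path, sep = ";", dec = ",", header = TRUE)

# A tab-delimited file, read with read.delim
tab_path <- tempfile(fileext = ".txt")
writeLines(c("card\tsuit\tvalue",
             "ace\tspades\t14"), tab_path)

tabbed <- read.delim(tab_path)
names(tabbed)   # "card" "suit" "value"
```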
Import Dataset
You can also load plain-text files with RStudio's Import Dataset button, as described in Loading Data. Import Dataset provides a GUI version of read.table.
read.fwf
One type of plain-text file defies the pattern by using its layout to separate data cells. Each row is placed in its own line (as with other plain-text files), and then each column begins at a specific number of characters from the lefthand side of the document. To achieve this, an arbitrary number of character spaces is added to the end of each entry to correctly position the next entry. These documents are known as fixed-width files and usually end with the extension .fwf.
Here's one way the royal flush data set could look as a fixed-width file. In each row, the suit entry begins exactly 10 characters from the start of the line. It doesn't matter how many characters appeared in the first cell of each row:
card      suit   value
ace       spades 14
king      spades 13
queen     spades 12
jack      spades 11
10        spades 10
Fixed-width files look nice to human eyes (but no better than a tab-delimited file); however, they can be difficult to work with. Perhaps because of this, R comes with a function for reading fixed-width files, but no function for saving them. Unfortunately, US government agencies seem to like fixed-width files, and you'll likely encounter one or more during your career.
You can read fixed-width files into R with the function read.fwf. The function takes most of the same arguments as read.table but requires an additional argument, widths, which should be a vector of numbers. The _i_th entry of the widths vector should state the width (in characters) of the _i_th column of the data set.
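If the aforementioned fixed-width royal flush data was saved as poker.fwf in your working directory, you could read it with read.fwf. Be aware that read.fwf only understands a header line whose names are separated by its sep character (a tab by default). For a header laid out in fixed-width columns like this one, skip the first line and supply the names yourself:
poker <- read.fwf("poker.fwf", widths = c(10, 7, 6), skip = 1, col.names = c("card", "suit", "value"))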
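The following sketch writes a small fixed-width file to a temporary location and reads it back. The layout (columns of 10, 7, and 6 characters) follows the example above; the choices to skip the header line, name the columns with col.names, and trim the padding with strip.white are assumptions made for this illustration, since read.fwf expects header names to be separated by sep and not laid out in fixed-width columns:

```r
# Write a fixed-width file: card takes 10 characters, suit 7, value up to 6
path <- tempfile(fileext = ".fwf")
writeLines(c("card      suit   value",
             "ace       spades 14",
             "king      spades 13",
             "10        spades 10"), path)

# Skip the fixed-width header line and name the columns by hand;
# strip.white = TRUE removes the padding spaces from the text columns
poker <- read.fwf(path, widths = c(10, 7, 6), skip = 1,
                  col.names = c("card", "suit", "value"),
                  strip.white = TRUE, stringsAsFactors = FALSE)

poker$card    # "ace"  "king" "10"
poker$value   # 14 13 10
```

Because the card column contains words as well as "10", R keeps the whole column as text.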
HTML Links
Many data files are made available on the Internet at their own web address. If you are connected to the Internet, you can open these files straight into R with read.table, read.csv, etc. You can pass a web address into the file name argument for any of R's data-reading functions. As a result, you could read in the poker data set from a web address like http://…/poker.csv with:
poker <- read.csv("http://.../poker.csv")
That's obviously not a real address, but here's something that would work—if you can manage to type it!
deck <- read.csv("https://gist.githubusercontent.com/garrettgman/9629323/raw/ee5dfc039fd581cb467cc69c226ea2524913c3d8/deck.csv")
Just make sure that the web address links directly to the file and not to a web page that links to the file. Usually, when you visit a data file's web address, the file will begin to download or the raw data will appear in your browser window.
Note that addresses that begin with _https://_ are secure addresses. Very old versions of R could not always read from them, but current versions can usually access the data provided at these links directly.
Source: G. Grolemund, https://rstudio-education.github.io/hopr/dataio.html This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.