Unit 3: Data Import and Export
Data for analysis can be created or simulated within R or loaded from an external file. R can generate regular sequences and samples from probability distributions (random numbers), which are often used in simulation-based inference. However, most applied tasks require loading existing data into R from an external file or a database. R has several built-in functions for loading data; additional packages expand R's functionality and allow us to load data saved in special formats like Excel, Matlab, or Network Common Data Form (NetCDF). Besides loading data of different types, this unit demonstrates ways to save R outputs in formats like CSV or RDS.
Completing this unit should take you approximately 2 hours.
Upon successful completion of this unit, you will be able to:
- describe R capabilities of loading external data for analysis;
- load tabulated or comma-separated data files using base-R functions;
- load data in Excel format using functions from a specialized R package;
- describe options for saving data in R; and
- export data in CSV and R formats.
3.1: Data Input via Keyboard or Number Generation
This section presents tools for creating data within R without dependence on external sources. These options are quick and convenient, and you have used some of them to create your first R objects.
It can be a good idea to put down a few values directly in your code to create an object to try things on. First, you can use this new "synthetic" dataset to write more code while waiting for the real data. Second, you can use this dataset to debug your code (find the source of an error and fix it). When you complete this section, you will know several ways of creating data objects manually.
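As a minimal sketch of creating data by hand, the code below types a few values directly into vectors and combines them into a small data frame. All object names and values (temperature, site, toy, and so on) are made up for illustration.

```r
# Vectors typed in directly (illustrative values only)
temperature <- c(21.5, 23.1, 19.8, 22.4)
site <- c("A", "B", "C", "D")

# Regular sequences and repeated values
day <- seq(from = 1, to = 4, by = 1)
group <- rep(c("control", "treated"), times = 2)

# Combine the vectors into a small data frame to try code on
toy <- data.frame(site, day, group, temperature)
print(toy)

# Inspect the structure: column names, types, and first values
str(toy)
```

A toy object like this is usually enough to test a function call or reproduce an error before the real data arrive.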
R already has a collection of datasets available to you. You can save some time by using these datasets instead of inputting example data manually. You will also notice that many example applications of R functions use these datasets for demonstrations, both in the Examples section of a function's help page and online (for example, on the StackExchange website). Moreover, some R packages supply additional datasets.
In this short practice exercise, you will try using a dataset already loaded in R. It is convenient when you want to try things out on some data (of a certain structure) but do not have your data ready yet.
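For instance, the built-in dataset mtcars can be used by name without loading any file. The sketch below is one possible way to explore it; the choice of mtcars is only an example, and any other built-in dataset would work the same way.

```r
# List the names of datasets shipped with the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Built-in datasets can be used directly by name
head(mtcars)          # first six rows
dim(mtcars)           # number of rows and columns
summary(mtcars$mpg)   # summary of one column

# In an interactive session, ?mtcars opens the dataset's help page
```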
The tools of random number generation are used for creating entirely new "synthetic" datasets and for permutation, subsampling, and bootstrapping (resampling with replacement) of existing data. You will learn how to use built-in R functions to generate random samples from different probability distributions (more distributions are available from user-contributed packages, such as the package gamlss.dist).
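The sketch below shows a few base-R functions for these tasks: rnorm, runif, and rbinom draw from probability distributions, and sample permutes or resamples an existing vector. The parameter values and the vector values are arbitrary, chosen only for illustration.

```r
n <- 5

# Random samples from probability distributions
x_norm  <- rnorm(n, mean = 10, sd = 2)        # normal distribution
x_unif  <- runif(n, min = 0, max = 1)         # uniform distribution on [0, 1]
x_binom <- rbinom(n, size = 10, prob = 0.3)   # binomial counts out of 10 trials

# Manipulating existing data (illustrative values)
values <- c(12, 15, 11, 19, 14, 17)
permuted  <- sample(values)                    # random permutation
subsample <- sample(values, size = 3)          # subsample without replacement
bootstrap <- sample(values, size = length(values), replace = TRUE)  # resample with replacement

print(permuted)
print(subsample)
print(bootstrap)
```

Because no seed is set here, each run produces different numbers.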
Here you will use functions for randomizing and subsampling things. The exercises also touch on the reproducibility of these random manipulations. Run the code from the following example on your computer. Were you able to obtain the same "random" numbers after calling set.seed?
This video demonstrates the value and power of setting the seed for random number generation. Set a seed to make the results of sampling, bootstrapping, etc., reproducible in your research. However, do not overuse this option: if you reset the same seed before every random draw, each draw returns the same numbers and there is no randomness left. Set the seed once at the start of a script, or at least use different seeds.
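A minimal sketch of this behavior, assuming an arbitrary seed value of 123: resetting the seed reproduces the same draws, while continuing without resetting gives new ones.

```r
set.seed(123)          # arbitrary seed chosen for illustration
first <- rnorm(3)

set.seed(123)          # same seed again: the random stream restarts
second <- rnorm(3)

third <- rnorm(3)      # no reset: the stream continues with new values

identical(first, second)   # TRUE: same seed, same numbers
identical(first, third)    # FALSE: different part of the stream
```

This also illustrates the caution about overuse: calling set.seed(123) before every draw would make all of the draws identical.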
3.2: Loading External Files
Loading the data is probably the first step of your analysis. The simplest case is a dataset saved in a clean plain-text format like CSV, but R can deal with many other formats. This section demonstrates tools from base R and user-contributed packages for loading files of commonly used types.
This video shows the general approach to loading files: the content (the result of the loading) is assigned to some object, then you can view it in the RStudio viewer. Note that sorting the data in the viewer does not change the sorting order in the R object.
Loading plain-text files is a simple task. Files of this type are the best for sharing (for example, as a supplement to a publication) and long-term archiving of information. Pay attention to base-R options for skipping the lines, reading only a certain number of lines, and formatting strings – these options are often used by other packages too.
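The sketch below illustrates the base-R arguments skip, nrows, and stringsAsFactors with read.csv. So that the example is self-contained, it first writes a small made-up file with two title lines to a temporary location; the file contents are hypothetical.

```r
# Create a small example file with two title lines above the data
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Example measurements",
  "Collected for illustration only",
  "id,name,value",
  "1,alpha,10.5",
  "2,beta,20.1",
  "3,gamma,30.7"
), path)

# skip = 2 drops the two title lines; the next line is used as the header
full <- read.csv(path, skip = 2)

# nrows = 2 reads only the first two data rows after the header
first_two <- read.csv(path, skip = 2, nrows = 2)

# stringsAsFactors = FALSE keeps text columns as character vectors
as_char <- read.csv(path, skip = 2, stringsAsFactors = FALSE)

str(full)
print(first_two)

unlink(path)   # remove the temporary file
```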
Base-R functions are great, but if you prefer to use tidyverse packages and get a tibble upon loading the data, you might want to start with the readr functions (readr is one of the packages in the tidyverse collection). Keep in mind that these functions (and functions from the package data.table) are also typically faster than the base-R functions, especially on large files.
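A short sketch of the readr approach follows. It assumes the readr package is installed (the code is skipped otherwise) and uses a made-up temporary file. The arguments skip and n_max play the same roles as skip and nrows in base R.

```r
tbl <- NULL   # stays NULL if readr is not installed

if (requireNamespace("readr", quietly = TRUE)) {
  # Small example file with one title line above the data
  path <- tempfile(fileext = ".csv")
  writeLines(c(
    "Title line that is not data",
    "id,name,value",
    "1,alpha,10.5",
    "2,beta,20.1",
    "3,gamma,30.7"
  ), path)

  # skip the title line and read at most two data rows; the result is a tibble
  tbl <- readr::read_csv(path, skip = 1, n_max = 2)
  print(tbl)

  unlink(path)
}
```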
These exercises check your understanding of file loading and some useful arguments for skipping or reading a certain number of rows. Keep these arguments in mind, as they come in handy when files have multi-line titles that are not part of the data. Complete these exercises to practice CSV loading. Also, try to load some files from your computer.
When R encounters different formats of numbers (for example, numbers grouped by thousand like "150.300,00" vs. "150,300.00"), dates, etc., it tries to make the best guess and parse the inputs into a corresponding R representation. Here, you will learn how it is done in a series of vector examples.
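The idea can be sketched on vectors with base-R functions alone. This is not necessarily the approach taken in the linked material: the readr package offers parse_number and related functions with a locale argument for the same purpose. The input strings below are examples only.

```r
# Numbers stored as text are converted when the format is unambiguous
plain <- as.numeric(c("1.5", "2", "3e2"))

# European style: "." groups thousands and "," marks the decimals
european <- "150.300,00"
no_groups <- gsub(".", "", european, fixed = TRUE)             # "150300,00"
eu_value <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))

# US style: "," groups thousands and "." marks the decimals
us <- "150,300.00"
us_value <- as.numeric(gsub(",", "", us, fixed = TRUE))

# Dates need a format string unless they follow the ISO layout (YYYY-MM-DD)
d1 <- as.Date("31/12/2020", format = "%d/%m/%Y")
d2 <- as.Date("2020-12-31")

# type.convert guesses a type for a character vector
guess_int <- type.convert(c("1", "2", "3"), as.is = TRUE)   # integer
guess_chr <- type.convert(c("1", "2", "x"), as.is = TRUE)   # stays character

print(c(eu_value, us_value))
print(d1 == d2)
```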
These exercises provide real-life examples of issues we can encounter when loading a file with different formats for the dates or decimal points. Complete these exercises to prepare yourself for those situations. This exercise does not count toward your grade. It is just for practice!
Here we generalize our knowledge of parsing to parse a whole file. After loading your data, you can check the column types in different ways, such as by expanding the object listed in the RStudio Environment pane or by applying the functions str or summary.
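As a hedged base-R illustration, the sketch below loads a made-up semicolon-separated file that uses decimal commas and day/month/year dates, then checks the resulting column types. The file contents and column names are hypothetical.

```r
# Example file: ";" separates fields and "," marks decimals
path <- tempfile(fileext = ".csv")
writeLines(c(
  "date;station;rainfall",
  "31/12/2020;A;12,5",
  "01/01/2021;B;0,0",
  "02/01/2021;A;3,2"
), path)

# read.csv2 uses sep = ";" and dec = "," by default
rain <- read.csv2(path, stringsAsFactors = FALSE)

# The dates arrive as text, so convert them with an explicit format
rain$date <- as.Date(rain$date, format = "%d/%m/%Y")

# Check the column types in several ways
str(rain)
summary(rain)
col_types <- sapply(rain, class)
print(col_types)

unlink(path)
```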
While CSV files can be loaded with the base-R functions or functions from other packages, special packages are required for loading Excel files. There are several alternatives (including the packages readxl, xlsx, openxlsx, and XLConnect), but we consider only readxl here because it belongs to the popular tidyverse group of packages and returns the already familiar tibble structure.
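A minimal readxl sketch follows. It assumes the readxl package is installed (the code is skipped otherwise) and relies on the example workbook datasets.xlsx that ships with the package, so no file of your own is needed.

```r
sheets <- NULL
xl <- NULL

if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl
  path <- readxl::readxl_example("datasets.xlsx")

  # List the worksheets in the workbook
  sheets <- readxl::excel_sheets(path)
  print(sheets)

  # Read the first five data rows of the first sheet; the result is a tibble
  xl <- readxl::read_excel(path, sheet = sheets[1], n_max = 5)
  print(xl)
}
```

With your own file, replace the path with the location of the workbook and give the sheet by name or position.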
User-contributed packages provide tools for loading data saved in many other formats into R. Often several packages can load the same file format; you can find them by searching on the internet.
3.3: Data Export and Reusing R Data
We might need to save our data to take a break from coding or to share the data with others. This section teaches how to save and reload data in R and plain-text formats like CSV. Packages for loading data in other formats often contain functions for saving in those formats – see the help pages for the specific package. For example, the package R.matlab has functions for reading and saving MAT files.
Now you will learn how to save the data from your R session. This works for sharing the results with a friend who also uses R or for preserving the data for later reuse in R. Note that the assignment operator is not needed when an R image file is loaded: the function load restores the saved objects under their original names.
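The sketch below contrasts the two R-specific formats under some assumptions: the objects and temporary file names are made up. The function save stores one or more named objects (save.image stores the whole workspace), and load brings them back under their original names. In contrast, saveRDS stores a single object, and readRDS returns it for assignment to any name.

```r
# Illustrative objects
scores <- data.frame(id = 1:3, score = c(88, 92, 79))
note <- "example objects"
original <- scores   # copy kept only for the comparison below

rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")

save(scores, note, file = rdata_path)   # several objects with their names
saveRDS(scores, file = rds_path)        # a single object without its name

rm(scores, note)                        # simulate a fresh session

# load() recreates 'scores' and 'note'; its return value is just their names
restored_names <- load(rdata_path)
print(restored_names)

# readRDS() returns the object, so it is assigned explicitly
scores_copy <- readRDS(rds_path)

identical(scores, original)
identical(scores_copy, original)

unlink(c(rdata_path, rds_path))
```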
Here is a short exercise to practice exporting and reusing data. This exercise does not count toward your grade. It is just for practice!
For long-term preservation of data and broader sharing (not just with R users), it is better to save the data in a plain-text format like CSV.
Here are the base-R functions to do that. You might find the option row.names = FALSE handy.
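As a small sketch with a made-up data frame, the code below writes a CSV file with and without row.names = FALSE to show the difference the option makes.

```r
scores <- data.frame(id = 1:3, score = c(88.5, 92, 79.25))

# Without row names: the file contains only the data columns
path <- tempfile(fileext = ".csv")
write.csv(scores, file = path, row.names = FALSE)
file_lines <- readLines(path)
print(file_lines)
reloaded <- read.csv(path)

# Default (row.names = TRUE): an extra first column of row names is written
path_rn <- tempfile(fileext = ".csv")
write.csv(scores, file = path_rn)
reloaded_rn <- read.csv(path_rn)
names(reloaded_rn)   # the row names come back as a column named "X"

unlink(c(path, path_rn))
```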
The tidyverse also offers options for saving such files. Now you should be familiar with both options (base-R and tidyverse).
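A comparable sketch with readr, assuming the package is installed (the code is skipped otherwise): write_csv does not write row names, so no extra option is needed. The data frame is made up for illustration.

```r
csv_lines <- NULL
back <- NULL

if (requireNamespace("readr", quietly = TRUE)) {
  scores <- data.frame(id = 1:3, score = c(88.5, 92, 79.25))
  path <- tempfile(fileext = ".csv")

  # Write the data frame; only the data columns are written
  readr::write_csv(scores, path)
  csv_lines <- readLines(path)
  print(csv_lines)

  # Read it back as a tibble and force the data into memory
  back <- as.data.frame(readr::read_csv(path))
  print(back)

  unlink(path)
}
```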
Try this short exercise to practice exporting data in CSV and Excel format. This exercise does not count toward your grade. It is just for practice!
This exercise provides a short but complete piece of code for the cycle of loading a dataset, saving it, and reloading it in an R project environment that contains the folders "dataraw" and "dataderived". This exercise does not count toward your grade. It is just for practice!
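One possible version of such a cycle is sketched below. It is not the exercise's own code: the project is simulated in a temporary directory, the raw file is a stand-in created from the built-in mtcars dataset, and the file names and the added column are hypothetical.

```r
# Simulate a project folder containing "dataraw" and "dataderived"
project <- file.path(tempdir(), "example_project")
dir.create(file.path(project, "dataraw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "dataderived"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw data file (normally this file already exists)
raw_file <- file.path(project, "dataraw", "cars.csv")
write.csv(head(mtcars, 10), raw_file)

# 1. Load the raw data; the first column holds the row names
cars <- read.csv(raw_file, row.names = 1)

# 2. Derive something new (an illustrative flag for cars over 3000 lbs)
cars$heavy <- cars$wt > 3

# 3. Save the derived data as CSV (for sharing) and RDS (for reuse in R)
derived_csv <- file.path(project, "dataderived", "cars_heavy.csv")
derived_rds <- file.path(project, "dataderived", "cars_heavy.rds")
write.csv(cars, derived_csv)
saveRDS(cars, derived_rds)

# 4. Reload and confirm that nothing changed
cars_again <- readRDS(derived_rds)
identical(cars, cars_again)
```

Keeping raw files untouched in one folder and writing results to another makes it easy to rerun the whole cycle from the original data.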
Unit 3 Assessment
Take this assessment to see how well you understood this unit.
- This assessment does not count towards your grade. It is just for practice!
- You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
- You can take this assessment as many times as you want, whenever you want.