Topic | Name | Description |
---|---|---|
Course Syllabus | Course Syllabus | |
1.1: R and Coding Environments | Overview of R | Read about R, its history, connections to other languages, and alternatives for statistical computing. You will also learn about various interfaces that can be used to edit and run R code, such as RStudio. |
Introduction to R and RStudio | Here is a short introduction to the basics of the RStudio layout. You should now have a general understanding of what R and RStudio are, how R links to other languages, and how its functionality can be extended via user-contributed packages of functions. The next section will teach you how to install and set up this software. |
|
1.2: Installing and Setting Up R and RStudio | Installing R and RStudio | Follow the steps in the video to download and install R, then download and install RStudio Desktop. Both programs are completely free. R is the "core," so it is better to install it first. Then the interface (RStudio), while being installed, can easily find the R installation on your computer and connect to it. When you start RStudio, the console shows the version of R it has connected to. |
Setting Up RStudio | This video will introduce you to the RStudio interface and its main settings. Implement those settings on your computer because they foster good coding habits. By now, you should be comfortable setting up the R and RStudio coding environment on your computer. |
|
Updating Software | If needed, you can update R itself by downloading and installing a new version from the R Project website (by default, after restarting RStudio, it will start using the newest version of R installed on your computer). RStudio can be updated, for example, by clicking Help > Check for Updates. |
|
1.3: Command Line and Script | Using R as a Calculator | All complex analyses comprise sequences of multiple simple operations. To get acquainted with R and "build trust," please read about the operations you can run in the R (RStudio) console and immediately see the results. Run the code on your computer. Do you get the same outputs? Try changing some of the code to see how the outputs change. You should be able to guess what kind of output you can get from any of these simple operations. |
Practice: Calculator | To start getting familiar with R, go ahead and try the examples on your machine. Do you always get the output you expect? Create some variables (see the section on Variable Assignment) and check that they appear in the RStudio Environment (panel in the top right). This exercise does not count toward your grade. It is just for practice! |
|
1.4: Functions and Packages | Functions | You have used a few functions already. Here is a slightly more formal introduction. You should be able to understand the inputs (arguments) you specify when calling a function and the output it returns. For most functions, you can get help by executing ?function_name or help(function_name) in the console (replace function_name with the name of the function). |
Practice: Functions | Here you will practice applying the function |
|
Packages | R functions come in packages like tools come in different toolboxes. You will use the function install.packages("PackageName") to get the needed package and library(PackageName) to load it in your R environment. After the package is installed, comment out the installation code, like this: # install.packages("PackageName"), so your script doesn't reinstall the package each time, but keep library(PackageName) in because it is needed each time you start a new R session. It is a good idea to keep all library() statements at the top of your script so you can easily see all packages needed and do not duplicate the library calls. |
|
Updating R and Its Packages | New versions of R are released every year. The best way to stay up to date with R is to check the CRAN website periodically. The website is updated for each new release and makes the release available for download. You'll have to install the new release, and the process is the same as for the initial installation. |
|
Practice: Functions and Packages | As your code gets bigger, it might be hard to spot an error (a "bug" is what programmers call it politely). These examples let you practice debugging (finding and fixing errors in the code) so that the code runs smoothly and correctly. Install R packages if needed for the examples to run. This exercise does not count toward your grade. It is just for practice! |
|
1.5: Management of Code and Other Files | R Projects and Files in a Project | First, watch the video, then read about projects and R working directories. The video demonstrates one of the ways you can efficiently manage your files in a project. The discussed file structure will work in many cases but may need to be revised when the data are large and it is impossible or impractical to move them to the local data folders. Also, the video assumes that each script file (for data loading, cleaning, plotting, etc.) is relatively large; hence it makes sense to keep the code in separate files so it is more manageable. If each file (for data loading, cleaning, visualizing, and statistical analysis) contains just a few lines, it might be more practical to keep the code together in a single script; you are free to decide based on the needs and size of your project. |
Practice: R Projects | Try this exercise to practice creating and using R projects. This exercise does not count toward your grade. It is just for practice! |
|
Best Practices for Writing R Code | It is helpful to keep your work tidy so you or other users can reuse the code later. Add comments to your code to explain what you are doing and why, and follow the other recommendations in this section. |
|
2.1: Data Types | Basic Data Types and Data Structures in R | These are the data types we encounter in everyday work in R. You should learn about their differences and how to access their basic attributes. Most often, we need to know whether the data are in the correct format (such as numeric instead of character) and the size of the R object (use functions length(x) and dim(x) for that). |
Practice: Data Types | Now let's try a quick practice exercise creating variables. This exercise does not count toward your grade. It is just for practice! |
|
Strings | Although not all of us are linguists or text analysts, R functions for operating with text strings are still useful. They will come in handy when you need to match records in the data or select a portion of a textual record (for example, only the first name and not the surname). This section covers the basics of these operations. |
|
Practice: Strings | This practice aims to make you more familiar with the string format and operations we often need to match or subset strings while preparing data for analysis. Try to solve these string operations tasks. This exercise does not count toward your grade. It is just for practice! |
|
Factors | Factors are the way categorical variables are stored in R. For example, treatment levels in ANOVA (analysis of variance) are considered factors; months or quarters of the year can be represented as factors for modeling seasonality. You should learn how to create factors and how to rename and reorder factor levels, both for convenience and for correct analysis (for example, the control treatment usually should be the first level of a factor because, by default, other levels are compared to the first one in linear models). |
|
Practice: Factors | In these exercises, you will practice operations with factors needed for implementing the analysis of variance (ANOVA) or drawing a boxplot. Also, try applying the function table(X) to some factor X in your R environment; this function quickly counts the number of occurrences of each distinct value in X. These exercises do not count toward your grade. They are just for practice! |
|
2.2: Vectors | Vectors and Simple Manipulations | This section introduces the basic operations on vectors, most of which are done element-wise. Please pay attention to the recycling of vectors (recycling is silent when the longer length is a multiple of the shorter one, so it is easy to miss if it was unintended), missing values (NA), and logical vectors often used for data subsetting. |
Vectors and Type Coercion | The type of your data in R can be changed. Sometimes a function you apply changes the type internally, while the data object you supplied remains unaffected. For example, if x is a character object, lm(y ~ x) will treat x as a factor, but x will remain of type character in the R environment. In other cases, you can rely on coercion deliberately: to count the total or proportion of certain instances using a logical vector LV, you can apply sum(LV) or mean(LV), knowing that the logical values TRUE and FALSE will be treated as 1 and 0 by these functions. Please pay attention to these coercion rules. |
|
Practice: Vectors | This exercise shows how easy it is to work with vectors in R and modify and reorganize the data in vectors. In the exercise, you will create and manipulate a vector, then save elements 5-10 of your vector as a new (separate) vector. This exercise does not count toward your grade. It is just for practice! |
|
2.3: Arrays and Matrices | What is the Difference Between Arrays and Matrices? | The video demonstrates the differences between matrices and arrays and how these objects' elements can be accessed or subsetted. |
Arrays in R | An array can be considered as a multiply subscripted collection of data entries. This section provides details on the construction and manipulation of arrays. |
|
Matrices in R | This section provides details on the construction and manipulation of these objects, including matrix facilities that are different from typical element-wise operations. |
|
Practice: Arrays and Matrices | These exercises test your knowledge of creating, accessing, and manipulating arrays and matrices. These exercises do not count toward your grade. They are just for practice! |
|
2.4: Lists and Data Frames | Lists and Data Frames | Lists are used to hold elements of different sizes and types, such as outputs of a regression model fit or results of a statistical test. However, if we restrict list elements to vectors of the same length, we can get a data.frame. The data.frame structure is in between a matrix (a data.frame has columns and rows and can be indexed as a matrix) and a list (each column in a data.frame is a list element and can be indexed accordingly). The data.frame structure is convenient for holding typical spreadsheet data, where each column can be of a different type, for example, Date (Date class), Location (character type), and Temperature (numeric type). |
Practice: Base-R Lists and Data Frames | We'll practice with the data frame format, which is the usual format for storing information on different variables. We'll practice the extensions of this format later. Use the cats <- data.frame(coat = c("calico", "black", "tabby"), |
|
The Tibble Format | The tibble format belongs to the "tidyverse" family of packages and attempts to make operations with data.frame-like structures more user-friendly. The tidyverse conveniently aggregates several popular packages, such as ggplot2 for plotting and dplyr for data manipulation. You can convert a data.frame to a tibble and back if needed. |
|
Practice: Tibbles | Try working with tibbles in these exercises. Remember, tibbles are just another form of the data.frame format. Do you find tibbles more convenient to use? This exercise does not count toward your grade. It is just for practice! |
|
The data.table Format | The data.table format also helps shorten code when working with data.frame structures. Most importantly, data.table handles big data very efficiently. You can convert a data.frame to a data.table and back if needed. |
|
Practice: Data Tables | This is the final practice of data frame formats. Now you should be familiar with the three main ones: data.frame, data.table, and tibble. Repeat the practice tasks for tibbles now using the data.table format instead. This exercise does not count toward your grade. It is just for practice! |
|
3.1: Data Input via Keyboard or Number Generation | Entering Data | It can be a good idea to put down a few values directly in your code to create an object to try things on. First, you can use this new "synthetic" dataset to write more code while waiting for the real data. Second, you can use this dataset to debug your code (find the source of an error and fix it). When you complete this section, you will know several ways of creating data objects manually. |
Data Sets in Base R | R already has a collection of datasets available to you. You can save some time by using these datasets instead of inputting example data manually. You will also notice that many example applications of R functions (given in the Examples section of a function's help page or online, such as on the StackExchange website) use these datasets for demonstrations. Moreover, some R packages supply additional datasets. |
|
Practice: Built-in Datasets | In this short practice exercise, you will try using a dataset already loaded in R. It is convenient when you want to try things out on some data (of a certain structure) but do not have your data ready yet. |
|
Pseudo-Random Number Generation | The tools of random number generation are used for creating entirely new "synthetic" datasets and for permutation, subsampling, and bootstrapping (resampling with replacement) of existing data. You will learn how to use built-in R functions to generate random samples from different probability distributions (more distributions are available from user-contributed packages, such as the package gamlss.dist). |
|
Practice: Random Number Generation | Here you will use functions for randomizing and subsampling things. The exercises also touch on the reproducibility of these random manipulations. Run the code from the following example on your computer. Were you able to obtain the same "random" numbers after calling set.seed? |
|
Reproducible Simulations | This video demonstrates the value and power of setting the seed for random number generation. Set seeds to make the results of sampling, bootstrapping, etc., reproducible in your research. However, do not overuse this option (or at least be sure to use different seeds): resetting the same seed before every random draw reproduces the same numbers each time, so there will be no randomness. |
|
3.2: Loading External Files | Data Loading and Viewing | This video shows the general approach to loading files: the content (the result of the loading) is assigned to some object, then you can view it in the RStudio viewer. Note that sorting the data in the viewer does not change the sorting order in the R object. |
Base R: Reading Plain-Text Files | Loading plain-text files is a simple task. Files of this type are the best for sharing (for example, as a supplement to a publication) and long-term archiving of information. Pay attention to base-R options for skipping lines, reading only a certain number of lines, and formatting strings; these options are often used by other packages too. |
|
Tidyverse: Reading Plain-Text Files | Base-R functions are great, but if you prefer to use tidyverse packages and get a tibble upon loading the data, you might want to start with the readr functions (readr is one of the packages in the tidyverse collection). Remember that these functions (and functions from the package data.table) are also typically faster than the base-R functions. |
|
Practice: read_csv | These exercises check your understanding of file loading and some useful arguments for skipping or reading a certain number of rows. Keep these arguments in mind, as they come in handy when files have multi-line titles that are not part of the data. Complete these exercises to practice CSV loading. Also, try to load some files from your computer. |
|
Parsing a Vector | When R encounters different formats of numbers (for example, numbers grouped by thousand like "150.300,00" vs. "150,300.00"), dates, etc., it tries to make the best guess and parse the inputs into a corresponding R representation. Here, you will learn how it is done in a series of vector examples. |
|
Practice: Parsing a Vector | These exercises provide real-life examples of issues we can encounter when loading a file with different formats for the dates or decimal points. Complete these exercises to prepare yourself for those situations. This exercise does not count toward your grade. It is just for practice! |
|
Parsing a File | Here we generalize our knowledge of parsing to parse a whole file. After loading your data, you can check the type of columns in different ways, such as by unfolding the object saved in the Environment or by applying the functions str or summary. |
|
Using the readxl Package to Read Excel Files | While CSV files can be loaded with the base-R functions or functions from other packages, special packages are required for loading Excel files. There are several alternatives (including the packages readxl, xlsx, openxlsx, and XLConnect), but we consider only readxl here because it belongs to the popular tidyverse group of packages and returns the already familiar tibble structure. |
|
Loading Files From Other Programs | User-contributed packages provide tools for loading data saved in many other formats into R. Often several packages can load the same file format; you can find them by searching on the internet. |
|
3.3: Data Export and Reusing R Data | Saving and Reloading Data in R Format | Now you will learn how to save the data from your R session. This works for sharing the results with a friend who also uses R or for preserving the data for later reuse in R. Note that the assignment operator is not used when an R image file is loaded. |
Practice: Export and Reuse | Here is a short exercise to practice exporting and reusing data. This exercise does not count toward your grade. It is just for practice! |
|
Base R: Writing to a CSV File | For long-term preservation of data and broader sharing (not just with R users), it is better to save the data in a plain-text format like CSV. Here are the base-R functions to do that. You might find the option row.names = FALSE handy. |
|
Tidyverse: Writing to a CSV File | The tidyverse also offers options for saving such files. Now you should be familiar with both options (base-R and tidyverse). |
|
Practice: Export to a CSV File | Try this short exercise to practice exporting data in CSV and Excel format. This exercise does not count toward your grade. It is just for practice! |
|
Practice: Data Manipulation in a Project | This exercise provides short but complete code for the cycle of loading a dataset, saving it, and reloading it in an R project environment that contains the folders "dataraw" and "dataderived". This exercise does not count toward your grade. It is just for practice! |
|
4.1: Base-R and ggplot2 Graphics | Base-R Graphics | This section introduces the base-R graphics. Reading the materials will familiarize you with different options and commands used for plotting. You should start coding by calling a high-level function such as plot, then incrementally modify and add code to change the plot appearance, and use the function par to fine-tune the margins, etc. You will also learn about the R graphics devices used to save plots for publications (do not use the point-and-click interface to save plots from RStudio); these device commands are also applicable to ggplot2 output. |
Practice: Base-R Plots | In this short practice exercise, you will implement the high-level function plot. It is convenient for fast checks and does not require installing additional packages. This exercise does not count toward your grade. It is just for practice! |
|
Introduction to ggplot | This section introduces ggplot. |
|
Practice: ggplot | In this exercise, you can practice the implementation of ggplot and compare it to the base-R graphics. This exercise does not count toward your grade. It is just for practice! |
|
4.2: Creating Histograms | Introduction to Histograms | This video shows an interactive approach to creating histograms in base R, developing your code, and addressing the error messages. You will see more details on the available options in the next sections. |
Histograms and Density Plots in ggplot2 | Now you will learn the ggplot2 syntax for building and customizing histograms. |
|
Histograms and Density Plots in Base R | Here you will see more examples of how to build histograms in base R. Note that when the total counts for two or more samples are different, we can convert the vertical axis to density so the distributions can be easily compared on the same plot. |
|
Practice: Histograms | In this exercise, you will practice plotting a histogram for a publication. This exercise does not count toward your grade. It is just for practice! |
|
4.3: Creating Scatterplots | Introduction to Scatterplots | This video demonstrates the steps to create and tailor a scatterplot in the base-R plotting system. Notice the incremental development of the code, adding elements to the plot and checking its view in the plot window. Finally, the code in the video also uses the png command to export the resulting plot for publication. |
Scatterplots in Base R | Here we introduce scatterplots in base R. The code is simple, but you should also remember the options that make the plots more informative, like adding colors, legends, and error bars. |
|
Scatterplots in ggplot2 | You will learn the layered syntax of ggplot2 for scatterplots in this section. It also demonstrates how regression lines can be added (compare with the base-R syntax shown in the introductory video). |
|
Practice: Scatterplots | In this exercise, you practice producing scatterplots for a publication. This exercise does not count toward your grade. It is just for practice! |
|
4.4: Creating Boxplots | Introduction to Boxplots | This video shows how a boxplot is built. You should understand what each of the bars and whiskers means so you can interpret the boxplot. |
Boxplots in Base R | This section introduces the functionality of the base-R function boxplot. Note that for some data formats, the plot function with x being a factor variable will also work. |
|
Boxplots in ggplot2 | In this section, you will learn the ggplot2 code for producing boxplots. While the syntax and default appearance differ from base R, both kinds of plots aim to compare distributions and identify outliers. If needed, you can add a few lines of code to make the base-R and ggplot2 graphs look the same. The choice of which plotting system to use is yours now. |
|
Practice: Boxplots | This quick practice exercise asks you to produce boxplots for a publication. This exercise does not count toward your grade. It is just for practice! |
|
4.5: Creating Time Series Plots | Time Series Plots in Base R | This section is a short introduction to time series plots in R. You can use the analogy with the scatterplots where the horizontal axis is time. |
The ts Format | If you save the data in the special format ts, the plotting function plot.ts can produce a better-looking x-axis automatically. The ts format adds attributes to your data, such as the beginning and end times and frequency. This section shows how you can convert a usual vector to the ts format, then plot it. |
|
Time Series Plots Using ggplot2 | Of course, ggplot2 can also visualize time series. This section introduces the relevant ggplot2 syntax. |
|
Practice: Time Series Plots | In this exercise, you practice plotting a time series for a publication. This exercise does not count toward your grade. It is just for practice! |
|
5.1: Single-Sample Summaries | Basic Summary Statistics | After plotting the data, mean and variance are some of the basic summaries that we want to know. R has built-in functions to calculate mean, sd, var, and median. This video demonstrates the calculations and, using the plot, shows how the results relate to the sample data. |
Examining the Distribution of a Dataset | Even one variable can tell a story. For example, sample data on personal incomes might show distinct clusters of high- and low-paid workers, and time series of average temperatures may show trends and seasonal cycles. Here you will learn R tools for working with such data by combining your experience with plots and simple statistical summaries. |
|
Alternatives and Extensions | As you already know, for each base-R operation, there are user-contributed alternatives. This video demonstrates the function describe from the package psych, which outputs more statistics than the standard function summary. (You already know how to install and load the package to your R environment.) Be careful, as user-contributed packages might use the same names for their functions. For example, the package Hmisc also has a function describe that produces a different output. |
|
Practice: Statistical Summary | Functions for individual quantities like mean or median are convenient when we want to use that specific number in further analysis or visualizations, but the function summary and its alternatives are great for exploratory analysis. In the exercise, you can practice both approaches. This exercise does not count toward your grade. It is just for practice! |
|
Tables | Finally, the function table can count the number of observations per group. It is most useful when applied to factors, integers, logical values, or strings. It allows you to study group counts and proportions and to identify outliers. This section demonstrates the application of this function and how it can be applied to more than one variable. |
|
5.2: The t-test | One- and Two-Sample t-tests | The t-test is quite simple, and the base-R functionality will likely be sufficient for all your related calculations. This section introduces the plots and testing functions that help us to conduct the inference based on the t-test and its nonparametric alternative, the Wilcoxon (or Mann-Whitney) test. |
Applying the t-test | Fortunately, the t-test calculations can be modified for the cases when the assumption of equal variances across groups is violated. In other words, Welch's version of the t-test accounts for unequal variances. This video demonstrates the test application in R and the relevant options for implementing it. |
|
The Power of the t-test | The greater the difference between the compared quantities and the more observations we have, the more confident we can be, based on the t-test, that the observed differences are not just due to random chance but are true, statistically significant differences. Even if the means of two populations are different, the t-test might not detect it if the difference or the sample is small. The probability that the t-test detects the difference for a given sample size and variability is the power of the test, and it can be calculated in R. We prefer high power and often use the desired power, confidence level, and expected variability to identify the required sample size. |
|
Practice: t-test | In this exercise, you will use the t-test and Wilcoxon test to compare the Examination rates across the two groups. This exercise does not count toward your grade. It is just for practice! |
|
5.3: One-Way ANOVA | The Basics of One-Way ANOVA | This section introduces base-R functionality for the one-way ANOVA. "One-way" means that only one factor variable is used, such as in the case of BloodPressure ~ ExerciseLevel. Be aware that when two or more factors are used, contributed functions like car::Anova are preferred because they have the option to apply different types of the F-test and conduct inference without depending on the order in which the factors are introduced in the R formula. |
ANOVA in afex and car | This video shows the implementation of ANOVA in the packages car and afex. It probably makes sense to start using one of these packages for ANOVA analysis instead of the base-R functions aov and anova, even if you have only one factor variable to start with. The video also covers a range of post-hoc tests used to find the groups between which statistically significant differences occur. These tests are useful, and the global ANOVA test is not a prerequisite for using them; just remember to apply an adjustment for multiple testing. |
|
Practice: ANOVA | In this practice exercise, you will use the built-in dataset iris to test whether the |
|
5.4: Linear Regression | Model Basics | Models are simplified representations of reality based on available observations. Both the observations and our assumptions about the form of the existing relationships affect the model we get as an outcome of the analysis. Here you will learn the general approach for specifying and estimating a linear model in statistics. |
Practice: Model Basics | While R makes the model fitting process extremely easy, several steps or implicit decisions go into it. For example, one may choose to keep or remove extreme observations (outliers) and select the optimization algorithm. This exercise demonstrates the effects of these decisions on the modeling outcomes. |
|
Visualizing Models | One of the best tools to check the quality of a model is to plot things. This section shows how to visualize modeling results and the unmodeled remainder (residuals) to diagnose the model. Remember that residuals should not have any remaining pattern and should look randomly scattered. If there is a remaining pattern, try to include it in your model (that is, respecify the model), then reestimate the model and visualize the new residuals. |
|
Practice: Visual Model Checks | You should get used to checking model quality visually. Look for inconsistencies between the data cloud pattern and the fitted lines, for patterns in residuals, and for outlying observations. These exercises give examples and suggest R functions you can use for these tasks. This exercise does not count toward your grade. It is just for practice! |
|
Formulas and Model Families | Formulas are the R versions of statistical equations passed to the R functions for estimation. We use formulas to specify the models, such as what terms the model will have and their transformations. This section introduces various options for specifying a model using formulas. Pay attention to specifying the intercept and interactions of variables. |
|
Practice: Formulas | We often keep the intercept in the model even if it is not statistically significant because our main focus is usually on the effect of other variables expressed in their coefficients. However, there are cases when we need to remove the intercept to obtain the so-called "regression through the origin". Also, we might need to model the combined effect of two factors using the interaction term (for example, to model how light and water conditions affect plant growth). These exercises let you practice these cases and suggest you compare alternative models. These exercises do not count toward your grade. They are just for practice! |
|
Course Feedback Survey | Course Feedback Survey |
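
Several rows above describe small R idioms in words: putting the control treatment first in a factor (Factors), vector recycling (Vectors and Simple Manipulations), counting with logical vectors (Vectors and Type Coercion), and seeding the random number generator (Reproducible Simulations). The sketch below is one possible illustration of those idioms in base R. It is not taken from the course materials, and every object name in it (treatment, x, LV, a, b) and every data value is an assumption made up for the example.

```r
# Hypothetical example data; names and values are illustrative only.

# Factors: levels are sorted alphabetically by default, so "control"
# is not automatically the first (reference) level.
treatment <- factor(c("aspirin", "control", "ibuprofen", "control"))
levels(treatment)                          # "aspirin" "control" "ibuprofen"
treatment <- relevel(treatment, ref = "control")
levels(treatment)                          # "control" "aspirin" "ibuprofen"

# Recycling: the shorter vector is reused without any warning because
# 6 is a multiple of 2.
x <- 1:6
recycled <- x + c(10, 20)                  # 11 22 13 24 15 26

# Coercion: TRUE and FALSE are treated as 1 and 0 by sum() and mean().
LV <- x > 2
n_large <- sum(LV)                         # count of TRUE values: 4
share_large <- mean(LV)                    # proportion of TRUE values

# Reproducibility: the same seed gives the same "random" numbers.
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)                            # TRUE
```

Using relevel changes only the order of the levels, not the data values, which is why it is a convenient way to choose the reference group before fitting a linear model.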