• Unit 4: Data Visualization

Visualization is essential for storytelling with data and for communicating the results of your analysis. Graphs help us efficiently assess prominent features in the data (trends, variability), irregularities (changepoints, outliers), and relationships between variables, and compare these across different samples. R has powerful tools for creating scientific graphs. The most commonly used tools fall into two groups: built-in R functions (base-R) and functions from the ggplot2 package, which follows a different approach known as the grammar of graphics. This unit introduces the syntax for both approaches and shows how to export a publication-quality graph from R.

Completing this unit should take you approximately 3 hours.

• 4.1: Base-R and ggplot2 Graphics

There are two main plotting systems in R: the base plotting system (referred to as "base-R graphics" in this course) and the ggplot2 package. The lattice package is sometimes counted as a third system, but it is outside the scope of this course. Both base-R and ggplot2 have their advantages and disadvantages. Generally, you can produce the same plot with either system, but the syntax will be very different. This section introduces both plotting systems so you can navigate their code comfortably. This introduction may be lengthy, but it prepares us to study specific plot types in the next sections and to create those plots quickly in both base-R and ggplot2.
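To make the difference in syntax concrete, here is a minimal sketch that draws the same scatterplot in both systems. The choice of the built-in mtcars dataset and of the variables wt and mpg is an assumption made for illustration; it is not taken from the course materials.

```r
# Base-R graphics: call a plotting function and pass the vectors to draw.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Base-R scatterplot")

# ggplot2: build a plot object from data, aesthetic mappings, and layers.
# The guard skips this part if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
         title = "ggplot2 scatterplot")
  print(p)  # printing the object draws the plot
}
```

In base-R the plot is drawn as a side effect of the function call, while in ggplot2 the plot is an object that can be stored, modified, and printed later.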

• 4.2: Creating Histograms

We use histograms to identify the general position and shape of the distribution, including its center, scale, outliers, and possible multimodality (presence of clusters in the data). This section introduces the base-R and ggplot2 syntax for creating and decorating histograms. An important difference to remember here is that the base-R function hist employs a sensible default method to calculate the number of bins or breaks in the histogram (based on the Sturges formula, which is most suitable for normal-like distributions), while ggplot2 uses 30 bins by default. See the respective help files by running ?hist and ?geom_histogram.
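The difference in default binning can be sketched as follows. The simulated data and the variable names x, n_bins, and h are assumptions chosen for illustration. Passing the Sturges bin count to geom_histogram is one possible way to make the two plots comparable, not a recommendation from the course.

```r
set.seed(1)
x <- rnorm(200)  # simulated normal data, for illustration only

# Sturges' rule: ceiling(log2(n) + 1) bins.
n_bins <- nclass.Sturges(x)

# Base-R: breaks = "Sturges" is the default, so no bin argument is needed.
# hist() treats the number as a suggestion and picks "pretty" break points.
h <- hist(x, main = "Base-R histogram", xlab = "x", col = "grey80")

# ggplot2: without bins or binwidth, geom_histogram() falls back to 30 bins
# and prints a message; here the bin count is set explicitly instead.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) +
    geom_histogram(bins = n_bins, fill = "grey80", colour = "black") +
    labs(title = "ggplot2 histogram", x = "x", y = "Count")
  print(p)
}
```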

• 4.3: Creating Scatterplots

Scatterplots are commonly used to study relationships between two variables, one plotted on the horizontal (x) axis and the other on the vertical (y) axis. A scatterplot reveals the strength, direction, and shape of a relationship, as well as clusters and outliers in the data. Adding features to the plot improves its storytelling capabilities. For example, you can color the points according to a third variable. This section shows how to implement these features in R.
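As a rough sketch of coloring points by a third variable, the example below uses the built-in iris dataset. The dataset and the chosen columns are assumptions for illustration only.

```r
# Base-R: convert the factor to integer codes and use them as colour indices.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)",
     main = "Base-R scatterplot coloured by species")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# A numeric summary of the strength and direction of the linear relationship.
r <- cor(iris$Sepal.Length, iris$Petal.Length)

# ggplot2: map the third variable to the colour aesthetic;
# the legend is generated automatically.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
    geom_point() +
    labs(x = "Sepal length (cm)", y = "Petal length (cm)",
         title = "ggplot2 scatterplot coloured by species")
  print(p)
}
```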

• 4.4: Creating Boxplots

You can think of boxplots as scatterplots in which one of the variables is categorical, but boxplots are also more than that. A boxplot combines several statistical summaries: it displays the median and quartiles, shows the interquartile range (IQR), and indicates the skewness of the distribution and any outliers. Boxplots give a more compact summary of the data distribution than histograms, which makes them convenient for comparing groups. This section first explains the statistics represented in a boxplot, then shows how to create boxplots of different types in R.
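The statistics behind a boxplot can be inspected directly before drawing it. The sketch below assumes the built-in mtcars dataset, with mpg grouped by cyl, purely for illustration.

```r
# boxplot.stats() returns the five numbers drawn in a boxplot:
# lower whisker end, lower hinge, median, upper hinge, upper whisker end.
# The hinges are close to, but not always identical to, the quartiles.
bs <- boxplot.stats(mtcars$mpg)
bs$stats   # the five statistics
bs$out     # points beyond the whiskers (by default 1.5 * box length from the box)

# Base-R: the formula mpg ~ cyl draws one box per level of cyl.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Base-R boxplot")

# ggplot2: the grouping variable must be discrete, hence factor(cyl).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot() +
    labs(x = "Number of cylinders", y = "Miles per gallon",
         title = "ggplot2 boxplot")
  print(p)
}
```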

• 4.5: Creating Time Series Plots

A time series is a variable whose observations are ordered in time. To show this order in a plot, we usually use lines, or lines and points, rather than points alone. Time series plots can reveal general tendencies in the data, such as changes in the mean or variance, and periodic patterns. These plots can also be used to detect changepoints and outliers visually.
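A minimal sketch of line-based time series plots follows. It assumes the built-in AirPassengers dataset (monthly airline passenger totals, in thousands, for 1949 to 1960); the names recent and ap_df are hypothetical and used only for this example.

```r
# Base-R: plot() on a ts object uses the time index on the x axis.
plot(AirPassengers, type = "l",
     xlab = "Year", ylab = "Passengers (thousands)",
     main = "Base-R time series plot (lines)")

# Lines and points together, on a shorter window so the points stay readable.
recent <- window(AirPassengers, start = c(1958, 1))
plot(recent, type = "b",
     xlab = "Year", ylab = "Passengers (thousands)",
     main = "Base-R time series plot (lines and points)")

# ggplot2 works with data frames, so the ts object is converted first.
ap_df <- data.frame(time = as.numeric(time(AirPassengers)),
                    passengers = as.numeric(AirPassengers))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(ap_df, aes(x = time, y = passengers)) +
    geom_line() +
    labs(x = "Year", y = "Passengers (thousands)",
         title = "ggplot2 time series plot")
  print(p)
}
```

In this series, the upward trend, the growing variability, and the yearly periodic pattern are all visible in the line plot.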