Scatterplots in Base R

Site: Saylor Academy
Course: PRDV420: Introduction to R Programming
Book: Scatterplots in Base R

Description

Here we introduce scatterplots in base R. The code is simple, but you should also learn the options that make plots more informative, such as colors, legends, and error bars.

Scatter Plots

A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the tree data frame from the trees91.csv data file. In particular, we look at the relationship between the stem biomass ("tree$STBM") and the leaf biomass ("tree$LFBM").

The command to plot each pair of points as an x-coordinate and a y-coordinate is "plot":

> plot(tree$STBM,tree$LFBM)

There appears to be a strong positive association between the biomass in a tree's stems and in its leaves, and the relationship looks linear. The correlation between these two sets of observations is quite high:

> cor(tree$STBM,tree$LFBM)
[1] 0.911595

Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:

> plot(tree$STBM,tree$LFBM,
       main="Relationship Between Stem and Leaf Biomass",
       xlab="Stem Biomass",
       ylab="Leaf Biomass")
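
The examples above assume that the tree data frame has already been loaded from trees91.csv. For readers who do not have that file to hand, a minimal sketch with a synthetic stand-in is shown below. The data frame fakeTree and its values are invented for illustration and are not the real tree measurements.

```r
# Synthetic stand-in for the tree data frame (hypothetical values,
# NOT the contents of trees91.csv).
set.seed(1)
stem <- runif(30, min = 0.1, max = 2)
fakeTree <- data.frame(STBM = stem,
                       LFBM = 0.5 * stem + rnorm(30, sd = 0.1))

# Annotated scatter plot, using the same options as above.
plot(fakeTree$STBM, fakeTree$LFBM,
     main = "Relationship Between Stem and Leaf Biomass",
     xlab = "Stem Biomass",
     ylab = "Leaf Biomass")

# Correlation of the two synthetic columns.
r <- cor(fakeTree$STBM, fakeTree$LFBM)
print(r)
```

Once the real file is loaded, only the data frame name needs to change.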

Source: K. Black, https://www.cyclismo.org/tutorial/R/plotting.html#scatter-plots
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.

Continuous Data

This section describes:

  • multiple data sets on one plot;
  • error bars;
  • adding noise;
  • multiple graphs on one image; and
  • pairwise relationships.


Source: K. Black, https://www.cyclismo.org/tutorial/R/intermediatePlotting.html
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.

Multiple Data Sets on One Plot

One common task is to plot multiple data sets on the same plot. In many situations, the way to do this is to create the initial plot and add additional information. For example, to plot bivariate data, the plot command is used to initialize and create the plot. The points command can then add additional data sets to the plot.

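First, define a set of normally distributed random numbers and plot them. (This same data set is used throughout the examples below.) Because the numbers are random, your values, including the correlation, will differ from those shown.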

> x <- rnorm(10,sd=5,mean=20)
> y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
> cor(x,y)
[1] 0.7400576
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> x1 <- runif(8,15,25)
> y1 <- 2.5*x1 - 1.0 + runif(8,-6,6)
> points(x1,y1,col=2)

Note that in the previous example, the color of the second data set is set using the col option. You can try different numbers to see which colors are available; most installations offer at least eight, numbered 1 to 8. Also note that the points in the example above are plotted as circles. The symbol that is used can be changed using the pch option.

> x2 <- runif(8,15,25)
> y2 <- 2.5*x2 - 1.0 + runif(8,-6,6)
> points(x2,y2,col=3,pch=2)
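
One way to survey the choices is to draw every symbol and color code in a single chart. The sketch below assumes the default palette, which has eight entries in a standard installation. The variable names codes and defaultColors are illustrative only.

```r
# Draw symbol and color codes 1 through 8 side by side.
codes <- 1:8
plot(codes, rep(1, length(codes)),
     col = codes, pch = codes, cex = 2,
     xlab = "Value passed to col and pch", ylab = "",
     yaxt = "n", main = "Color and Symbol Codes")

# Names of the colors that the integer codes refer to.
defaultColors <- palette()
print(defaultColors)
```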

Again, try different numbers to see the various options. Another helpful option is to add a legend, which can be done with the legend command. In order, the arguments are the x and y coordinates at which to place the legend, followed by a vector of labels. A vector of colors can be given with the col option, and a vector of symbols can be given with the pch option. There are many other options, so use help(legend) to see them.

> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))


Figure 1. The three data sets are displayed on the same graph.
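
The legend command also accepts a position keyword such as "topleft" in place of explicit coordinates, which avoids having to guess values that fall inside the plotting region. A minimal sketch, using freshly generated data (the seed and the variable names are arbitrary choices for illustration):

```r
set.seed(42)                      # arbitrary seed, for repeatability only
xa <- rnorm(10, sd = 5, mean = 20)
ya <- 2.5 * xa - 1.0 + rnorm(10, sd = 9, mean = 0)
xb <- runif(8, 15, 25)
yb <- 2.5 * xb - 1.0 + runif(8, -6, 6)

plot(xa, ya, xlab = "Independent", ylab = "Dependent", main = "Random Stuff")
points(xb, yb, col = 2, pch = 3)

# "topleft" places the legend in the top-left corner of the plot region.
# legend() invisibly returns the position of its box and text.
legendInfo <- legend("topleft", legend = c("Original", "one"),
                     col = c(1, 2), pch = c(1, 3))
```

Other keywords include "topright", "bottomleft", "bottomright", and "center".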

Another common task is to change the limits of the axes, which changes the extent of the plotting area. This is achieved using the xlim and ylim options in the plot command. Each option takes a vector of length two that holds the minimum and maximum values.

> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff",xlim=c(0,30),ylim=c(0,100))
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))
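
Rather than hard-coding the limits, they can be computed from the data so that every data set fits inside the plotting area. The sketch below does this with range. The variable names and the seed are illustrative assumptions, and the data are regenerated so that the block runs on its own.

```r
set.seed(7)                       # arbitrary seed so the sketch is repeatable
xMain <- rnorm(10, sd = 5, mean = 20)
yMain <- 2.5 * xMain - 1.0 + rnorm(10, sd = 9, mean = 0)
xExtra <- runif(8, 15, 25)
yExtra <- 2.5 * xExtra - 1.0 + runif(8, -6, 6)

# range() returns c(min, max), the form that xlim and ylim expect.
xLimits <- range(c(xMain, xExtra))
yLimits <- range(c(yMain, yExtra))

plot(xMain, yMain, xlab = "Independent", ylab = "Dependent",
     main = "Random Stuff", xlim = xLimits, ylim = yLimits)
points(xExtra, yExtra, col = 2, pch = 3)
```

Without the explicit limits, plot sizes the axes from the first data set only, so points added later can fall outside the visible region.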

Error Bars

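Another common task is to add error bars to a set of data points. This can be accomplished using the arrows command. The arrows command takes two pairs of coordinates: the x and y values of the starting points, followed by the x and y values of the end points. The command then draws a line from each starting point to its end point and adds an "arrow head" with a given length and angle. Setting the angle to 90 degrees turns the arrow head into a flat cap.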

> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> xHigh <- x
> yHigh <- y + abs(rnorm(10,sd=3.5))
> xLow <- x
> yLow <- y - abs(rnorm(10,sd=3.1))
> arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)


Figure 2. A data set with error bars added.

Note that the option code is used to specify where the bars are drawn. Its value can be 1, 2, or 3. If code is 1, the bars are drawn at the points given by the first pair of arguments. If code is 2, the bars are drawn at the points given by the second pair of arguments. If code is 3, the bars are drawn at both.
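
When the uncertainty of each point is known, the bars can be drawn symmetrically about the data. The sketch below assumes the same fixed half-width for every point. The name halfWidth, its value, and the other variable names are hypothetical and are not part of the original example.

```r
set.seed(3)
xVals <- 1:10
yVals <- 2.5 * xVals + rnorm(10, sd = 2)
halfWidth <- 1.5                  # assumed uncertainty, chosen for illustration

yTop <- yVals + halfWidth
yBottom <- yVals - halfWidth

# Set ylim from the bar ends so the bars are not clipped by the plot region.
plot(xVals, yVals, xlab = "Independent", ylab = "Dependent",
     main = "Symmetric Error Bars", ylim = range(c(yBottom, yTop)))

# code=3 draws a flat cap (angle=90) at both ends of each segment.
arrows(xVals, yBottom, xVals, yTop, col = 2, angle = 90, length = 0.1, code = 3)
```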

Adding Noise (jitter)

In the previous example, a little bit of "noise" was added to the pairs to produce an artificial offset. This is a common technique when making plots, particularly for discrete data where many points would otherwise land on exactly the same spot. A simpler way to accomplish this is to use the jitter command.

> numberWhite <- rhyper(400,4,5,3)
> numberChipped <- rhyper(400,2,7,3)
> par(mfrow=c(1,2))
> plot(numberWhite,numberChipped,xlab="Number White Marbles Drawn",
       ylab="Number Chipped Marbles Drawn",main="Pulling Marbles")
> plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
       ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")


Figure 3. Points with noise added using the jitter command.
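
The size of the noise can be controlled. In the sketch below, the amount argument of jitter bounds the offset added to each value. The value 0.2, the seed, and the variable names are arbitrary choices for illustration.

```r
set.seed(11)
white <- rhyper(400, 4, 5, 3)     # small integer counts, so many points overlap
chipped <- rhyper(400, 2, 7, 3)

# Each value is shifted by a uniform random offset between -0.2 and 0.2.
whiteJ <- jitter(white, amount = 0.2)
chippedJ <- jitter(chipped, amount = 0.2)

plot(whiteJ, chippedJ,
     xlab = "Number White Marbles Drawn",
     ylab = "Number Chipped Marbles Drawn",
     main = "Jitter With a Fixed Amount")
```

Keeping the amount well below the spacing between distinct values (here 1) leaves the clusters visibly separate.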

Multiple Graphs on One Image

The jitter example in the previous section used a new command, par, which sets graphical parameters. In that example the mfrow parameter was set. Plots are arranged in an array whose default number of rows and columns is one. The mfrow parameter is a vector with two entries: the number of rows of plots, followed by the number of columns. In the jitter example, the plots were arranged in one row with two plots across. The example below arranges six plots in two rows of three.

> par(mfrow=c(2,3))
> boxplot(numberWhite,main="first plot")
> boxplot(numberChipped,main="second plot")
> plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
       ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
> hist(numberWhite,main="fourth plot")
> hist(numberChipped,main="fifth plot")
> mosaicplot(table(numberWhite,numberChipped),main="sixth plot")


Figure 4. An array of plots using the par command.
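
Because par changes the settings of the current graphics device, a layout stays in effect for later plots until it is changed again. A common pattern, sketched below, saves the old settings and restores them afterwards. The names oldPar, values, and layoutAfter are illustrative.

```r
set.seed(5)
values <- rnorm(50)

# par() returns the previous values of the parameters it changes.
oldPar <- par(mfrow = c(1, 2))
hist(values, main = "Histogram")
boxplot(values, main = "Box Plot")

# Restore the layout that was in effect before the change.
par(oldPar)
layoutAfter <- par("mfrow")
```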

Pairwise Relationships

There are times when you want to explore a large number of relationships. Many relationships can be plotted at one time using the pairs command. The idea is that you give it a matrix or a data frame, and the command creates a scatter plot for every pair of columns in the data.

> uData <- rnorm(20)
> vData <- rnorm(20,mean=5)
> wData <- uData + 2*vData + rnorm(20,sd=0.5)
> xData <- -2*uData+rnorm(20,sd=0.1)
> yData <- 3*vData+rnorm(20,sd=2.5)
> d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
> pairs(d)

Figure 5. Using pairs to produce all permutations of a set of relationships on one graph.
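
The visual impression from pairs can be checked numerically: cor applied to the same data frame returns the matrix of pairwise correlations. A minimal sketch follows, rebuilding the data frame with a fixed seed. The seed and the name corMatrix are assumptions made here so that the block runs on its own.

```r
set.seed(2024)                    # arbitrary seed, for repeatability only
uData <- rnorm(20)
vData <- rnorm(20, mean = 5)
wData <- uData + 2 * vData + rnorm(20, sd = 0.5)
xData <- -2 * uData + rnorm(20, sd = 0.1)
yData <- 3 * vData + rnorm(20, sd = 2.5)
d <- data.frame(u = uData, v = vData, w = wData, x = xData, y = yData)

pairs(d)

# One correlation per pair of columns; the diagonal is always 1.
corMatrix <- cor(d)
print(round(corMatrix, 2))
```

Panels that show a tight line, such as u against x, should correspond to entries close to 1 or -1 in the matrix.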