Examining the Distribution of a Dataset

Even one variable can tell a story. For example, sample data on personal incomes might show distinct clusters of high- and low-paid workers, and time series of average temperatures may show trends and seasonal cycles. Here you will learn R tools for working with such data by combining your experience with plots and simple statistical summaries.

Examining the distribution of a set of data

Given a (univariate) set of data we can examine its distribution in a large number of ways. The simplest is to examine the numbers. Two slightly different summaries are given by summary and fivenum and a display of the numbers by stem (a "stem and leaf" plot).

> attach(faithful)
> summary(eruptions)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.600   2.163   4.000   3.488   4.454   5.100
> fivenum(eruptions)
[1] 1.6000 2.1585 4.0000 4.4585 5.1000
> stem(eruptions)

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370

A stem-and-leaf plot is like a histogram, and R has a function hist to plot histograms.

> hist(eruptions)
## make the bins smaller, make a plot of density
> hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
> lines(density(eruptions, bw=0.1))
> rug(eruptions) # show the actual data points

More elegant density plots can be made by density, and we added a line produced by density in this example. The bandwidth bw was chosen by trial-and-error as the default gives too much smoothing (it usually does for "interesting" densities). (Better automated methods of bandwidth choice are available, and in this example bw = "SJ" gives a good result).


We can plot the empirical cumulative distribution function by using the function ecdf.

> plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)

This distribution is obviously far from any standard distribution. How about the right-hand mode, say eruptions of longer than 3 minutes? Let us fit a normal distribution and overlay the fitted CDF.

> long <- eruptions[eruptions > 3]
> plot(ecdf(long), do.points=FALSE, verticals=TRUE)
> x <- seq(3, 5.4, 0.01)
> lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)

Quantile-quantile (Q-Q) plots can help us examine this more carefully.

par(pty="s")       # arrange for a square figure region
qqnorm(long); qqline(long)

which shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. Let us compare this with some simulated data from a t distribution

images/QQ
x <- rt(250, df = 5)
qqnorm(x); qqline(x)

which will usually (if it is a random sample) show longer tails than expected for a normal. We can make a Q-Q plot against the generating distribution by

qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
qqline(x)

Finally, we might want a more formal test of agreement with normality (or not). R provides the Shapiro-Wilk test

> shapiro.test(long)

         Shapiro-Wilk normality test

data:  long
W = 0.9793, p-value = 0.01052

and the Kolmogorov-Smirnov test

> ks.test(long, "pnorm", mean = mean(long), sd = sqrt(var(long)))

         One-sample Kolmogorov-Smirnov test

data:  long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two.sided

(Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample).


Source: R Core Team, https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Examining-the-distribution-of-a-set-of-data
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

Last modified: Sunday, November 13, 2022, 3:09 PM