## Examining the Distribution of a Dataset

Even one variable can tell a story. For example, sample data on personal incomes might show distinct clusters of high- and low-paid workers, and time series of average temperatures may show trends and seasonal cycles. Here you will learn R tools for working with such data by combining your experience with plots and simple statistical summaries.

### Examining the distribution of a set of data

Given a (univariate) set of data we can examine its distribution in a large number of ways. The simplest is to examine the numbers. Two slightly different summaries are given by summary and fivenum and a display of the numbers by stem (a "stem and leaf" plot).

> attach(faithful)
> summary(eruptions)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.600   2.163   4.000   3.488   4.454   5.100
> fivenum(eruptions)
[1] 1.6000 2.1585 4.0000 4.4585 5.1000
> stem(eruptions)

The decimal point is 1 digit(s) to the left of the |

16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370


A stem-and-leaf plot is like a histogram, and R has a function hist to plot histograms.

> hist(eruptions)
## make the bins smaller, make a plot of density
> hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
> lines(density(eruptions, bw=0.1))
> rug(eruptions) # show the actual data points


More elegant density plots can be made by density, and we added a line produced by density in this example. The bandwidth bw was chosen by trial-and-error as the default gives too much smoothing (it usually does for "interesting" densities). (Better automated methods of bandwidth choice are available, and in this example bw = "SJ" gives a good result).

We can plot the empirical cumulative distribution function by using the function ecdf.

> plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)


This distribution is obviously far from any standard distribution. How about the right-hand mode, say eruptions of longer than 3 minutes? Let us fit a normal distribution and overlay the fitted CDF.

> long <- eruptions[eruptions > 3]
> plot(ecdf(long), do.points=FALSE, verticals=TRUE)
> x <- seq(3, 5.4, 0.01)
> lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)


Quantile-quantile (Q-Q) plots can help us examine this more carefully.

par(pty="s")       # arrange for a square figure region
qqnorm(long); qqline(long)


which shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. Let us compare this with some simulated data from a t distribution

x <- rt(250, df = 5)
qqnorm(x); qqline(x)


which will usually (if it is a random sample) show longer tails than expected for a normal. We can make a Q-Q plot against the generating distribution by

qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
qqline(x)


Finally, we might want a more formal test of agreement with normality (or not). R provides the Shapiro-Wilk test

> shapiro.test(long)

Shapiro-Wilk normality test

data:  long
W = 0.9793, p-value = 0.01052


and the Kolmogorov-Smirnov test

> ks.test(long, "pnorm", mean = mean(long), sd = sqrt(var(long)))

One-sample Kolmogorov-Smirnov test

data:  long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two.sided


(Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample).

Source: R Core Team, https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Examining-the-distribution-of-a-set-of-data