Measures of Central Location

This section elaborates on mean, median, and mode at the population level and sample level. This section also contains many interesting examples of range, variance, and standard deviation. Complete the exercises and check your answers.

The Median

To see why another concept of average is needed, consider the following situation. Suppose we are interested in the average yearly income of employees at a large corporation. We take a random sample of seven employees, obtaining the sample data (rounded to the nearest hundred dollars, and expressed in thousands of dollars).

\begin{array}{lllllll}24.8 & 22.8 & 24.6 & 192.5 & 25.2 & 18.5 & 23.7\end{array}

The mean (rounded to one decimal place) is \bar{x}=47.4, but the statement "the average income of employees at this corporation is \$47,400" is surely misleading. It is approximately twice what six of the seven employees in the sample make and is nowhere near what any of them makes. It is easy to see what went wrong: the one executive in the sample, whose salary is so large compared to everyone else's, made the numerator in the formula for the sample mean far too large, pulling the mean far to the right of where we think the average "ought" to be, namely around \$24,000 or \$25,000. The number 192.5 in our data set is called an outlier, a number that is far removed from most or all of the remaining measurements. An outlier is often the result of some sort of error, but not always; here it is a genuine measurement. We would get a better measure of the "center" of the data if we were to arrange the data in numerical order,

\begin{array}{lllllll}18.5 & 22.8 & 23.7 & 24.6 & 24.8 & 25.2 & 192.5\end{array}

then select the middle number in the list, in this case 24.6. The result is called the median of the data set, and has the property that roughly half of the measurements are larger than it is, and roughly half are smaller. In this sense it locates the center of the data. If there are an even number of measurements in the data set, then there will be two middle elements when all are lined up in order, so we take the mean of the middle two as the median. Thus we have the following definition.
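As a quick check of the figures above, the following Python sketch recomputes the mean and the median of the seven incomes with the standard library's statistics module. The variable names are illustrative only and do not come from the text.

```python
import statistics

# Sample incomes in thousands of dollars, as listed in the text.
incomes = [24.8, 22.8, 24.6, 192.5, 25.2, 18.5, 23.7]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier 192.5
median_income = statistics.median(incomes)  # middle value of the sorted data

print(round(mean_income, 1))  # the mean, rounded to one decimal place
print(median_income)          # the median
```

Removing 192.5 from the list and rerunning the sketch shows how strongly the single outlier affects the mean while leaving the median close to where it was.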

Definition

The sample median \widetilde{x} of a set of sample data for which there are an odd number of measurements is the middle measurement when the data are arranged in numerical order. The sample median \widetilde{x} of a set of sample data for which there are an even number of measurements is the mean of the two middle measurements when the data are arranged in numerical order.

The population median is defined in a similar way, but we will not have occasion to refer to it again in this text.
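The definition of the sample median translates directly into a short procedure. The sketch below is one possible way to write it in Python; the function name sample_median and the even-sized data set are hypothetical and are used only for illustration.

```python
def sample_median(data):
    """Return the sample median, following the odd/even definition."""
    ordered = sorted(data)  # arrange the data in numerical order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of measurements: the single middle measurement.
        return ordered[mid]
    # Even number of measurements: the mean of the two middle measurements.
    return (ordered[mid - 1] + ordered[mid]) / 2


# The income data from the text (an odd number of measurements).
incomes = [24.8, 22.8, 24.6, 192.5, 25.2, 18.5, 23.7]
# Hypothetical data with an even number of measurements.
even_example = [7, 1, 5, 3]

print(sample_median(incomes))
print(sample_median(even_example))
```

Sorting a copy of the data first keeps the caller's list unchanged, which is why the function uses sorted rather than sorting in place.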

The median is a value that divides the observations in a data set so that 50% of the data are on its left and the other 50% on its right. In accordance with Figure 2.6 "A Very Fine Relative Frequency Histogram", therefore, a vertical line drawn at the median divides the area under the curve that represents the distribution of the data in two, with area 0.5 (50% of the total area 1) to the left and area 0.5 (50% of the total area 1) to the right, as shown in Figure 2.7 "The Median". In our income example the median, \$24,600, clearly gave a much better measure of the middle of the data set than did the mean, \$47,400. This is typical for situations in which the distribution is skewed. (Skewness and symmetry of distributions are discussed at the end of this subsection.)

Figure 2.7 The Median


EXAMPLE 5

Compute the sample median for the data of Note 2.11 "Example 2".


Solution:

The data in numerical order are -1, 0, 2, 2. The two middle measurements are 0 and 2, so \widetilde{x}=(0+2) / 2=1.


EXAMPLE 6

Compute the sample median for the data of Note 2.12 "Example 3".


Solution:

The data in numerical order are

\begin{array}{llllllllll}1.39 & 1.76 & 1.90 & 2.12 & 2.53 & 2.71 & 3.00 & 3.33 & 3.71 & 4.00\end{array}

The number of observations is ten, which is even, so there are two middle measurements, the fifth and sixth, which are 2.53 and 2.71. Therefore the median of these data is \widetilde{x}=(2.53+2.71) / 2=2.62.
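The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the result is compared using a tolerance because decimal values such as 2.53 are not stored exactly in floating point.

```python
import math
import statistics

# The ten observations from the solution, already in numerical order.
data = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

ordered = sorted(data)
n = len(ordered)

# With an even number of observations the two middle ones are the
# (n/2)-th and (n/2 + 1)-th, which are indices n//2 - 1 and n//2 here.
fifth, sixth = ordered[n // 2 - 1], ordered[n // 2]
median = (fifth + sixth) / 2

print(fifth, sixth, round(median, 2))

# Cross-check against the standard library's implementation.
assert math.isclose(median, statistics.median(data))
```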


EXAMPLE 7

Compute the sample median for the data of Note 2.13 "Example 4".


Solution:

The data in numerical order are

\begin{array}{lllllllllllllllllll}0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 2 & 2 & 3 & 3 & 3 & 4\end{array}

The number of observations is 19, which is odd, so there is one middle measurement, the tenth. Since the tenth measurement is 2, the median is \widetilde{x}=2.

It is important to note that we could have computed the median without first explicitly listing all the observations in the data set. We already saw in Note 2.13 "Example 4" how to find the number of observations directly from the frequencies listed in the table: n=3+6+6+3+1=19. As above, since n is odd the median is the middle observation, the tenth. The second line of the table in Note 2.13 "Example 4" shows that when the data are listed in order there will be three 0s followed by six 1s, which account for the first nine observations, and then six 2s, so the tenth observation is a 2. The median is therefore 2.
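The same counting argument can be carried out by accumulating frequencies until the running total reaches the position of the median. The following Python sketch shows one way to do this, assuming the values are listed in increasing order; the function names kth_from_frequencies and median_from_frequencies are hypothetical helpers, not part of the text.

```python
def kth_from_frequencies(values, frequencies, k):
    """Return the k-th smallest observation (counting from 1).

    Assumes `values` is in increasing order and each frequency is a
    non-negative integer.
    """
    running_total = 0
    for value, freq in zip(values, frequencies):
        running_total += freq
        if k <= running_total:
            return value
    raise ValueError("k exceeds the number of observations")


def median_from_frequencies(values, frequencies):
    """Median of data given as a frequency table, without listing the data."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: the single middle observation, at position (n + 1) / 2.
        return kth_from_frequencies(values, frequencies, (n + 1) // 2)
    # Even n: the mean of the two middle observations.
    lower = kth_from_frequencies(values, frequencies, n // 2)
    upper = kth_from_frequencies(values, frequencies, n // 2 + 1)
    return (lower + upper) / 2


# Frequency table matching the ordered data of Example 7.
values = [0, 1, 2, 3, 4]
frequencies = [3, 6, 6, 3, 1]

print(sum(frequencies))                              # number of observations
print(median_from_frequencies(values, frequencies))  # the median
```

For this table the running totals are 3, 9, and 15, so the tenth observation falls in the group of 2s, in agreement with the reasoning above.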

The relationship between the mean and the median for several common shapes of distributions is shown in Figure 2.8 "Skewness of Relative Frequency Histograms". The distributions in panels (a) and (b) are said to be symmetric: each curve is its own mirror image about its center. The distributions in the remaining two panels are said to be skewed. In each distribution we have drawn a vertical line that divides the area under the curve in half, which in accordance with Figure 2.7 "The Median" is located at the median. The following facts are true in general:

  • When the distribution is symmetric, as in panels (a) and (b) of Figure 2.8 "Skewness of Relative Frequency Histograms", the mean and the median are equal.
  • When the distribution is as shown in panel (c) of Figure 2.8 "Skewness of Relative Frequency Histograms", it is said to be skewed right. The mean has been pulled to the right of the median by the long "right tail" of the distribution, the few relatively large data values.
  • When the distribution is as shown in panel (d) of Figure 2.8 "Skewness of Relative Frequency Histograms", it is said to be skewed left. The mean has been pulled to the left of the median by the long "left tail" of the distribution, the few relatively small data values.

Figure 2.8 Skewness of Relative Frequency Histograms
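The three facts listed above can be illustrated numerically on small data sets. The sketch below uses hypothetical data chosen only to show the pattern; the numbers are not taken from the figure or from any example in the text.

```python
import statistics

# Hypothetical data sets, one for each of the three cases.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 2, 3, 12]    # a few relatively large values (long right tail)
skewed_left = [1, 10, 11, 11, 12]  # a few relatively small values (long left tail)

for name, data in [
    ("symmetric", symmetric),
    ("skewed right", skewed_right),
    ("skewed left", skewed_left),
]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean}, median = {median}")
```

In the symmetric set the mean and the median coincide, in the right-skewed set the mean lies above the median, and in the left-skewed set the mean lies below it.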