# Measures of Central Location

Site: | Saylor Academy |

Course: | MA121: Introduction to Statistics |

Book: | Measures of Central Location |


## Description

This section elaborates on mean, median, and mode at the population level and sample level. This section also contains many interesting examples of range, variance, and standard deviation. Complete the exercises and check your answers.

## Measures of Central Location

#### LEARNING OBJECTIVES

- To learn the concept of the "center" of a data set.
- To learn the meaning of each of three measures of the center of a data set - the mean, the median, and the mode - and how to compute each one.

This text was adapted by Saylor Academy under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License without attribution as requested by the work's original creator or licensor.

### The Mean

The first measure of central location is the usual "average" that is familiar to everyone. In the formula in the following definition we introduce the standard summation notation $\Sigma$, where $\Sigma$ is the capital Greek letter sigma. In general, the notation $\Sigma$ followed by a second mathematical symbol means to add up all the values that the second symbol can take in the context of the problem. Here is an example to illustrate this.

#### EXAMPLE 1

#### Solution:

In the definition we follow the convention of using lowercase $n$ to denote the number of measurements in a sample, which is called the **sample size**.

#### Definition

*The **sample mean** of a set of $n$ sample data is the number $\bar{x}$ defined by the formula*

$$\bar{x} = \frac{\Sigma x}{n}$$

#### EXAMPLE 2

Find the mean of the sample data

#### Solution:

#### EXAMPLE 3

A random sample of ten students is taken from the student body of a college and their GPAs are recorded as follows. Find the sample mean.

#### Solution:

#### EXAMPLE 4

A random sample of women beyond child-bearing age gave the following data, where $x$ is the number of children and $f$ is the frequency of that value, the number of times it occurred in the data set. Find the sample mean.

#### Solution:

In this example the data are presented by means of a data frequency table. Each number in the first line of the table is a number that appears in the data set; the number below it is how many times it occurs. Thus the value $0$ is observed three times, that is, three of the measurements in the data set are $0$, the value $1$ is observed six times, and so on. In the context of the problem this means that three women in the sample have had no children, six have had exactly one child, and so on. The explicit list of all the observations in this data set can therefore be written out from the table. The sample size can be read directly from the table, without first listing the entire data set, as the sum of the frequencies: $n = \Sigma f = 19$. The sample mean can be computed directly from the table as well, by weighting each value by its frequency:

$$\bar{x} = \frac{\Sigma x f}{n}$$
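The frequency-table calculation can be sketched in Python. This is a minimal illustration, assuming a hypothetical table `freq_table` (not the table from the example) stored as a dictionary that maps each value $x$ to its frequency $f$; the function names are illustrative as well.

```python
from fractions import Fraction

# Hypothetical frequency table (illustrative only, not the example's table):
# each key is an observed value x, each entry is its frequency f.
freq_table = {0: 2, 1: 5, 2: 4, 3: 1}


def sample_size(table):
    """The sample size n is the sum of the frequencies."""
    return sum(table.values())


def mean_from_table(table):
    """Sample mean computed as (sum of x * f) divided by (sum of f)."""
    n = sample_size(table)
    total = sum(x * f for x, f in table.items())
    return Fraction(total, n)


def expand(table):
    """Explicit list of all observations, useful as a cross-check."""
    return [x for x, f in sorted(table.items()) for _ in range(f)]


n = sample_size(freq_table)
xbar = mean_from_table(freq_table)
print(f"n = {n}, mean = {xbar} (about {float(xbar):.4f})")
```

Working with `Fraction` keeps the mean exact until it is rounded for display; listing the data with `expand` and averaging it directly gives the same value.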

In the examples above the data sets were described as samples. Therefore the means were sample means, denoted by $\bar{x}$. If the data come from a census, so that there is a measurement for every element of the population, then the mean is calculated by exactly the same process of summing all the measurements and dividing by how many of them there are, but it is now the *population mean* and is denoted by $\mu$, the lower case Greek letter mu.

#### Definition

*The **population mean** of a set of $N$ population data is the number $\mu$ defined by the formula*

$$\mu = \frac{\Sigma x}{N}$$

The mean of two numbers is the number that is halfway between them. For example, the average of the numbers 5 and 17 is $(5 + 17)/2 = 11$, which is $6$ units above $5$ and $6$ units below $17$. In this sense the average $11$ is the "center" of the data set $\{5, 17\}$. For larger data sets the mean can similarly be regarded as the "center" of the data.

### The Median

To see why another concept of average is needed, consider the following situation. Suppose we are interested in the average yearly income of employees at a large corporation. We take a random sample of seven employees, obtaining the sample data (rounded to the nearest hundred dollars, and expressed in thousands of dollars).

The mean (rounded to one decimal place) is $\bar{x} = 47.4$, but the statement "the average income of employees at this corporation is \$47,400" is surely misleading. It is approximately twice what six of the seven employees in the sample make and is nowhere near what any of them makes. It is easy to see what went wrong: the presence of the one executive in the sample, whose salary is so large compared to everyone else's, caused the numerator in the formula for the sample mean to be far too large, pulling the mean far to the right of where we think that the average "ought" to be. The executive's salary is called an **outlier**, a number that is far removed from most or all of the remaining measurements. Many times an outlier is the result of some sort of error, but not always, as is the case here. We would get a better measure of the "center" of the data if we were to arrange the data in numerical order,

then select the middle number in the list, in this case 24.6. The result is called the median of the data set, and has the property that roughly half of the measurements are larger than it is, and roughly half are smaller. In this sense it locates the center of the data. If there are an even number of measurements in the data set, then there will be two middle elements when all are lined up in order, so we take the mean of the middle two as the median. Thus we have the following definition.

#### Definition

*The **sample median** of a set of sample data for which there are an odd number of measurements is the middle measurement when the data are arranged in numerical order. The sample median of a set of sample data for which there are an even number of measurements is the mean of the two middle measurements when the data are arranged in numerical order.*

The population median is defined in a similar way, but we will not have occasion to refer to it again in this text.
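A minimal Python sketch of the definition follows. The income figures are hypothetical (they are not the sample data from the discussion above), and `sample_median` is an illustrative name rather than anything the text defines.

```python
def sample_median(data):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of measurements: a single middle measurement.
        return ordered[mid]
    # Even number of measurements: average the two middle measurements.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical incomes in thousands of dollars, with one very large value.
incomes = [22.0, 23.5, 24.0, 25.0, 26.5, 27.0, 180.0]

mean_income = sum(incomes) / len(incomes)
median_income = sample_median(incomes)
print(f"mean = {mean_income:.1f}, median = {median_income:.1f}")
```

In this made-up sample the single large value pulls the mean well above every other measurement, while the median stays among the typical values.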

The median is a value that divides the observations in a data set so that 50% of the data are on its left and the other 50% on its right. In accordance with Figure 2.6 "A Very Fine Relative Frequency Histogram", therefore, in the curve that represents the distribution of the data, a vertical line drawn at the median divides the area in two, area 0.5 (50% of the total area 1) to the left and area 0.5 (50% of the total area 1) to the right, as shown in Figure 2.7 "The Median". In our income example the median, $24,600, clearly gave a much better measure of the middle of the data set than did the mean $47,400. This is typical for situations in which the distribution is skewed. (Skewness and symmetry of distributions are discussed at the end of this subsection).

Figure 2.7 The Median

#### EXAMPLE 5

Compute the sample median for the data of Note 2.11 "Example 2".

#### Solution:

The data in numerical order are . The two middle measurements are and , so .

#### EXAMPLE 6

Compute the sample median for the data of Note 2.12 "Example 3".

#### Solution:

The data in numerical order are

The number of observations is ten, which is even, so there are two middle measurements, the fifth and sixth, which are and . Therefore the median of these data is .

#### EXAMPLE 7

Compute the sample median for the data of Note 2.13 "Example 4".

#### Solution:

The data in numerical order are

The number of observations is $19$, which is odd, so there is one middle measurement, the tenth. Since the tenth measurement is $2$, the median is $2$.

It is important to note that we could have computed the median without first explicitly listing all the observations in the data set. We already saw in Note 2.13 "Example 4" how to find the number of observations directly from the frequencies listed in the table: $n = \Sigma f = 19$. As just above we figure out that the median is the tenth observation. The second line of the table in Note 2.13 "Example 4" shows that when the data are listed in order there will be three $0$s followed by six $1$s, so the tenth observation is a $2$. The median is therefore $2$.
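The idea of locating the middle observation from the frequencies alone can be sketched with running (cumulative) totals. The table below is hypothetical and the function name `median_from_table` is an assumption made for illustration.

```python
def median_from_table(table):
    """Median of a {value: frequency} table, found via cumulative frequencies."""
    items = sorted(table.items())
    n = sum(f for _, f in items)

    def kth(k):
        """Return the k-th smallest observation (1-indexed)."""
        running = 0
        for x, f in items:
            running += f
            if k <= running:
                return x
        raise ValueError("k exceeds the sample size")

    if n % 2 == 1:
        # Odd n: the single middle observation is number (n + 1) / 2.
        return kth((n + 1) // 2)
    # Even n: average observations n / 2 and n / 2 + 1.
    return (kth(n // 2) + kth(n // 2 + 1)) / 2


# Hypothetical table with n = 13, so the median is the 7th observation.
freq_table = {0: 2, 1: 5, 2: 4, 3: 2}
print(median_from_table(freq_table))
```

Here the cumulative frequencies are 2, 7, 11, 13, so the seventh observation falls in the group of $1$s.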

The relationship between the mean and the median for several common shapes of distributions is shown in Figure 2.8 "Skewness of Relative Frequency Histograms". The distributions in panels (a) and (b) are said to be *symmetric* because of the symmetry that they exhibit. The distributions in the remaining two panels are said to be *skewed*. In each distribution we have drawn a vertical line that divides the area under the curve in half, which in accordance with Figure 2.7 "The Median" is located at the median. The following facts are true in general:

- When the distribution is symmetric, as in panels (a) and (b) of Figure 2.8 "Skewness of Relative Frequency Histograms", the mean and the median are equal.
- When the distribution is as shown in panel (c) of Figure 2.8 "Skewness of Relative Frequency Histograms", it is said to be *skewed right*. The mean has been pulled to the right of the median by the long "right tail" of the distribution, the few relatively large data values.
- When the distribution is as shown in panel (d) of Figure 2.8 "Skewness of Relative Frequency Histograms", it is said to be *skewed left*. The mean has been pulled to the left of the median by the long "left tail" of the distribution, the few relatively small data values.
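The pattern described in these three cases can be seen on small made-up data sets. The sketch below uses hypothetical numbers chosen only to mimic each shape; it illustrates the typical behavior and is not a proof.

```python
from statistics import mean, median

# Hypothetical data sets chosen only to illustrate the three shapes.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "skewed right": [1, 2, 3, 4, 20],
    "skewed left": [-14, 2, 3, 4, 5],
}


def mean_vs_median(data):
    """Describe where the mean falls relative to the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "mean right of median"
    if m < med:
        return "mean left of median"
    return "mean equals median"


for shape, data in samples.items():
    print(f"{shape}: mean={mean(data)}, median={median(data)}, {mean_vs_median(data)}")
```

All three samples share the median 3; only the single extreme value in the tail moves the mean.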

Figure 2.8 Skewness of Relative Frequency Histograms

### The Mode

Perhaps you have heard a statement like "The average number of automobiles owned by households in the United States is 1.37," and have been amused at the thought of a fraction of an automobile sitting in a driveway. In such a context the following measure for central location might make more sense.

#### Definition

The **sample mode** of a set of sample data is the most frequently occurring value.

The population mode is defined in a similar way, but we will not have occasion to refer to it again in this text.

On a relative frequency histogram, the highest point of the histogram corresponds to the mode of the data set. Figure 2.9 "Mode" illustrates the mode.

Figure 2.9 Mode

For any data set there is always exactly one mean and exactly one median. This need not be true of the mode; several different values could occur with the highest frequency, as we will see. It could even happen that every value occurs with the same frequency, in which case the concept of the mode does not make much sense.

#### EXAMPLE 8

Find the mode of the following data set.

#### Solution:

The value is most frequently observed and therefore the mode is .

#### EXAMPLE 9

Compute the sample mode for the data of Note 2.13 "Example 4".

#### Solution:

The two most frequently observed values in the data set are $1$ and $2$. Therefore the mode is a set of two values: $\{1, 2\}$.

The mode is a measure of central location since most real-life data sets have more observations near the center of the data range and fewer observations on the lower and upper ends. The value with the highest frequency is often in the middle of the data range.
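Because a data set can have more than one mode, a computation of the mode is most naturally written to return a list of values. The following is a minimal sketch; the helper name `sample_modes` and the data are hypothetical.

```python
from collections import Counter


def sample_modes(data):
    """Return every value that occurs with the highest frequency (non-empty data)."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(x for x, c in counts.items() if c == highest)


# Hypothetical data sets: one mode, two modes, and all values equally frequent.
print(sample_modes([4, 4, 5, 7]))        # a single mode
print(sample_modes([1, 1, 2, 2, 3]))     # two modes
print(sample_modes([6, 7, 8]))           # every value ties
```

When every value ties, the function returns the whole list of distinct values, which reflects the remark above that the mode is not a meaningful summary in that situation.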

### KEY TAKEAWAY

The mean, the median, and the mode each answer the question "Where is the center of the data set?" The nature of the data set, as indicated by a relative frequency histogram, determines which one gives the best answer.

### EXERCISES

#### BASIC

- Find the mean, the median, and the mode for the sample

- Find the mean, the median, and the mode for the sample

- Find the mean, the median, and the mode for the sample data represented by the table

#### APPLICATIONS

- Find the mean and the median for the LDL cholesterol level in a sample of ten heart patients.

- Find the mean, the median, and the mode for the number of vehicles owned in a survey of 52 households.

- Five laboratory mice with thymus leukemia are observed for a predetermined period of 500 days. After 450 days, three mice have died, and one of the remaining mice is sacrificed for analysis. By the end of the observational period, the last remaining mouse still survives. The recorded survival times for the five mice are

where * indicates that the mouse survived for at least the given number of days but the exact value of the observation is unknown.

  - Can you find the sample mean for the data set? If so, find it. If not, explain why not.
  - Can you find the sample median for the data set? If so, find it. If not, explain why not.

- Cordelia records her daily commute time to work each day, to the nearest minute, for two months, and obtains the following data.

  - Based on the frequencies, do you expect the mean and the median to be about the same or markedly different, and why?
  - Compute the mean, the median, and the mode.

- A man tosses a coin repeatedly until it lands heads and records the number of tosses required. (For example, if it lands heads on the first toss he records a 1; if it lands tails on the first two tosses and heads on the third he records a 3.) The data are shown.

  - Find the mean of the data.
  - Find the median of the data.

- Show that no matter what kind of average is used (mean, median, or mode) it is impossible for all members of a data set to be above average.

- Begin with the following set of data, call it Data Set I.

  - Compute the mean, median, and mode.
  - Form a new data set, Data Set II, by adding 3 to each number in Data Set I. Calculate the mean, median, and mode of Data Set II.
  - Form a new data set, Data Set III, by subtracting 6 from each number in Data Set I. Calculate the mean, median, and mode of Data Set III.
  - Comparing the answers to parts (a), (b), and (c), can you guess the pattern? State the general principle that you expect to be true.

#### LARGE DATA SET EXERCISES

Note: All of the data sets associated with these questions are missing, but the questions themselves are included here for reference.

- Large Data Set 1 lists the SAT scores of 1,000 students.

  - Regard the data as arising from a census of all students at a high school, in which the SAT score of every student was measured. Compute the population mean $\mu$.
  - Regard the first 25 observations as a random sample drawn from this population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.
  - Regard the next 25 observations as a random sample drawn from this population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.

- Large Data Sets 7, 7A, and 7B list the survival times in days of 140 laboratory mice with thymic leukemia from onset to death.

  - Compute the mean and median survival time for all mice, without regard to gender.
  - Compute the mean and median survival time for the 65 male mice (separately recorded in Large Data Set 7A).
  - Compute the mean and median survival time for the 75 female mice (separately recorded in Large Data Set 7B).

### ANSWERS

- Mean: $n x_{\min} \le \Sigma x$, so dividing by $n$ yields $x_{\min} \le \bar{x}$, so the minimum value is not above average. Median: the middle measurement, or average of the two middle measurements, is at least as large as $x_{\min}$, so the minimum value is not above average. Mode: the mode is one of the measurements, and is not greater than itself.

- a. , mode .

b. , mode

c.

d. If a number is added to every measurement in a data set, then the mean, median, and mode all change by that number.
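The principle in part (d) can be checked numerically. The sketch below uses a hypothetical stand-in data set (Data Set I from the exercise is not reproduced here) and illustrative helper names.

```python
from statistics import mean, median, mode


def summary(data):
    """Return (mean, median, mode) for a data set that has a single mode."""
    return mean(data), median(data), mode(data)


def shift(data, c):
    """Add the constant c to every measurement."""
    return [x + c for x in data]


data = [2, 4, 4, 5, 10]  # hypothetical stand-in, not Data Set I from the exercise

base = summary(data)
plus_three = summary(shift(data, 3))
minus_six = summary(shift(data, -6))
print(base, plus_three, minus_six)
```

Each of the three measures moves by exactly the constant that was added, which matches the stated principle for this sample.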

## Measures of Variability

#### LEARNING OBJECTIVES

- To learn the concept of the variability of a data set.
- To learn how to compute three measures of the variability of a data set: the range, the variance, and the standard deviation.

Look at the two data sets in Table 2.1 "Two Data Sets" and the graphical representation of each, called a dot plot, in Figure 2.10 "Dot Plots of Data Sets".

Table 2.1 Two Data Sets

| Data Set | Measurements |
|---|---|
| I | 40, 38, 42, 40, 39, 39, 43, 40, 39, 40 |
| II | 46, 37, 40, 33, 42, 36, 40, 47, 34, 45 |

*Figure 2.10 Dot Plots of Data Sets*

The two sets of ten measurements each center at the same value: they both have mean, median, and mode 40. Nevertheless a glance at the figure shows that they are markedly different. In Data Set I the measurements vary only slightly from the center, while for Data Set II the measurements vary greatly. Just as we have attached numbers to a data set to locate its center, we now wish to associate to each data set numbers that measure quantitatively how the data either scatter away from the center or cluster close to it. These new quantities are called measures of variability, and we will discuss three of them.
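The claim that both data sets share the same center can be verified directly. This short sketch uses the measurements from Table 2.1 and Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Measurements from Table 2.1 "Two Data Sets".
data_set_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
data_set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

centers = {
    "Data Set I": (mean(data_set_1), median(data_set_1), mode(data_set_1)),
    "Data Set II": (mean(data_set_2), median(data_set_2), mode(data_set_2)),
}

for name, (m, med, mo) in centers.items():
    print(f"{name}: mean={m}, median={med}, mode={mo}")
```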

### The Range

The first measure of variability that we discuss is the simplest.

#### Definition

*The **range** of a data set is the number $R$ defined by the formula*

$$R = x_{\max} - x_{\min}$$

*where $x_{\max}$ is the largest measurement in the data set and $x_{\min}$ is the smallest.*

#### EXAMPLE 10

Find the range of each data set in Table 2.1 "Two Data Sets".

#### Solution:

For Data Set I the maximum is 43 and the minimum is 38, so the range is $R = 43 - 38 = 5$.

For Data Set II the maximum is 47 and the minimum is 33, so the range is $R = 47 - 33 = 14$.

The range is a measure of variability because it indicates the size of the interval over which the data points are distributed. A smaller range indicates less variability (less dispersion) among the data, whereas a larger range indicates the opposite.
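As a quick check of the range computation, the following sketch applies the definition to the two data sets of Table 2.1. The function name `data_range` is an illustrative choice.

```python
def data_range(data):
    """Range R: largest measurement minus smallest measurement."""
    return max(data) - min(data)


# Measurements from Table 2.1 "Two Data Sets".
data_set_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
data_set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

range_1 = data_range(data_set_1)
range_2 = data_range(data_set_2)
print(f"Data Set I: R = {range_1}; Data Set II: R = {range_2}")
```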

### The Variance and the Standard Deviation

The other two measures of variability that we will consider are more elaborate and also depend on whether the data set is just a sample drawn from a much larger population or is the whole population itself (that is, a census).

#### Definition

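*The **sample variance** of a set of $n$ sample data is the number $s^2$ defined by the formula*

$$s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1}$$

*which by algebra is equivalent to the formula*

$$s^2 = \frac{\Sigma x^2 - \frac{1}{n}(\Sigma x)^2}{n - 1}$$

*The **sample standard deviation** of a set of $n$ sample data is the square root of the sample variance, hence is the number $s$ given by the formulas*

$$s = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\Sigma x^2 - \frac{1}{n}(\Sigma x)^2}{n - 1}}$$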

Although the first formula in each case looks less complicated than the second, the latter is easier to use in hand computations, and is called a **shortcut formula**.
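The algebra behind the shortcut formula is brief. Writing $\bar{x} = \Sigma x / n$ for the sample mean and expanding the square in the sum of squared deviations gives

$$
\begin{aligned}
\Sigma (x - \bar{x})^2 &= \Sigma x^2 - 2\bar{x}\,\Sigma x + n\bar{x}^2 \\
&= \Sigma x^2 - \frac{2}{n}(\Sigma x)^2 + \frac{1}{n}(\Sigma x)^2 \\
&= \Sigma x^2 - \frac{1}{n}(\Sigma x)^2 .
\end{aligned}
$$

Dividing both sides by $n - 1$ shows that the defining formula and the shortcut formula for the sample variance agree.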

#### EXAMPLE 11

Find the sample variance and the sample standard deviation of Data Set II in Table 2.1 "Two Data Sets".

#### Solution:

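To use the defining formula (the first formula) in the definition we first compute for each observation $x$ its deviation $x - \bar{x}$ from the sample mean. Since the mean of the data is $\bar{x} = 40$, we obtain the ten numbers displayed in the second line of the table below.

| $x$ | 46 | 37 | 40 | 33 | 42 | 36 | 40 | 47 | 34 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|
| $x - \bar{x}$ | $6$ | $-3$ | $0$ | $-7$ | $2$ | $-4$ | $0$ | $7$ | $-6$ | $5$ |

Then

$$\Sigma (x - \bar{x})^2 = 6^2 + (-3)^2 + 0^2 + (-7)^2 + 2^2 + (-4)^2 + 0^2 + 7^2 + (-6)^2 + 5^2 = 224$$

so

$$s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1} = \frac{224}{9} = 24.\overline{8}$$

and

$$s = \sqrt{24.\overline{8}} \approx 4.99$$

The student is encouraged to compute the ten deviations for Data Set I and verify that their squares add up to $20$, so that the sample variance and standard deviation of Data Set I are the much smaller numbers $s^2 = 20/9 = 2.\overline{2}$ and $s = \sqrt{20/9} \approx 1.49$.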
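The hand computation can be cross-checked in Python. The sketch below evaluates the sample variance of both data sets in Table 2.1 twice, once from the squared deviations and once from the sums $\Sigma x$ and $\Sigma x^2$; the function names are illustrative, and exact fractions are used so the two results can be compared without rounding.

```python
from fractions import Fraction
from math import sqrt

# Measurements from Table 2.1 "Two Data Sets".
data_set_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
data_set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]


def variance_definition(data):
    """Sample variance from the squared deviations about the mean."""
    n = len(data)
    xbar = Fraction(sum(data), n)
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def variance_shortcut(data):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - Fraction(sum_x ** 2, n)) / (n - 1)


for name, data in [("Data Set I", data_set_1), ("Data Set II", data_set_2)]:
    s2 = variance_definition(data)
    assert s2 == variance_shortcut(data)  # both formulas must agree
    print(f"{name}: s^2 = {s2} (about {float(s2):.4f}), s = {sqrt(s2):.4f}")
```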

#### EXAMPLE 12

Find the sample variance and the sample standard deviation of the ten GPAs in Note 2.12 "Example 3" in Section 2.2 "Measures of Central Location".

#### Solution:

Since

and

the shortcut formula gives

and

The sample variance has different units from the data. For example, if the units in the data set were inches, the new units would be inches squared, or square inches. It is thus primarily of theoretical importance and will not be considered further in this text, except in passing.

If the data set comprises the whole population, then the population standard deviation, denoted $\sigma$ (the lower case Greek letter sigma), and its square, the population variance $\sigma^2$, are defined as follows.

#### Definition

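*The **population variance** and **population standard deviation** of a set of $N$ population data are the numbers $\sigma^2$ and $\sigma$ defined by the formulas*

$$\sigma^2 = \frac{\Sigma (x - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\frac{\Sigma (x - \mu)^2}{N}}$$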

Note that the denominator in the fraction is the full number of observations, not that number reduced by one, as is the case with the sample standard deviation. Since most data sets are samples, we will always work with the sample standard deviation and variance.
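To see the effect of the different denominators, the sketch below treats Data Set II of Table 2.1 first as a sample and then, purely for illustration, as if it were a complete population. The function names are assumptions made for this example.

```python
from math import sqrt

# Data Set II from Table 2.1 "Two Data Sets".
data_set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]


def sum_squared_deviations(data):
    """Sum of squared deviations of the measurements about their mean."""
    center = sum(data) / len(data)
    return sum((x - center) ** 2 for x in data)


def sample_std(data):
    """Sample standard deviation: divide by n - 1 before taking the root."""
    return sqrt(sum_squared_deviations(data) / (len(data) - 1))


def population_std(data):
    """Population standard deviation: divide by the full count N."""
    return sqrt(sum_squared_deviations(data) / len(data))


s = sample_std(data_set_2)
sigma = population_std(data_set_2)
print(f"as a sample: s = {s:.4f}; as a population: sigma = {sigma:.4f}")
```

Dividing by the full count always gives the smaller of the two values for the same data, and the gap narrows as the number of observations grows.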

Finally, in many real-life situations the most important statistical issues have to do with comparing the means and standard deviations of two data sets. Figure 2.11 "Difference between Two Data Sets" illustrates how a difference in one or both of the sample mean and the sample standard deviation is reflected in the appearance of the data set, as shown by the curves derived from the relative frequency histograms built using the data.

Figure 2.11 Difference between Two Data Sets

### KEY TAKEAWAY

The range, the standard deviation, and the variance each give a quantitative answer to the question "How variable are the data?"

### EXERCISES

#### BASIC

1. Find the range, the variance, and the standard deviation for the following sample.

5. Find the range, the variance, and the standard deviation for the sample represented by the data frequency table.

#### APPLICATIONS

7. Find the range, the variance, and the standard deviation for the sample of ten IQ scores randomly selected from a school for academically gifted students.

#### ADDITIONAL EXERCISES

9. Consider the data set represented by the table

11. A random sample of 49 invoices for repairs at an automotive body shop is taken. The data are arrayed in the stem and leaf diagram shown. (Stems are thousands of dollars, leaves are hundreds, so that for example the largest observation is 3,800).

- Compute the mean, median, and mode.
- Compute the range.
- Compute the sample standard deviation.

13. A data set consisting of 25 measurements has standard deviation $0$. One of the measurements has value 17. What are the other 24 measurements?

15. Create a sample data set of size for which the sample variance is and the sample mean is .

17. The sample has mean and standard deviation . Create a sample data set of size for which and the standard deviation is less than .

#### LARGE DATA SET EXERCISES

19. Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.

http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

- Compute the range and sample standard deviation of the 1,000 SAT scores.
- Compute the range and sample standard deviation of the 1,000 GPAs.

21. a. Regard the data as arising from a census of all freshmen at a small college at the end of their first academic year of college study, in which the GPA of every such person was measured. Compute the population range and population standard deviation $\sigma$.