Numerical Measures of Central Tendency and Variability

 Site: Saylor Academy Course: MA121: Introduction to Statistics Book: Numerical Measures of Central Tendency and Variability

Description

Read these sections and complete the questions at the end of each section. First, we will define central tendency and introduce mean, median, and mode. We will then elaborate on median and mean and discuss their strengths and weaknesses in measuring central tendency. Finally, we'll address variability, range, interquartile range, variance, and the standard deviation.

Central Tendency

Central tendency is a loosely defined concept that has to do with the location of the center of a distribution. The section "What is Central Tendency" presents three definitions of the center of a distribution. "Measures of Central Tendency" presents the three most common measures of the center of the distribution. The three simulations that follow relate the definitions of the center of a distribution to the commonly used measures of central tendency. The findings from these simulations are summarized in the section "Mean and Median". The "Mean and Median" demonstration allows you to explore how the relative size of the mean and the median depends on the skew of the distribution.

Less frequently used measures of central tendency can be valuable supplements to the more commonly used measures. Some of these measures are presented in "Additional Measures". Finally, the last section compares and summarizes differences among measures of central tendency.

Source: David M. Lane, https://onlinestatbook.com/2/summarizing_distributions/central_tendency.html
This work is in the Public Domain.

Measures of Central Tendency

Learning Objectives

1. Compute mean
2. Compute median
3. Compute mode

In the previous section we saw that there are several ways to define central tendency. This section defines the three most common measures of central tendency: the mean, the median, and the mode. The relationships among these measures of central tendency and the definitions given in the previous section will probably not be obvious to you. Rather than just tell you these relationships, we will allow you to discover them in the simulations in the sections that follow.

This section gives only the basic definitions of the mean, median, and mode. A further discussion of the relative merits and proper applications of these statistics is presented in a later section.

Arithmetic Mean

The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol "$\mu$" is used for the mean of a population. The symbol "$\mathrm{M}$" is used for the mean of a sample. The formula for $\mu$ is shown below:

$\mu=\Sigma \mathrm{X} / \mathrm{N}$

where $\Sigma \mathrm{X}$ is the sum of all the numbers in the population and $N$ is the number of numbers in the population.

The formula for $M$ is essentially identical:

$\mathrm{M}=\Sigma X / N$

where $\Sigma \mathrm{X}$ is the sum of all the numbers in the sample and $\mathrm{N}$ is the number of numbers in the sample.

As an example, the mean of the numbers $1,2,3,6,8$ is $20 / 5=4$ regardless of whether the numbers constitute the entire population or just a sample from the population.
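The formula can be checked directly in R. This is a minimal sketch; the variable names `x` and `manual_mean` are illustrative assumptions and do not come from the original text.

```r
# Numbers from the example above
x <- c(1, 2, 3, 6, 8)

# Sum of the numbers divided by the number of numbers
manual_mean <- sum(x) / length(x)
manual_mean
# [1] 4

# The built-in mean() gives the same result
mean(x)
# [1] 4
```

The same call applies whether `x` holds a whole population or a sample, since the two formulas are identical.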

Table 1 shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is 20.4516 as shown below.

$\begin{aligned} \mu &=\Sigma \mathrm{X} / \mathrm{N} \\ &=634 / 31 \\ &=20.4516 \end{aligned}$

Table 1. Number of touchdown passes.

 37 33 33 32 29 28 28 23 22 22 22 21 21 21 20 20 19 19 18 18 18 18 16 15 14 14 14 12 12 9 6

Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far the most commonly used. Therefore, if the term "mean" is used without specifying whether it is the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the arithmetic mean.

Median

The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it. For the data in Table 1, there are $31$ scores. The 16th highest score (which equals $20$) is the median because there are $15$ scores below the $\mathrm{16th}$ score and $15$ scores above the $\mathrm{16th}$ score. The median can also be thought of as the $\mathrm{50th}$ percentile.

Computation of the Median

When there is an odd number of numbers, the median is simply the middle number. For example, the median of $2$, $4$, and $7$ is $4$. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers $2,4,7,12$ is $(4+7) / 2=5.5$. When there are numbers with the same values, then the formula for the third definition of the $\mathrm{50th}$ percentile should be used.
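The odd and even cases described above can be sketched as a small R function. The helper name `manual_median` is hypothetical and is shown only to make the rule explicit; in practice the built-in `median()` does the same job.

```r
# Hypothetical helper: sort the values, then pick the middle one(s)
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    # Odd count: the median is the middle number
    s[(n + 1) / 2]
  } else {
    # Even count: the median is the mean of the two middle numbers
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

manual_median(c(2, 4, 7))
# [1] 4
manual_median(c(2, 4, 7, 12))
# [1] 5.5
```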

Mode

The mode is the most frequently occurring value. For the data in Table 1, the mode is 18 since more teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data such as response time measured to many decimals, the frequency of each value is one since no two scores will be exactly the same (see discussion of continuous variables). Therefore the mode of continuous data is normally computed from a grouped frequency distribution. Table 2 shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).

Table 2. Grouped frequency distribution.

Range Frequency
500-600 3
600-700 6
700-800 5
800-900 5
900-1000 0
1000-1100 1
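R has no built-in function for the statistical mode, so one possible approach is to tabulate the values and pick the most frequent one. The sketch below does this for the touchdown data in Table 1 and then finds the grouped mode from the frequencies in Table 2. The variable names (`counts`, `mode_td`, `lower`, `upper`, `freq`, `grouped_mode`) are assumptions made for illustration.

```r
# Touchdown passes from Table 1
td <- c(37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
        19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6)

# Count how often each value occurs and keep the most frequent value(s)
counts <- table(td)
mode_td <- as.numeric(names(counts)[counts == max(counts)])
mode_td
# [1] 18
max(counts)
# [1] 4

# Grouped mode from Table 2: midpoint of the interval with the highest frequency
lower <- c(500, 600, 700, 800, 900, 1000)
upper <- lower + 100
freq  <- c(3, 6, 5, 5, 0, 1)
i <- which.max(freq)
grouped_mode <- (lower[i] + upper[i]) / 2
grouped_mode
# [1] 650
```

Note that `which.max()` returns only the first maximum, so this sketch would need adjusting for a distribution with tied intervals.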

R code

td=c(37,33,33,32,29,28,28,23,22,22,22,21,21,21,20,20,19,19,18,18,18,18,16,15,14,14,14,12,12,9,6)

mean(td)
[1] 20.45161

median(td)
[1] 20

quantile(td, probs = c(.5), type = 6)
50%
20

z=c(2, 4, 7, 12)

median(z)
[1] 5.5

Questions

Question 1 out of 5.
What is the mean of 2, 4, 6, and 8?

________

Question 2 out of 5.
What is the median of -2, 4, 0, 3, and 8?

________

Question 3 out of 5.
What is the mode of -2, 4, 0, 3, 0, 2, 4, 4, and 8?

________

Question 4 out of 5.
Tom's test scores on his six tests are 95, 80, 75, 97, 75, 88. Which measure of central tendency would be the highest?

________

Question 5 out of 5.

Jane's test scores on her five tests are 90, 87, 70, 97, and 75. Her teacher is going to take the median of the test grades to calculate her final grade. Jane thinks she can argue and get two points back on some of the tests. Which test score(s) should she argue?

• 90
• 87
• 70
• 97
• 75
• As many as she can

1. $(2+4+6+8) / 4=5$

2. 3
Because there are 5 numbers, the median is the middle number when they are ranked from lowest to highest.

3. 4
The value 4 occurs three times, more often than any other value.

4. Mean = 85, Median = 84, Mode = 75, so the mean of his scores is the highest.

5. If the teacher is going to use the median as the final grade, she should only argue the middle score (87). Changing the other scores by 2 points would not affect the median.

Median and Mean

Learning Objectives

1. State when the mean and median are the same
2. State whether it is the mean or median that minimizes the mean absolute deviation
3. State whether it is the mean or median that minimizes the mean squared deviation
4. State whether it is the mean or median that is the balance point on a balance scale

In the section "What is central tendency," we saw that the center of a distribution could be defined three ways: (1) the point on which a distribution would balance, (2) the value whose average absolute deviation from all the other values is minimized, and (3) the value whose average squared difference from all the other values is minimized. From the simulation in this chapter, you discovered (we hope) that the mean is the point on which a distribution would balance, the median is the value that minimizes the sum of absolute deviations, and the mean is the value that minimizes the sum of the squared deviations.

Table 1 shows the absolute and squared deviations of the numbers $2, \, 3, \, 4, \, 9,$ and $16$ from their median of $\mathrm{4}$ and their mean of 6.8. You can see that the sum of absolute deviations from the median ($\mathrm{20}$) is smaller than the sum of absolute deviations from the mean ($\mathrm{22.8}$). On the other hand, the sum of squared deviations from the median ($\mathrm{174}$) is larger than the sum of squared deviations from the mean ($\mathrm{134.8}$).

Table 1. Absolute and squared deviations from the median of $\mathrm{4}$ and the mean of $\mathrm{6.8}$.

Value Absolute Deviation from Median Absolute Deviation from Mean Squared Deviation from Median Squared Deviation from Mean
2 2 4.8 4 23.04
3 1 3.8 1 14.44
4 0 2.8 0 7.84
9 5 2.2 25 4.84
16 12 9.2 144 84.64
Total 20 22.8 174 134.8
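The totals in the table can be reproduced with a few lines of R. This is a sketch for checking the arithmetic; the variable names are illustrative assumptions.

```r
x <- c(2, 3, 4, 9, 16)
med <- median(x)  # 4
m   <- mean(x)    # 6.8

# Sums of absolute deviations
abs_from_median <- sum(abs(x - med))
abs_from_mean   <- sum(abs(x - m))
abs_from_median
# [1] 20
abs_from_mean
# [1] 22.8

# Sums of squared deviations
sq_from_median <- sum((x - med)^2)
sq_from_mean   <- sum((x - m)^2)
sq_from_median
# [1] 174
sq_from_mean
# [1] 134.8
```

The median gives the smaller sum of absolute deviations, and the mean gives the smaller sum of squared deviations, as the text states.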

Figure 1 shows that the distribution balances at the mean of $\mathrm{6.8}$ and not at the median of $\mathrm{4}$. The relative advantages and disadvantages of the mean and median are discussed in the section "Comparing Measures" later in this chapter.

Figure 1. The distribution balances at the mean of $\mathrm{6.8}$ and not at the median of $\mathrm{4.0}$.

When a distribution is symmetric, then the mean and the median are the same. Consider the following distribution: $\mathrm{1, \, 3, \, 4, \, 5, \, 6, \, 7, \, 9}$. The mean and median are both $\mathrm{5}$. The mean, median, and mode are identical in the bell-shaped normal distribution.
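Both the balance-point property and the symmetric example can be checked numerically. One way to think of "balancing" is that the signed deviations on either side cancel out, which happens at the mean but not, in general, at the median. The sketch below assumes this interpretation; the variable names are illustrative.

```r
# Skewed example from Table 1: balances at the mean, not the median
x <- c(2, 3, 4, 9, 16)
dev_from_mean   <- sum(x - mean(x))    # signed deviations cancel at the mean
dev_from_median <- sum(x - median(x))  # they do not cancel at the median
round(dev_from_mean, 10)
# [1] 0
dev_from_median
# [1] 14

# Symmetric example: mean and median coincide
sym <- c(1, 3, 4, 5, 6, 7, 9)
mean(sym)
# [1] 5
median(sym)
# [1] 5
```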

Questions

Question 1 out of 7.

The value that minimizes the sum of absolute deviations is the:

• mean
• median
• mode

Question 2 out of 7.

The point on which a distribution would balance is the:

• mean
• median
• mode

Question 3 out of 7.

The value that minimizes the sum of the squared deviations is the:

• mean
• median
• mode

Question 4 out of 7.

When are the mean and the median the same?

• When the distribution is very large
• When the distribution is symmetric
• When the distribution is skewed
• When the number that minimizes the sum of the squared deviations is the same as the balancing point
• Never

Question 5 out of 7.

For the numbers 17, 9, 20, 15, and 11, the number which minimizes the absolute deviations is:

__________

Question 6 out of 7.

For the numbers 20, 32, 18, 43, and 27, the number which minimizes the squared deviations is:

__________

Question 7 out of 7.

You have a distribution with a mean of 6.5, a median of 7, and a mode of 4. At what point does this distribution balance?

__________

1. This is a definition of the median.

2. This is a definition of the mean.

3. This is a definition of the mean.

4. The mean and the median are only the same when a distribution is symmetric. The mean and median are different when the distribution is skewed.

5. 15: The median minimizes the absolute deviations. To find the median, order the numbers from smallest to largest, and then find the middle number.

6. The mean minimizes the squared deviations. To find the mean, find the sum of the values (140) and divide by number of values in your data set (5). $140/5 = 28$

7. The mean (in this case, 6.5) is the point at which a distribution balances.

Variability

Variability refers to how much the numbers in a distribution differ from each other. The most common measures are presented in "Measures of Variability". The "variability demo" allows you to change the standard deviation of a distribution and view a graph of the changed distribution.

One of the more counter-intuitive facts in introductory statistics is that the formula for variance when computed in a population is biased when applied in a sample. The "Estimating Variance Simulation" shows concretely why this is the case.

Measures of Variability

Learning Objectives

1. Determine the relative variability of two distributions
2. Compute the range
3. Compute the inter-quartile range
4. Compute the variance in the population
5. Estimate the variance from a sample
6. Compute the standard deviation from the variance

What is Variability?

Variability refers to how "spread out" a group of scores is. To see what we mean by spread out, consider the graphs in Figure 1. These graphs represent the scores on two quizzes. The mean score for each quiz is $\mathrm{7.0}$. Despite the equality of means, you can see that the distributions are quite different. Specifically, the scores on Quiz 1 are more densely packed and those on Quiz 2 are more spread out. The differences among students were much greater on Quiz 2 than on Quiz 1.

Figure 1. Bar charts of two quizzes.

The terms variability, spread, and dispersion are synonyms, and refer to how spread out a distribution is. Just as in the section on central tendency where we discussed measures of the center of a distribution of scores, in this chapter we will discuss measures of the variability of a distribution. There are four frequently used measures of variability: the range, interquartile range, variance, and standard deviation. In the next few paragraphs, we will look at each of these four measures of variability in more detail.

Range

The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is simply the highest score minus the lowest score. Let's take a few examples. What is the range of the following group of numbers: $\mathrm{10, \, 2, \, 5, \, 6, \, 7, \, 3, \, 4}$? Well, the highest number is $\mathrm{10}$, and the lowest number is $\mathrm{2}$, so $\mathrm{10 - 2 = 8}$. The range is $\mathrm{8}$. Let's take another example. Here's a dataset with $\mathrm{10}$ numbers: $\mathrm{99, \, 45, \, 23, \, 67, \, 45, \, 91, \, 82, \, 78, \, 62, \, 51}$. What is the range? The highest number is $\mathrm{99}$ and the lowest number is $\mathrm{23}$, so $\mathrm{99 - 23}$ equals $\mathrm{76}$; the range is $\mathrm{76}$. Now consider the two quizzes shown in Figure 1. On Quiz 1, the lowest score is $\mathrm{5}$ and the highest score is $\mathrm{9}$. Therefore, the range is $\mathrm{4}$. The range on Quiz 2 was larger: the lowest score was $\mathrm{4}$ and the highest score was $10$. Therefore the range is $\mathrm{6}$.
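The first two examples can be reproduced in R. Note that R's `range()` function returns the minimum and maximum rather than their difference, so the sketch below subtracts them explicitly. The variable names are illustrative assumptions.

```r
x <- c(10, 2, 5, 6, 7, 3, 4)
range_x <- max(x) - min(x)
range_x
# [1] 8

y <- c(99, 45, 23, 67, 45, 91, 82, 78, 62, 51)
range(y)          # returns the lowest and highest scores
# [1] 23 99
range_y <- diff(range(y))
range_y
# [1] 76
```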

Interquartile Range

The interquartile range (IQR) is the range of the middle $\mathrm{50\%}$ of the scores in a distribution. It is computed as follows:

$\mathrm{IQR} = \mathrm{75th} \, \; \text{percentile} - \mathrm{25th} \; \text{percentile}$

For Quiz 1, the $\mathrm{75th}$ percentile is $\mathrm{8}$ and the $\mathrm{25th}$ percentile is $\mathrm{6}$. The interquartile range is therefore $\mathrm{2}$. For Quiz 2, which has greater spread, the $\mathrm{75th}$ percentile is $\mathrm{9}$, the $\mathrm{25th}$ percentile is $\mathrm{5}$, and the interquartile range is $\mathrm{4}$. Recall that in the discussion of box plots, the $\mathrm{75th}$ percentile was called the upper hinge and the $\mathrm{25th}$ percentile was called the lower hinge. Using this terminology, the interquartile range is referred to as the H-spread.

A related measure of variability is called the semi-interquartile range. The semi-interquartile range is defined simply as the interquartile range divided by $\mathrm{2}$. If a distribution is symmetric, the median plus or minus the semi-interquartile range contains half the scores in the distribution.
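The Quiz 1 figures can be checked in R using the Quiz 1 scores listed in this section. The sketch below uses `type = 6` for the percentile definition, matching the R code at the end of this section; other `type` values can give different percentiles for the same data. The names `quartiles`, `iqr_q1`, and `semi_iqr` are illustrative assumptions.

```r
# Quiz 1 scores
q1 <- c(9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5)

# 25th and 75th percentiles
quartiles <- unname(quantile(q1, probs = c(0.25, 0.75), type = 6))
quartiles
# [1] 6 8

# Interquartile range and semi-interquartile range
iqr_q1 <- IQR(q1, type = 6)
iqr_q1
# [1] 2
semi_iqr <- iqr_q1 / 2
semi_iqr
# [1] 1
```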

Variance

Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean. The data from Quiz 1 are shown in Table 1. The mean score is $\mathrm{7.0}$. Therefore, the column "Deviation from Mean" contains the score minus $\mathrm{7}$. The column "Squared Deviation" is simply the previous column squared.

Table 1. Calculation of Variance for Quiz 1 scores.

Scores Deviation from Mean Squared Deviation
9 2 4
9 2 4
9 2 4
8 1 1
8 1 1
8 1 1
8 1 1
7 0 0
7 0 0
7 0 0
7 0 0
7 0 0
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
5 -2 4
5 -2 4
Means 7 0 1.5

One thing that is important to notice is that the mean deviation from the mean is 0. This will always be the case. The mean of the squared deviations is $\mathrm{1.5}$. Therefore, the variance is $\mathrm{1.5}$. Analogous calculations with Quiz 2 show that its variance is $\mathrm{6.7}$. The formula for the variance is:

$\sigma^{2}=\frac{\sum(X-\mu)^{2}}{N}$

where $\sigma^{2}$ is the variance, $\mu$ is the mean, and $N$ is the number of numbers. For Quiz 1, $\mu=7$ and $N=20$.

If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:

$s^{2}=\frac{\sum(X-M)^{2}}{N-1}$

where $s^{2}$ is the estimate of the variance and $M$ is the sample mean. Note that $M$ is the mean of a sample taken from a population with a mean of $\mu$. Since, in practice, the variance is usually computed in a sample, this formula is most often used. The simulation "estimating variance" illustrates the bias in the formula with $N$ in the denominator.
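The difference between the two formulas can be seen with the Quiz 1 scores. R's `var()` function uses the $N-1$ denominator, so the population variance has to be computed by hand. This is a sketch; the names `mu`, `pop_var`, and `est_var` are illustrative assumptions.

```r
# Quiz 1 scores from Table 1
q1 <- c(9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5)
N  <- length(q1)  # 20
mu <- mean(q1)    # 7

# The mean deviation from the mean is always 0
mean(q1 - mu)
# [1] 0

# Population variance: average squared deviation (denominator N)
pop_var <- sum((q1 - mu)^2) / N
pop_var
# [1] 1.5

# Estimate from a sample: denominator N - 1, which is what var() computes
est_var <- sum((q1 - mu)^2) / (N - 1)
est_var
# [1] 1.578947
var(q1)
# [1] 1.578947
```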

Let's take a concrete example. Assume the scores $1, 2, 4,$ and $5$ were sampled from a larger population. To estimate the variance in the population you would compute $s^{2}$ as follows:

$\begin{aligned} M &=(1+2+4+5) / 4=12 / 4=3 \\ s^{2} &=\left[(1-3)^{2}+(2-3)^{2}+(4-3)^{2}+(5-3)^{2}\right] /(4-1) \\ &=(4+1+1+4) / 3=10 / 3=3.333 \end{aligned}$

There are alternate formulas that can be easier to use if you are doing your calculations with a hand calculator. You should note that these formulas are subject to rounding error if your values are very large and/or you have an extremely large number of observations.

$\sigma^{2}=\frac{\sum X^{2}-\frac{\left(\sum X\right)^{2}}{N}}{N}$

and

$s^{2}=\frac{\sum X^{2}-\frac{\left(\sum X\right)^{2}}{N}}{N-1}$

For this example,

$\begin{aligned} &\sum X^{2}=1^{2}+2^{2}+4^{2}+5^{2}=46 \\ &\frac{\left(\sum X\right)^{2}}{N}=\frac{(1+2+4+5)^{2}}{4}=\frac{144}{4}=36 \\ &\sigma^{2}=\frac{(46-36)}{4}=2.5 \\ &s^{2}=\frac{(46-36)}{3}=3.333 \text { as with the other formula } \end{aligned}$
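The alternate formulas can be checked in R against the built-in `var()` function. This is a sketch for the four-score example; the names `ss`, `sigma2`, and `s2` are illustrative assumptions.

```r
x <- c(1, 2, 4, 5)
N <- length(x)

# Numerator shared by both alternate formulas: sum(X^2) - (sum X)^2 / N
ss <- sum(x^2) - sum(x)^2 / N
ss
# [1] 10

sigma2 <- ss / N        # population variance
sigma2
# [1] 2.5

s2 <- ss / (N - 1)      # estimate from a sample
s2
# [1] 3.333333

# var() uses the N - 1 denominator, so it matches s2
var(x)
# [1] 3.333333
```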

Standard Deviation

The standard deviation is simply the square root of the variance. Taking the square roots of the variances computed above ($\mathrm{1.5}$ and $\mathrm{6.7}$) gives standard deviations of $\mathrm{1.225}$ and $\mathrm{2.588}$ for the two quiz distributions. The R output at the end of this section reports $\mathrm{1.257}$ and $\mathrm{2.203}$ instead, because sd() divides by $N-1$ and is applied to the vectors q1 and q2 as entered there. The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal (see Chapter on Normal Distributions) because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. For example, approximately $\mathrm{68\%}$ of the distribution is within one standard deviation of the mean, and approximately $\mathrm{95\%}$ of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of $\mathrm{50}$ and a standard deviation of $\mathrm{10}$, then about $\mathrm{68\%}$ of the distribution would be between $50 - 10 = 40$ and $50 + 10 = 60$. Similarly, about $\mathrm{95\%}$ of the distribution would be between $50 - 2 \times 10 = 30$ and $50 + 2 \times 10 = 70$. The symbol for the population standard deviation is $\sigma$; the symbol for an estimate computed in a sample is $\mathrm{s}$. Figure 2 shows two normal distributions. The red distribution has a mean of $\mathrm{40}$ and a standard deviation of $\mathrm{5}$; the blue distribution has a mean of $\mathrm{60}$ and a standard deviation of $\mathrm{10}$. For the red distribution, about $\mathrm{68\%}$ of the distribution is between $\mathrm{35}$ and $\mathrm{45}$; for the blue distribution, about $\mathrm{68\%}$ is between $\mathrm{50}$ and $\mathrm{70}$.

Figure 2. Normal distributions with standard deviations of $\mathrm{5}$ and $\mathrm{10}$.
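The 68% and 95% figures for a normal distribution can be verified with R's `pnorm()` function, which returns the proportion of a normal distribution below a given value. This is a sketch; the names `within_1sd`, `within_2sd`, and `between_40_60` are illustrative assumptions.

```r
# Proportion within one and two standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)
within_1sd
# [1] 0.6826895
within_2sd <- pnorm(2) - pnorm(-2)
within_2sd
# [1] 0.9544997

# Same calculation for a mean of 50 and a standard deviation of 10
between_40_60 <- pnorm(60, mean = 50, sd = 10) - pnorm(40, mean = 50, sd = 10)
between_40_60
# [1] 0.6826895
```

The proportions depend only on the number of standard deviations from the mean, so the same values apply to the red and blue distributions in Figure 2.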

R code

q1=c(9,9,9,8,8,8,8,7,7,7,7,7,6,6,6,6,6,6,5,5)
IQR(q1, type = 6)
[1] 2
x=c(1,2,4,5)
var(x)
[1] 3.333333
sd(q1)
[1] 1.256562
q2=c(10,10,9,9,9,8,8,8,7,7,7,6,6,6,5,5,4,4,3,3)
sd(q2)
[1] 2.202869

Questions

Question 1 out of 4.

What is the range of 2, 4, 6, and 8?

________

Question 2 out of 4.

Would the variance of 10, 12, 17, 20, 25, 27, 42, and 45 be larger if the numbers represented a population or a sample?

• Population
• Sample

Question 3 out of 4.

What is the standard deviation of this sample?

Y: 8, 5, 11, 11, 15, 9, 7, 18

________

Question 4 out of 4.

What is the interquartile range of these numbers?

Z: 12, 13, 14, 15, 9, 10, 16, 10, 8, 10, 11, 12, 13, 22, 23, 24, 25

________

1. $8 - 2 = 6$
2. The variance would be larger if these numbers represented a sample because you would divide by $\mathrm{N-1}$ (instead of just $\mathrm{N}$).