Once data has been collected and categorized, visualizations and fundamental calculations help describe the data. The visualization approaches (such as bar charts, histograms, and box plots) and calculations (such as mean, median, and standard deviation) introduced here will be revisited and implemented using Python.
Measures of the Spread of the Data
An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.
The standard deviation
- provides a numerical measure of the overall amount of variation in a data set and
- can be used to determine whether a particular data value is close to or far from the mean.
The standard deviation provides a measure of the overall variation in a data set.
The standard deviation is always positive or zero. The standard deviation is small when all the data are concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at Supermarket A and Supermarket B. The average wait time at both supermarkets is five minutes. At Supermarket A, the standard deviation for the wait time is two minutes; at Supermarket B, the standard deviation for the wait time is four minutes.
Because Supermarket B has a higher standard deviation, we know that there is more variation in the wait times at Supermarket B. Overall, wait times at Supermarket B are more spread out from the average whereas wait times at Supermarket A are more concentrated near the average.
The standard deviation can be used to determine whether a data value is close to or far from the mean.
Suppose that both Rosa and Binh shop at Supermarket A. Rosa waits at the checkout counter for seven minutes, and Binh waits for one minute. At Supermarket A, the mean waiting time is five minutes, and the standard deviation is two minutes. A z-score is a standardized score that lets us compare data sets. It tells us how many standard deviations a data value is from the mean and is calculated as the ratio of the difference between a particular score and the population mean to the population standard deviation: z = (x – μ) / σ.
We can use the given information to create the table below.
Supermarket | Population Standard Deviation, σ | Individual Score, x | Population Mean, μ |
---|---|---|---|
Supermarket A | 2 minutes | 7, 1 | 5 |
Supermarket B | 4 minutes | | 5 |
Table 2.31
We need the values from the first row to determine the number of standard deviations above or below the mean each individual wait time is; we can do so by calculating two different z-scores.
Rosa waited for seven minutes, so the z-score representing this deviation from the population mean may be calculated as z = (x – μ) / σ = (7 – 5) / 2 = 1. Rosa's wait time is one standard deviation above the mean.
Binh waited for one minute, so the z-score representing this deviation from the population mean may be calculated as z = (x – μ) / σ = (1 – 5) / 2 = –2. Binh's wait time is two standard deviations below the mean.
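These two z-scores can be checked with a short Python function (Python is used throughout, per the chapter introduction); the mean and standard deviation come from the Supermarket A example above:

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Supermarket A: mean wait 5 minutes, standard deviation 2 minutes
print(z_score(7, 5, 2))  # Rosa: 1.0, one standard deviation above the mean
print(z_score(1, 5, 2))  # Binh: -2.0, two standard deviations below the mean
```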
A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average. Considering data to be far from the mean if they are more than two standard deviations away is more of an approximate rule of thumb than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is farther away than two standard deviations. You will learn more about this in later chapters.
The number line may help you understand standard deviation. If we were to put five and seven on a number line, seven is to the right of five. We say, then, that seven is one standard deviation to the right of five because 5 + (1)(2) = 7.
If one were also part of the data set, then one is two standard deviations to the left of five because 5 + (–2)(2) = 1.

- In general, a value = mean + (#ofSTDEVs)(standard deviation),
- where #ofSTDEVs = the number of standard deviations.
- #ofSTDEVs does not need to be an integer.
- One is two standard deviations less than the mean of five because 1 = 5 + (–2)(2).
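The relationship value = mean + (#ofSTDEVs)(standard deviation) is easy to sketch in Python; the mean of 5 and standard deviation of 2 below come from the supermarket example:

```python
def value_at(k, mean, sd):
    """Value that lies k standard deviations from the mean.

    k may be negative (below the mean) or fractional.
    """
    return mean + k * sd

# Supermarket A: mean 5 minutes, standard deviation 2 minutes
print(value_at(1, 5, 2))   # 7: one standard deviation above the mean
print(value_at(-2, 5, 2))  # 1: two standard deviations below the mean
```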
Calculating the Standard Deviation
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar but not identical. Therefore, the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lowercase letter s represents the sample standard deviation and the Greek letter σ (lowercase sigma) represents the population standard deviation. If the sample has the same characteristics as the population, then s can be used to estimate σ.
To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the squares of the deviations (the x – x̄ values for a sample, or the x – μ values for a population). The standard deviation is the square root of the variance.
Formulas for the Sample Standard Deviation

s = √( Σ(x – x̄)² / (n – 1) ) or s = √( Σ f(x – x̄)² / (n – 1) )

- For the sample standard deviation, the denominator is n – 1, that is, one less than the sample size.

Formulas for the Population Standard Deviation

σ = √( Σ(x – μ)² / N ) or σ = √( Σ f(x – μ)² / N )

- For the population standard deviation, the denominator is N, the number of items in the population.
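Python's standard library makes the sample/population distinction explicit: `statistics.stdev` divides by n – 1 and `statistics.pstdev` divides by N. A minimal sketch with made-up data:

```python
import statistics

data = [1, 2, 4, 4, 5, 8]  # illustrative values; mean is 4

s = statistics.stdev(data)       # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(data)  # population standard deviation (divides by N)

print(round(s, 4))      # 2.4495, sqrt(30 / 5)
print(round(sigma, 4))  # 2.2361, sqrt(30 / 6)
```

The sample value is always at least as large as the population value for the same numbers, because dividing by the smaller n – 1 inflates the variance slightly.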
Types of Variability in Samples
- Observational or measurement variability
- Natural variability
- Induced variability
- Sample variability
Example 1: Measurement variability
Measurement variability occurs when there are differences in the instruments used to measure or in the people using those instruments. If we are gathering data on how long it takes for a ball to drop from a height by having students measure the time of the drop with a stopwatch, we may experience measurement variability if the two stopwatches used were made by different manufacturers. For example, one stopwatch measures to the nearest second, whereas the other one measures to the nearest tenth of a second. We also may experience measurement variability because two different people are gathering the data. Their reaction times in pressing the button on the stopwatch may differ; thus, the outcomes will vary accordingly. The differences in outcomes may be affected by measurement variability.
Example 2: Natural variability
Natural variability arises from the differences that naturally occur because members of a population differ from each other. For example, if we have two identical corn plants and we expose both plants to the same amount of water and sunlight, they may still grow at different rates simply because they are two different corn plants. The difference in outcomes may be explained by natural variability.
Example 3: Induced variability
Induced variability is the counterpart to natural variability. This occurs because we have artificially induced an element of variation that, by definition, was not present naturally. For example, we assign people to two different groups to study memory, and we induce a variable in one group by limiting the amount of sleep they get. The difference in outcomes may be affected by induced variability.
Example 4: Sample variability

Sample variability occurs when several random samples are drawn from the same population. For example, if four different random samples of 50 people are surveyed, the four sample means will likely differ slightly even though every sample comes from the same population. The difference in outcomes may be explained by sample variability.
Sampling Variability of a Statistic
NOTE

How much a statistic, such as the sample mean, varies from one random sample to another is known as the sampling variability of the statistic. You will learn more about this in later chapters.
Example 2.33

The ages, in years, of n = 20 students are summarized in the frequency table below (Table 2.32). The average age is 10.53 years, rounded to two decimal places.
Data | Frequency | Deviations | Deviations² | (Frequency)(Deviations²) |
---|---|---|---|---|
x | f | (x – x̄) | (x – x̄)² | (f)(x – x̄)² |
9 | 1 | 9 – 10.525 = –1.525 | (–1.525)2 = 2.325625 | 1 × 2.325625 = 2.325625 |
9.5 | 2 | 9.5 – 10.525 = –1.025 | (–1.025)2 = 1.050625 | 2 × 1.050625 = 2.101250 |
10 | 4 | 10 – 10.525 = –.525 | (–.525)2 = .275625 | 4 × .275625 = 1.1025 |
10.5 | 4 | 10.5 – 10.525 = –.025 | (–.025)2 = .000625 | 4 × .000625 = .0025 |
11 | 6 | 11 – 10.525 = .475 | (.475)2 = .225625 | 6 × .225625 = 1.35375 |
11.5 | 3 | 11.5 – 10.525 = .975 | (.975)2 = .950625 | 3 × .950625 = 2.851875 |
The total of the last column, (f)(x – x̄)², is 9.7375.
Table 2.32
The sample variance is the total of the last column divided by n – 1:

s² = 9.7375 / (20 – 1) = 0.5125

The sample standard deviation s is equal to the square root of the sample variance:

s = √0.5125 ≈ 0.72, rounded to two decimal places.

Typically, you do the calculation for the standard deviation on your calculator or computer. The intermediate results are not rounded. This is done for accuracy.
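As the chapter introduction notes, these calculations will be implemented in Python. A quick check of the table's result, expanding the ages from the frequency column:

```python
import statistics

# Ages from Table 2.32, expanded from the frequency column
ages = [9]*1 + [9.5]*2 + [10]*4 + [10.5]*4 + [11]*6 + [11.5]*3

x_bar = statistics.mean(ages)  # 10.525
s = statistics.stdev(ages)     # sqrt(9.7375 / 19) ≈ 0.7159

print(x_bar, round(s, 2))      # 10.525 0.72
```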
- For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation). Verify the mean and standard deviation on a calculator or computer. Note that these formulas are derived by algebraically manipulating the z-score formulas, given either parameters or statistics.
- For a sample: x = x̄ + (#ofSTDEVs)(s)
- For a population: x = μ + (#ofSTDEVs)(σ)
- For this example, use x = x̄ + (#ofSTDEVs)(s) because the data are from a sample.
- Verify the mean and standard deviation on your calculator or computer.
- Find the value that is one standard deviation above the mean, x̄ + 1s.
- Find the value that is two standard deviations below the mean, x̄ – 2s.
- Find the values that are 1.5 standard deviations from (below and above) the mean.
Solution
- Using the TI-83, 83+, 84, 84+ Calculator
- Clear lists L1 and L2. Press STAT 4:ClrList. Enter 2nd 1 for L1, the comma (,), and 2nd 2 for L2.
- Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the name. Press CLEAR and arrow down.
- Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list L2. Use the arrow keys to move around.
- Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the comma. Press ENTER.
- The one-variable statistics output gives x̄ = 10.525.
- Use Sx because this is sample data (not a population): Sx = 0.715891.
- Using the unrounded values, x̄ + 1s ≈ 11.24, x̄ – 2s ≈ 9.09, and the values 1.5 standard deviations below and above the mean are approximately 9.45 and 11.60.
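The same answers can be obtained in Python, applying value = mean + (#ofSTDEVs)(standard deviation) with the unrounded statistics:

```python
import statistics

# Ages from Table 2.32, expanded from the frequency column
ages = [9]*1 + [9.5]*2 + [10]*4 + [10.5]*4 + [11]*6 + [11.5]*3
x_bar = statistics.mean(ages)
s = statistics.stdev(ages)

# value = mean + (#ofSTDEVs)(standard deviation), using unrounded x_bar and s
for k in (1, -2, -1.5, 1.5):
    print(f"{k:+} SD: {x_bar + k * s:.2f}")
```

This prints 11.24 for one standard deviation above the mean, 9.09 for two below, and 9.45 and 11.60 for 1.5 standard deviations below and above.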
Try It 2.33

For the following data, use your calculator or computer to find the sample mean and the sample standard deviation:

21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32, 33, 33, 34, 35, 36, 36, 36, 36, 38, 38, 38, 40
Explanation of the standard deviation calculation shown in the table
The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.
NOTE
The standard deviation, s or σ, is either zero or larger than zero. Describing the data with reference to the spread is called variability. The variability in data depends on the method by which the outcomes are obtained, for example, by measuring or by random sampling. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when all the data are concentrated close to the mean and larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ very large.
Example 2.34

Use the following data:

33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100
- Create a chart containing the data, frequencies, relative frequencies, and cumulative relative frequencies to three decimal places.
- Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:
- The sample mean
- The sample standard deviation
- The median
- The first quartile
- The third quartile
- IQR
- Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the histogram, and the chart.
Solution
- See Table 2.33.
- Entering the data values into a list in your graphing calculator and then selecting Stat, Calc, and 1-Var Stats will produce the one-variable statistics you need.
- The x-axis goes from 32.5 to 100.5; the y-axis goes from –2.4 to 15 for the histogram. The number of intervals is 5, so the width of an interval is (100.5 – 32.5) divided by 5, equal to 13.6. Endpoints of the intervals are as follows: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 = 59.7, 59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending value; no data values fall on an interval boundary.
Figure 2.27
Data | Frequency | Relative Frequency | Cumulative Relative Frequency |
---|---|---|---|
33 | 1 | .032 | .032 |
42 | 1 | .032 | .064 |
49 | 2 | .065 | .129 |
53 | 1 | .032 | .161 |
55 | 2 | .065 | .226 |
61 | 1 | .032 | .258 |
63 | 1 | .032 | .290 |
67 | 1 | .032 | .322 |
68 | 2 | .065 | .387 |
69 | 2 | .065 | .452 |
72 | 1 | .032 | .484 |
73 | 1 | .032 | .516 |
74 | 1 | .032 | .548 |
78 | 1 | .032 | .580 |
80 | 1 | .032 | .612 |
83 | 1 | .032 | .644 |
88 | 3 | .097 | .741 |
90 | 1 | .032 | .773 |
92 | 1 | .032 | .805 |
94 | 4 | .129 | .934 |
96 | 1 | .032 | .966 |
100 | 1 | .032 | .998 (Why isn't this value 1?) |
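The frequency columns of Table 2.33 can be rebuilt with `collections.Counter`, which also answers the parenthetical question: because each relative frequency is rounded to three decimal places before being accumulated, the rounding errors add up and the final cumulative value is .998 rather than 1.

```python
from collections import Counter

# Exam scores from Example 2.34
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

counts = Counter(scores)
n = len(scores)  # 31

cumulative = 0.0
for value in sorted(counts):
    # Round each relative frequency to three decimal places, as in Table 2.33;
    # accumulating the rounded values is why the final entry is .998, not 1
    rel = round(counts[value] / n, 3)
    cumulative = round(cumulative + rel, 3)
    print(value, counts[value], rel, cumulative)
```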
Try It 2.34
Standard Deviation of Grouped Frequency Tables
Example 2.35

Find the standard deviation for the grouped data in the table below. The mean of the data, x̄ = 7.58, is shown in the table.

Class | Frequency, f | Midpoint, m | m² | Mean, x̄ | fm² | Standard Deviation |
---|---|---|---|---|---|---|
0–2 | 1 | 1 | 1 | 7.58 | 1 | 3.5 |
3–5 | 6 | 4 | 16 | 7.58 | 96 | 3.5 |
6–8 | 10 | 7 | 49 | 7.58 | 490 | 3.5 |
9–11 | 7 | 10 | 100 | 7.58 | 700 | 3.5 |
12–14 | 0 | 13 | 169 | 7.58 | 0 | 3.5 |
15–17 | 2 | 16 | 256 | 7.58 | 512 | 3.5 |
Table 2.34
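The Standard Deviation column of Table 2.34 can be reproduced in Python. This sketch takes the mean 7.58 shown in the table as given and assumes the computational form of the sample formula for grouped data, s = √((Σfm² – n·x̄²)/(n – 1)), which is consistent with the table's entries:

```python
import math

# Columns f and m from Table 2.34
freqs = [1, 6, 10, 7, 0, 2]
midpoints = [1, 4, 7, 10, 13, 16]

n = sum(freqs)  # 26 data values in all
sum_fm2 = sum(f * m**2 for f, m in zip(freqs, midpoints))  # 1799, total of fm² column
x_bar = 7.58    # the mean shown in the table

# Sample standard deviation for grouped data (computational form)
s = math.sqrt((sum_fm2 - n * x_bar**2) / (n - 1))
print(round(s, 1))  # 3.5, matching the Standard Deviation column
```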
Try It 2.35

Find the standard deviation for the data in Table 2.35.
Class | Frequency, f |
---|---|
0–2 | 1 |
3–5 | 6 |
6–8 | 10 |
9–11 | 7 |
12–14 | 0 |
15–17 | 2 |
Table 2.35
Comparing Values from Different Data Sets
- For each data value, calculate how many standard deviations away from its mean the value is.
- In symbols, the formulas for calculating z-scores become the following.
- For a sample: z = (x – x̄) / s
- For a population: z = (x – μ) / σ
Example 2.36

Two students, John and Ali, attend different schools. Their GPAs, along with the mean GPA and standard deviation at each school, are shown in the table below. Which student performed better when compared with his school?
Student | GPA | School Mean GPA | School Standard Deviation |
---|---|---|---|
John | 2.85 | 3.0 | .7 |
Ali | 77 | 80 | 10 |
Solution
For John, z = (x – μ) / σ = (2.85 – 3.0) / 0.7 ≈ –0.21.

For Ali, z = (77 – 80) / 10 = –0.3.
John has the better GPA when compared with his school because his GPA is 0.21 standard deviations below his school's mean, while Ali's GPA is 0.3 standard deviations below his school's mean.
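The comparison can be sketched in Python; the values come from the table above:

```python
def z_score(x, mean, sd):
    """Standard deviations between a value and its group's mean."""
    return (x - mean) / sd

john = z_score(2.85, 3.0, 0.7)  # ≈ -0.21
ali = z_score(77, 80, 10)       # -0.3
print(round(john, 2), round(ali, 2))
```

The z-scores put GPAs measured on different scales on a common footing: John's score is closer to zero, so his standing relative to his school is better even though the raw numbers cannot be compared directly.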
Try It 2.36

Two swimmers, Angie and Beth, are on different teams. Their times, along with each team's mean time and standard deviation, are shown in the table below. Which swimmer performed better when compared with her team?
Swimmer | Time (seconds) | Team Mean Time | Team Standard Deviation |
---|---|---|---|
Angie | 26.2 | 27.2 | .8 |
Beth | 27.3 | 30.1 | 1.4 |
Table 2.38
For any data set, no matter what the distribution of the data is, the following are true:
- At least 75 percent of the data is within two standard deviations of the mean.
- At least 89 percent of the data is within three standard deviations of the mean.
- At least 95 percent of the data is within 4.5 standard deviations of the mean.
- This is known as Chebyshev's Rule.
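Chebyshev's Rule can be checked directly, since it holds for any distribution. A small sketch with a deliberately skewed, made-up data set:

```python
import statistics

def within_k(data, k):
    """Fraction of the data within k standard deviations of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)

# A deliberately skewed data set with one outlier; Chebyshev's Rule still holds
data = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]

print(within_k(data, 2) >= 0.75)  # True: at least 75% within 2 SDs
print(within_k(data, 3) >= 8/9)   # True: at least 89% within 3 SDs
```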
For data having a distribution that is bell-shaped and symmetric, the following are true:
- Approximately 68 percent of the data is within one standard deviation of the mean.
- Approximately 95 percent of the data is within two standard deviations of the mean.
- More than 99 percent of the data is within three standard deviations of the mean.
- This is known as the Empirical Rule.
- It is important to note that this rule applies only when the shape of the distribution of the data is bell-shaped and symmetric; we will learn more about this when studying the Normal or Gaussian probability distribution in later chapters.
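The Empirical Rule can be previewed by simulation. This sketch draws a bell-shaped data set from a normal distribution (the mean of 100 and standard deviation of 15 are illustrative values, not from the text) and counts how much of the data falls within one, two, and three standard deviations:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
# Simulate a bell-shaped data set: 10,000 draws from a normal distribution
data = [random.gauss(100, 15) for _ in range(10_000)]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# Fraction of the data within k standard deviations of the mean
fracs = {k: sum(abs(x - mu) <= k * sigma for x in data) / len(data)
         for k in (1, 2, 3)}
for k, frac in fracs.items():
    print(f"within {k} SD: {frac:.3f}")  # roughly .68, .95, and more than .99
```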