Descriptive Statistics

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: Descriptive Statistics
Date: Wednesday, May 22, 2024, 4:01 PM

Description

Once data has been collected and categorized, visualizations and fundamental calculations help describe the data. The visualization approaches (such as bar charts, histograms, and box plots) and calculations (such as mean, median, and standard deviation) introduced here will be revisited and implemented using Python.

Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The stem consists of the leading digit(s), while the leaf consists of a final significant digit. For example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem. Make sure the leaves show a space between values, so that the exact data values may be easily determined. The frequency of data values for each stem provides information about the shape of the distribution.


Example 2.1

For Susan Dean's spring precalculus class, scores for the first exam were as follows (smallest to largest):
33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100

Stem | Leaf
3 | 3
4 | 2 9 9
5 | 3 5 5
6 | 1 3 7 8 8 9 9
7 | 2 3 4 8
8 | 0 3 8 8 8
9 | 0 2 4 4 4 4 6
10 | 0

Table 2.1 Stem-and-Leaf Graph

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26 percent (\frac{8}{31}) were in the 90s or 100, a fairly high number of As.
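
The stem-and-leaf construction can also be sketched in Python. The following is a minimal illustration, assuming whole-number data where the leaf is the last digit; the names stemplot and plot are hypothetical helpers chosen for this sketch, not part of any library.

```python
from collections import defaultdict

# Exam scores from Example 2.1.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]


def stemplot(values):
    """Group whole numbers by stem (leading digits) and leaf (last digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    # Keep stems that have no leaves so gaps in the data stay visible.
    return {stem: groups.get(stem, [])
            for stem in range(min(groups), max(groups) + 1)}


plot = stemplot(scores)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")

# Share of scores in the 90s or at 100, as discussed in the example.
high_share = (len(plot[9]) + len(plot[10])) / len(scores)
print(f"Share of scores in the 90s or 100: {high_share:.0%}")
```

Because the values are sorted before they are split, the leaves for each stem come out in increasing order, which matches the way the plot is written by hand.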

Try It 2.1
For the Park City basketball team, scores for the last 30 games were as follows (smallest to largest):
32, 32, 33, 34, 38, 40, 42, 42, 43, 44, 46, 47, 47, 48, 48, 48, 49, 50, 50, 51, 52, 52, 52, 53, 54, 56, 57, 57, 60, 61
Construct a stemplot for the data.

The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes, for example, writing 50 instead of 500, while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.

Example 2.2
The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot using the data.

1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5, 4.5, 4.7, 4.8, 5.5, 5.6, 6.5, 6.7, 12.3

Do the data seem to have any concentration of values?

The leaves are to the right of the decimal.

Solution
The value 12.3 may be an outlier. Values appear to concentrate at 3 and 4 kilometers.

Stem | Leaf
1 | 1 5
2 | 3 5 7
3 | 2 3 3 5 8
4 | 0 2 5 5 7 8
5 | 5 6
6 | 5 7
7 |
8 |
9 |
10 |
11 |
12 | 3

Table 2.2

Try It 2.2
The data below show the distances (in miles) from the homes of high school students to the school. Create a stemplot using the following data and identify any outliers.

0.5, 0.7, 1.1, 1.2, 1.2, 1.3, 1.3, 1.5, 1.5, 1.7, 1.7, 1.8, 1.9, 2.0, 2.2, 2.5, 2.6, 2.8, 2.8, 2.8, 3.5, 3.8, 4.4, 4.8, 4.9, 5.2, 5.5, 5.7, 5.8, 8.0

Example 2.3
A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the stems. Table 2.3 and Table 2.4 show the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using these data.

President Age President Age President Age
Washington 57 Lincoln 52 Hoover 54
J. Adams 61 A. Johnson 56 F. Roosevelt 51
Jefferson 57 Grant 46 Truman 60
Madison 57 Hayes 54 Eisenhower 62
Monroe 58 Garfield 49 Kennedy 43
J. Q. Adams 57 Arthur 51 L. Johnson 55
Jackson 61 Cleveland 47 Nixon 56
Van Buren 54 B. Harrison 55 Ford 61
W. H. Harrison 68 Cleveland 55 Carter 52
Tyler 51 McKinley 54 Reagan 69
Polk 49 T. Roosevelt 42 G.H.W. Bush 64
Taylor 64 Taft 51 Clinton 47
Fillmore 50 Wilson 56 G. W. Bush 54
Pierce 48 Harding 55 Obama 47
Buchanan 65 Coolidge 51
     
Table 2.3 Presidential Ages at Inauguration

President Age President Age President Age
Washington 67 Lincoln 56 Hoover 90
J. Adams 90 A. Johnson 66 F. Roosevelt 63
Jefferson 83 Grant 63 Truman 88
Madison 85 Hayes 70 Eisenhower 78
Monroe 73 Garfield 49 Kennedy 46
J. Q. Adams 80 Arthur 56 L. Johnson 64
Jackson 78 Cleveland 71 Nixon 81
Van Buren 79 B. Harrison 67 Ford 93
W. H. Harrison 68 Cleveland 71 Reagan 93
Tyler 71 McKinley 58
Polk 53 T. Roosevelt 60
Taylor 65 Taft 72
Fillmore 74 Wilson 67
Pierce 64 Harding 57
Buchanan 77 Coolidge 60
       
Table 2.4 Presidential Age at Death

Solution
Ages at Inauguration | Stem | Ages at Death
9 9 8 7 7 7 6 3 2 | 4 | 6 9
8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 1 1 1 1 1 0 | 5 | 3 6 6 7 7 8
9 5 4 4 2 1 1 1 0 | 6 | 0 0 3 3 4 4 5 6 7 7 7 8
 | 7 | 0 0 1 1 1 4 7 8 8 9
 | 8 | 0 1 3 5 8
 | 9 | 0 0 3 3

Table 2.5

Notice that the leaf values increase in order, from right to left, for leaves shown to the left of the stem, while the leaf values increase in order from left to right, for leaves shown to the right of the stem.

Try It 2.3
The table shows the number of wins and losses a sports team has had in 42 seasons. Create a side-by-side stem-and-leaf plot of these wins and losses.

Losses Wins Year Losses Wins Year
34 48 1968–1969 41 41 1989–1990
34 48 1969–1970 39 43 1990–1991
46 36 1970–1971 44 38 1991–1992
46 36 1971–1972 39 43 1992–1993
36 46 1972–1973 25 57 1993–1994
47 35 1973–1974 40 42 1994–1995
51 31 1974–1975 36 46 1995–1996
53 29 1975–1976 26 56 1996–1997
51 31 1976–1977 32 50 1997–1998
41 41 1977–1978 19 31 1998–1999
36 46 1978–1979 54 28 1999–2000
32 50 1979–1980 57 25 2000–2001
51 31 1980–1981 49 33 2001–2002
40 42 1981–1982 47 35 2002–2003
39 43 1982–1983 54 28 2003–2004
42 40 1983–1984 69 13 2004–2005
48 34 1984–1985 56 26 2005–2006
32 50 1985–1986 52 30 2006–2007
25 57 1986–1987 45 37 2007–2008
32 50 1987–1988 35 47 2008–2009
30 52 1988–1989 29 53 2009–2010

Table 2.6

Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in Example 2.4, the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points. The frequency points are connected using line segments.

Example 2.4
In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her chores. The results are shown in Table 2.7 and in Figure 2.2.

Number of Times Teenager Is Reminded Frequency
0 2
1 5
2 8
3 14
4 7
5 4

Table 2.7

A line graph showing the number of times a teenager needs to be reminded to do chores on the x-axis and frequency on the y-axis.
Figure 2.2

Try It 2.4
In a survey, 40 people were asked how many times per year they had their car in the shop for repairs. The results are shown in Table 2.8. Construct a line graph.

Number of Times in Shop Frequency
0 7
1 10
2 14
3 9

Table 2.8

Bar graphs consist of bars that are separated from each other. The bars can be rectangles, or they can be rectangular boxes, used in three-dimensional plots, and they can be vertical or horizontal. The bar graph shown in Example 2.5 has age-groups represented on the x-axis and proportions on the y-axis.

Example 2.5
By the end of 2011, a social media site had more than 146 million users in the United States. Table 2.9 shows three age-groups, the number of users in each age-group, and the proportion (percentage) of users in each age-group. Construct a bar graph using this data.

Age-Groups Number of Site Users Proportion (%) of Site Users
13–25 65,082,280 45%
26–44 53,300,200 36%
45–64 27,885,100 19%

Table 2.9

Solution
This is a bar graph that matches the supplied data. The x-axis shows age groups, and the y-axis shows the percentages of site users.
Figure 2.3

Try It 2.5
The population in Park City is made up of children, working-age adults, and retirees. Table 2.10 shows the three age-groups, the number of people in the town from each age-group, and the proportion (%) of people in each age-group. Construct a bar graph showing the proportions.

Age-Groups Number of People Proportion of Population
Children 67,059 19%
Working-age adults 152,198 43%
Retirees 131,662 38%

Table 2.10

Example 2.6
The columns in Table 2.11 contain the race or ethnicity of students in U.S. public schools for the class of 2011, percentages for the Advanced Placement (AP) examinee population for that class, and percentages for the overall student population. Create a bar graph with the student race or ethnicity (qualitative data) on the x-axis and the AP examinee population percentages on the y-axis.

Race/Ethnicity AP Examinee Population Overall Student Population
1 = Asian, Asian American, or Pacific Islander 10.3% 5.7%
2 = Black or African American 9.0% 14.7%
3 = Hispanic or Latino 17.0% 17.6%
4 = American Indian or Alaska Native 0.6% 1.1%
5 = White 57.1% 59.2%
6 = Not reported/other 6.0% 1.7%

Table 2.11

Solution
This is a bar graph that matches the supplied data. The x-axis shows race and ethnicity, and the y-axis shows the AP examinee population percentages.
Figure 2.4

Try It 2.6
Park City is broken down into six voting districts. The table shows the percentage of the total registered voter population that lives in each district as well as the percentage of the entire population that lives in each district. Construct a bar graph that shows the registered voter population by district.

District Registered Voter Population Overall City Population
1 15.5% 19.4%
2 12.2% 15.6%
3 9.8% 9.0%
4 17.4% 18.5%
5 22.8% 20.7%
6 22.3% 16.8%

Table 2.12

Example 2.7
Table 2.13 is a two-way table showing the types of pets owned by men and women.


Dogs Cats Fish Total
Men 4 2 2 8
Women 4 6 2 12
Total 8 8 4 20

Table 2.13

Given these data, calculate the marginal distributions of pets for the people surveyed.

Solution

Dogs = 8/20 = 0.4

Cats = 8/20 = 0.4

Fish = 4/20 = 0.2

Note - The sum of all the marginal distributions must equal one. In this case,

0.4 + 0.4 + 0.2 = 1;

therefore, the solution checks.
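
The marginal distribution above can be reproduced with a short Python sketch. It assumes the two-way table is stored as a nested dictionary; the variable names counts and marginal are illustrative choices only.

```python
# Two-way table from Example 2.7: rows are owners, columns are pet types.
counts = {
    "Men":   {"Dogs": 4, "Cats": 2, "Fish": 2},
    "Women": {"Dogs": 4, "Cats": 6, "Fish": 2},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of pet type: column total divided by grand total.
marginal = {
    pet: sum(row[pet] for row in counts.values()) / grand_total
    for pet in ["Dogs", "Cats", "Fish"]
}

for pet, share in marginal.items():
    print(f"{pet}: {share:.2f}")
print(f"Sum: {sum(marginal.values()):.2f}")
```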

Example 2.8
Table 2.14 is a two-way table showing the types of pets owned by men and women.


Dogs Cats Fish Total
Men 4 2 2 8
Women 4 6 2 12
Total 8 8 4 20

Table 2.14

Given these data, calculate the conditional distributions for the subpopulation of men who own each pet type.

Solution
Men who own dogs = 4/8 = 0.5

Men who own cats = 2/8 = 0.25

Men who own fish = 2/8 = 0.25

Note - The sum of all the conditional distributions must equal one. In this case,

0.5 + 0.25 + 0.25 = 1;

therefore, the solution checks.
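
A conditional distribution restricts attention to one row of the table and divides by that row's total rather than the grand total. The sketch below assumes the same nested-dictionary layout; conditional_distribution is a hypothetical helper written for this illustration.

```python
# Two-way table from Example 2.8: rows are owners, columns are pet types.
counts = {
    "Men":   {"Dogs": 4, "Cats": 2, "Fish": 2},
    "Women": {"Dogs": 4, "Cats": 6, "Fish": 2},
}


def conditional_distribution(table, group):
    """Return each category's share within a single row of the table."""
    row = table[group]
    row_total = sum(row.values())
    return {category: count / row_total for category, count in row.items()}


men = conditional_distribution(counts, "Men")
for pet, share in men.items():
    print(f"Men who own {pet.lower()}: {share}")
print("Sum:", sum(men.values()))
```

Calling the same function with "Women" gives the conditional distribution for the other subpopulation, with 12 as the divisor.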


Source: OpenStax, https://openstax.org/books/statistics/pages/2-introduction
This work is licensed under a Creative Commons Attribution 4.0 License.

Histograms, Frequency Polygons, and Time Series Graphs

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets.

A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is more or less a number line, labeled with what the data represents, for example, distance from your home to school. The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data. The shape of the data refers to the shape of the distribution, whether normal, approximately normal, or skewed in some direction, whereas the center is thought of as the middle of a data set, and the spread indicates how far the values are dispersed about the center. In a skewed distribution, the mean is pulled toward the tail of the distribution.

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. Remember, frequency is defined as the number of times an answer occurs. If

  • f = frequency,
  • n = total number of data values (or the sum of the individual frequencies), and
  • RF = relative frequency,


then

RF=\frac{f}{n}.

For example, if three students in Mr. Ahab's English class of 40 students received from 90 to 100 percent, then f = 3, n = 40, and RF = \frac{f}{n} = \frac{3}{40} = 0.075. Thus, 7.5 percent of the students received 90 to 100 percent. Ninety to 100 percent is a quantitative measure.

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The width of each bar is also referred to as the bin size, which may be calculated by dividing the range of the data values by the desired number of bins (or bars). There is not a set procedure for determining the number of bars or bar width/bin size; however, consistency is key when determining which data values to place inside each interval.


Example 2.9

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data since height is measured.

60, 60.5, 61, 61, 61.5,
63.5, 63.5, 63.5,
64, 64, 64, 64, 64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5,
66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5,
68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5, 69.5,
70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71,
72, 72, 72, 72.5, 72.5, 73, 73.5,
74

The smallest data value is 60, and the largest data value is 74. To make sure each is included in an interval, we can use 59.95 as the smallest value and 74.05 as the largest value, subtracting and adding 0.05 to these values, respectively. We have a small range here of 14.1 (74.05 – 59.95), so we will want fewer bins; let's say eight. So, 14.1 divided by eight bins gives a bin size (or interval size) of approximately 1.76.


NOTE

We will round up to two and make each bar or class interval two units wide. Rounding up to two is a way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that some follow for the number of bars or class intervals is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 and round to 12 bars or intervals.
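
The arithmetic behind the bin width and the square-root guideline can be checked with a few lines of Python. This is only a sketch of the calculation described in this example; the variable names are illustrative.

```python
import math

# Padded endpoints from Example 2.9: subtract and add 0.05.
low, high = 60 - 0.05, 74 + 0.05
desired_bins = 8

raw_width = (high - low) / desired_bins   # range divided by number of bins
bin_width = math.ceil(raw_width)          # round up, as the note describes

# Square-root guideline for the number of bars, using 150 data values.
suggested_bars = round(math.sqrt(150))

print(f"raw width = {raw_width:.4f}, rounded-up width = {bin_width}")
print(f"suggested number of bars for 150 values = {suggested_bars}")
```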

The boundaries are as follows:

  • 59.95
  • 59.95 + 2 = 61.95
  • 61.95 + 2 = 63.95
  • 63.95 + 2 = 65.95
  • 65.95 + 2 = 67.95
  • 67.95 + 2 = 69.95
  • 69.95 + 2 = 71.95
  • 71.95 + 2 = 73.95
  • 73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.

The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

Histogram consists of 8 bars with the y-axis in increments of 0.05 from 0 to 0.4 and the x-axis in intervals of 2 from 59.95 to 75.95.

Figure 2.5

Interval Frequency Relative Frequency
59.95–61.95 5 5/100 = 0.05
61.95–63.95 3 3/100 = 0.03
63.95–65.95 15 15/100 = 0.15
65.95–67.95 40 40/100 = 0.40
67.95–69.95 17 17/100 = 0.17
69.95–71.95 12 12/100 = 0.12
71.95–73.95 7 7/100 = 0.07
73.95–75.95 1 1/100 = 0.01

Table 2.15
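
The frequency table can be rebuilt in Python by assigning each height to its interval. In this sketch the 100 heights are stored compactly as value-to-count pairs (a convenience chosen here, not part of the example), and the names height_counts and frequencies are illustrative.

```python
# Heights from Example 2.9, stored as height -> number of players.
height_counts = {
    60: 1, 60.5: 1, 61: 2, 61.5: 1,
    63.5: 3,
    64: 7, 64.5: 8,
    66: 10, 66.5: 11, 67: 12, 67.5: 7,
    68: 2, 69: 10, 69.5: 5,
    70: 6, 70.5: 3, 71: 3,
    72: 3, 72.5: 2, 73: 1, 73.5: 1,
    74: 1,
}
heights = [h for h, count in height_counts.items() for _ in range(count)]

start, width, n_bins = 59.95, 2, 8
frequencies = [0] * n_bins
for h in heights:
    # Number of whole bin widths between the first boundary and the value.
    index = int((h - start) // width)
    frequencies[index] += 1

n = len(heights)
for i, f in enumerate(frequencies):
    lower = start + i * width
    print(f"{lower:.2f}-{lower + width:.2f}  {f:>3}  {f / n:.2f}")
```

Because the boundaries end in .95 and the heights are recorded to the nearest half inch, no height can land exactly on a boundary.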

Try It 2.9
The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe size is measured. Construct a histogram and calculate the width of each bar or class interval. Use six bars on the histogram.

9, 9, 9.5, 9.5, 10, 10, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5, 11.5, 11.5, 11.5, 11.5,
12, 12, 12, 12, 12, 12, 12, 12.5, 12.5, 12.5, 12.5, 14

Example 2.10
The following data are the number of books bought by 50 part-time college students at ABC College. The number of books is discrete data since books are counted.

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5,
6, 6

Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy four books. Five students buy five books. Two students buy six books.

Calculate the width of each bar/bin size/interval size.

Solution
The smallest data value is 1, and the largest data value is 6. To make sure each is included in an interval, we can use 0.5 as the smallest value and 6.5 as the largest value by subtracting and adding 0.5 to these values. We have a small range here of 6 (6.5 – 0.5), so we will want fewer bins; let's say six this time. So, six divided by six bins gives a bin size (or interval size) of one.

Notice that we may choose different rational numbers to add to, or subtract from, our maximum and minimum values when calculating bin size. In the previous example, we added and subtracted .05, while this time, we added and subtracted .5. Given a data set, you will be able to determine what is appropriate and reasonable.

The following histogram displays the number of books on the x-axis and the frequency on the y-axis.
Histogram consists of 6 bars with the y-axis in increments of 2 from 0-16 and the x-axis in intervals of 1 from 0.5-6.5.
Figure 2.6
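
For discrete data such as these, each bar corresponds to a single value, so counting the values is enough to recover the histogram. The sketch below uses Counter from Python's standard library and prints a rough text version of the bars; the layout of the printed output is an arbitrary choice.

```python
from collections import Counter

# Number of books bought by each of the 50 students in Example 2.10.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

low, high = min(books) - 0.5, max(books) + 0.5
bin_width = (high - low) / 6          # six bins, as chosen in the solution
frequency = Counter(books)

print(f"bin width = {bin_width}")
for value in sorted(frequency):
    left, right = value - 0.5, value + 0.5
    bar = "#" * frequency[value]
    print(f"{left}-{right}: {bar} ({frequency[value]})")
```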

Using the TI-83, 83+, 84, 84+ Calculator

Go to Appendix G. There are calculator instructions for entering data and for creating a customized histogram. Create the histogram for Example 2.10.

  • Press Y=. Press CLEAR to delete any equations.
  • Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1, press CLEAR and then arrow down. If necessary, do the same for L2.
  • Into L1, enter 1, 2, 3, 4, 5, 6. Note that these values represent the numbers of books.
  • Into L2, enter 11, 10, 16, 6, 5, 2. Note that these numbers represent the frequencies for the numbers of books.
  • Press WINDOW. Set Xmin = .5, Xmax = 6.5, Xscl = (6.5 – .5)/6, Ymin = –1, Ymax = 20, Yscl = 1, Xres = 1. The window settings are chosen to accurately and completely show the data value range and the frequency range.
  • Press 2nd Y=. Start by pressing 4:Plotsoff ENTER.
  • Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE. Arrow to the third picture (histogram). Press ENTER.
  • Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2 (2nd 2).
  • Press GRAPH.
  • Use the TRACE key and the arrow keys to examine the histogram.

Try It 2.10
The following data are the number of sports played by 50 student athletes. The number of sports is discrete data since sports are counted.

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3

Twenty student athletes play one sport. Twenty-two student athletes play two sports. Eight student athletes play three sports. Calculate a desired bin size for the data. Create a histogram and clearly label the endpoints of the intervals.

Example 2.11
Using this data set, construct a histogram.

Number of Hours My Classmates Spent Playing Video Games on Weekends
9.95 10 2.25 16.75 0
19.5 22.5 7.5 15 12.75
5.5 11 10 20.75 17.5
23 21.9 24 23.75 18
20 15 22.9 18.8 20.5

Table 2.16

Solution
This is a histogram that matches the supplied data. The x-axis consists of 5 bars in intervals of 5 from 0 to 25.
Figure 2.7

Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if it falls on the left boundary but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
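
The boundary rule can be made concrete in Python. The sketch below assumes five intervals of width 5 starting at 0, as in Figure 2.7, and uses floor division so that a value on a boundary is counted in the interval to its right (the interval whose left boundary it is).

```python
# Hours spent playing video games, from Table 2.16.
hours = [9.95, 10, 2.25, 16.75, 0,
         19.5, 22.5, 7.5, 15, 12.75,
         5.5, 11, 10, 20.75, 17.5,
         23, 21.9, 24, 23.75, 18,
         20, 15, 22.9, 18.8, 20.5]

width = 5
counts = [0] * 5   # intervals 0-5, 5-10, 10-15, 15-20, 20-25

for x in hours:
    # Floor division puts a boundary value such as 10 into the 10-15 interval.
    counts[int(x // width)] += 1

for i, c in enumerate(counts):
    print(f"{i * width} to under {(i + 1) * width}: {c}")
```

With this rule the value 10 is counted in the 10 to 15 interval, while 9.95 stays in the 5 to 10 interval. A researcher who instead includes the right boundary would get different counts from the same data.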

Try It 2.11
The following data represent the number of employees at various restaurants in New York City. Using this data, create a histogram.

22, 35, 15, 26, 40, 28, 18, 20, 25, 34, 39, 42, 24, 22, 19, 27, 22, 34, 40, 20, 38, 28


Collaborative Exercise
Count the money (bills and change) in your pocket or purse. Your instructor will record the amounts. As a class, construct a histogram displaying the data. Discuss how many intervals you think would be appropriate. You may want to experiment with the number of intervals.


Frequency Polygons

Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons.

To construct a frequency polygon, first examine the data and decide on the number of intervals and resulting interval size, for both the x-axis and y-axis. The x-axis will show the lower and upper bound for each interval, containing the data values, whereas the y-axis will represent the frequencies of the values. Each data point represents the frequency for each interval. For example, if an interval has three data values in it, the frequency polygon will show a point at a height of 3 for that interval (Example 2.12 plots each point at the interval's midpoint). After choosing the appropriate intervals, begin plotting the data points. After all the points are plotted, draw line segments to connect them.

Example 2.12
A frequency polygon was constructed from the frequency table below.

Frequency Distribution for Calculus Final Test Scores
Lower Bound Upper Bound Frequency Cumulative Frequency
49.5 59.5 5 5
59.5 69.5 10 15
69.5 79.5 30 45
79.5 89.5 40 85
89.5 99.5 15 100

Table 2.17

Figure 2.8

Notice that each point represents the frequency for a particular interval. These points are located halfway between the lower bound and upper bound. In fact, the horizontal axis, or x-axis, shows only these midpoint values. For the interval 49.5–59.5, the value 54.5 is represented by a point, showing the correct frequency of 5. For the interval occurring before 49.5–59.5 (that is, 39.5–49.5), the midpoint, 44.5, is represented by a point showing a frequency of 0, since we do not have any values in that range. The same idea applies to the last interval of 99.5–109.5, which has a midpoint of 104.5 and correctly shows a point representing a frequency of 0. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.
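
The points of the frequency polygon can be computed directly from the frequency table. The following sketch finds the interval midpoints, adds one empty interval on each side, and rebuilds the cumulative frequency column; the names xs, ys, and cumulative are illustrative.

```python
from itertools import accumulate

# Intervals and frequencies from Table 2.17.
intervals = [(49.5, 59.5), (59.5, 69.5), (69.5, 79.5),
             (79.5, 89.5), (89.5, 99.5)]
frequencies = [5, 10, 30, 40, 15]
width = 10

midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Add an empty interval on each side so the polygon starts and ends at 0.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))

for x, y in zip(xs, ys):
    print(f"midpoint {x}: frequency {y}")
print("cumulative frequencies:", cumulative)
```

Plotting the (x, y) pairs in order and joining consecutive points with line segments produces the polygon.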

Try It 2.12
Construct a frequency polygon of U.S. presidents' ages at inauguration shown in Table 2.18.

Age at Inauguration Frequency
41.5–46.5 4
46.5–51.5 11
51.5–56.5 14
56.5–61.5 9
61.5–66.5 4
66.5–71.5 2

Table 2.18

Frequency polygons are useful for comparing distributions. This comparison is achieved by overlaying the frequency polygons drawn for different data sets.

Example 2.13
We will construct an overlay frequency polygon comparing the scores from Example 2.12 with the students' final numeric grades.

Frequency Distribution for Calculus Final Test Scores
Lower Bound Upper Bound Frequency Cumulative Frequency
49.5 59.5 5 5
59.5 69.5 10 15
69.5 79.5 30 45
79.5 89.5 40 85
89.5 99.5 15 100

Table 2.19

Frequency Distribution for Calculus Final Grades
Lower Bound Upper Bound Frequency Cumulative Frequency
49.5 59.5 10 10
59.5 69.5 10 20
69.5 79.5 30 50
79.5 89.5 45 95
89.5 99.5 5 100

Table 2.20
This is an overlay frequency polygon that matches the supplied data. The x-axis shows the grades, and the y-axis shows the frequency.
Figure 2.9

Suppose that we want to study the temperature range of a region for an entire month. Every day at noon, we note the temperature and write this down in a log. A variety of statistical studies could be done with these data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.

One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we don't have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.


Constructing a Time Series Graph

To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By using the axes in that way, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.

Example 2.14
The following data show the Consumer Price Index for each month over 10 years, along with the annual value for each year. Construct a time series graph for the Annual Consumer Price Index data only.

Year Jan Feb Mar Apr May Jun Jul
2003 181.7 183.1 184.2 183.8 183.5 183.7 183.9
2004 185.2 186.2 187.4 188.0 189.1 189.7 189.4
2005 190.7 191.8 193.3 194.6 194.4 194.5 195.4
2006 198.3 198.7 199.8 201.5 202.5 202.9 203.5
2007 202.416 203.499 205.352 206.686 207.949 208.352 208.299
2008 211.080 211.693 213.528 214.823 216.632 218.815 219.964
2009 211.143 212.193 212.709 213.240 213.856 215.693 215.351
2010 216.687 216.741 217.631 218.009 218.178 217.965 218.011
2011 220.223 221.309 223.467 224.906 225.964 225.722 225.922
2012 226.665 227.663 229.392 230.085 229.815 229.478 229.104

Table 2.21

Year Aug Sep Oct Nov Dec Annual
2003 184.6 185.2 185.0 184.5 184.3 184.0
2004 189.5 189.9 190.9 191.0 190.3 188.9
2005 196.4 198.8 199.2 197.6 196.8 195.3
2006 203.9 202.9 201.8 201.5 201.8 201.6
2007 207.917 208.490 208.936 210.177 210.036 207.342
2008 219.086 218.783 216.573 212.425 210.228 215.303
2009 215.834 215.969 216.177 216.330 215.949 214.537
2010 218.312 218.439 218.711 218.803 219.179 218.056
2011 226.545 226.889 226.421 226.230 225.672 224.939
2012 230.379 231.407 231.317 230.221 229.601 229.594

Table 2.22

Solution
This is a time series graph that matches the supplied data. The x-axis shows years from 2003 to 2012, and the y-axis shows the annual CPI.
Figure 2.10 The annual amounts are plotted for each year. Then, consecutive points are connected with a line.
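
The structure of a time series graph, points in chronological order joined by segments, can be sketched without any plotting library. The code below pairs each year with its annual value from Table 2.22 and reports whether each connecting segment rises or falls; it is a text stand-in for the graph, and the names points and segments are illustrative.

```python
# Annual Consumer Price Index values from Table 2.22.
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}

# Impose chronological order, then pair each point with the next one.
points = sorted(annual_cpi.items())
segments = list(zip(points, points[1:]))

for (year0, value0), (year1, value1) in segments:
    direction = "rises" if value1 > value0 else "falls"
    print(f"{year0} to {year1}: {direction} from {value0} to {value1}")

# Years in which the annual value dropped below the previous year's value.
falling_years = [year1 for (_, v0), (year1, v1) in segments if v1 < v0]
print("Years with a decline:", falling_years)
```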

Try It 2.14
The following table is a portion of a data set from a banking website. Use the table to construct a time series graph for CO2 emissions for the United States.

CO2 Emissions
Year Ukraine United Kingdom United States
2003 352,259 540,640 5,681,664
2004 343,121 540,409 5,790,761
2005 339,029 541,990 5,826,394
2006 327,797 542,045 5,737,615
2007 328,357 528,631 5,828,697
2008 323,657 522,247 5,656,839
2009 272,176 474,579 5,299,563

Table 2.23

Uses of a Time Series Graph

Time series graphs are important tools in various applications of statistics. When a researcher records values of the same variable over an extended period of time, it is sometimes difficult for him or her to discern any trend or pattern. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.

Measures of the Location of the Data

The common measures of location are quartiles and percentiles.

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, you must order the data from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. Recall that a percent means one-hundredth. So, percentiles mean the data is divided into 100 sections. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90 percent on a test. It means that 90 percent of test scores are the same as or less than your score and that 10 percent of the test scores are the same as or greater than your test score.
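
The "same as or less than" reading of a percentile can be illustrated with a short Python sketch. It reuses the exam scores from Example 2.1 purely as sample data, and percent_at_or_below is a hypothetical helper, not a standard library function.

```python
# Exam scores from Example 2.1, reused here as sample data.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]


def percent_at_or_below(values, x):
    """Percentage of values that are the same as or less than x."""
    return 100 * sum(v <= x for v in values) / len(values)


for score in (55, 88, 100):
    share = percent_at_or_below(scores, score)
    print(f"A score of {score}: {share:.1f} percent of scores are at or below it")
```

A student's percentile therefore depends on how the other scores are distributed, not on the percentage of questions answered correctly.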

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90 percent of the test scores are less, and not the same or less, than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the center of the data. You can think of the median as the middle value, but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data:

1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1

Ordered from smallest to largest:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5

When a data set has an even number of data values, the median is equal to the average of the two middle values when the data are arranged in ascending order (least to greatest). When a data set has an odd number of data values, the median is equal to the middle value when the data are arranged in ascending order.

Since there are 14 observations (an even number of data values), the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

\dfrac{6.8+7.2}{2}=7

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
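
The median rule for even and odd numbers of values translates directly into Python. The function below is a minimal sketch written for illustration; Python's standard library also provides statistics.median, which follows the same rule.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
print("Ordered:", sorted(data))
print("Median:", median(data))
```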

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median, or second quartile. The first quartile, Q1, is the middle value, or median, of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5

The data set has an even number of values (14 data values), so the median will be the average of the two middle values (the average of 6.8 and 7.2), which is calculated as \dfrac{6.8+7.2}{2} and equals 7.

So, the median, or second quartile (Q2), is 7.

The first quartile is the median of the lower half of the data, so if we divide the data into seven values in the lower half and seven values in the upper half, we can see that we have an odd number of values in the lower half. Thus, the median of the lower half, or the first quartile (Q1) will be the middle value, or 2. Using the same procedure, we can see that the median of the upper half, or the third quartile (Q3) will be the middle value of the upper half, or 9.

The quartiles are illustrated below:
Number line showing the ordered data 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, and 11.5, with Q1 = 2, the median = 7, and Q3 = 9 marked.
Figure 2.11

The interquartile range (IQR) is a number that indicates the spread of the middle half, or the middle 50 percent, of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

IQR = Q_3 – Q_1. The IQR for this data set is calculated as 9 minus 2, or 7.
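A minimal Python sketch of the median-of-halves method described above follows. The function name `quartiles_by_halves` is hypothetical, chosen for this illustration. Other tools, such as `statistics.quantiles` or NumPy's percentile functions, use interpolation conventions that can give slightly different quartiles for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the text."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For this data set the sketch should reproduce the values found by hand: Q1 = 2, a median of 7, Q3 = 9, and an IQR of 7.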

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile. Potential outliers always require further investigation.

NOTE
A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality, or they may be a key to understanding the data.

Example 2.15
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.

389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000

Solution
Order the following data from smallest to largest:

114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000

M = 488,800

Q_1 = \dfrac{230,500 + 387,000}{2} = 308,750

Q_3 = \dfrac{639,000 + 659,000}{2} = 649,000

IQR = 649,000 – 308,750 = 340,250

(1.5)(IQR) = (1.5)(340,250) = 510,375

Q_1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625

Q_3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
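The steps of this solution can be sketched in Python as follows. The names `lower_fence` and `upper_fence` are labels chosen for this illustration, not terms from the text, and the quartiles are computed with the median-of-halves method used in this section.

```python
from statistics import median

prices = [389_950, 230_500, 158_000, 479_000, 639_000, 114_950, 5_500_000,
          387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000]

ordered = sorted(prices)
n = len(ordered)                        # 13 values, so the median belongs to neither half
q1 = median(ordered[: n // 2])          # median of the six smallest prices
q3 = median(ordered[n // 2 + n % 2:])   # median of the six largest prices
iqr = q3 - q1

# Values beyond these limits are potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [p for p in ordered if p < lower_fence or p > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence, potential_outliers)
```

A flagged value is only a candidate for investigation; the code does not decide whether it is an error or a genuine observation.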



Try It 2.15
For the 11 salaries, calculate the IQR and determine if any salaries are outliers. The following salaries are in dollars.

$33,000; $64,500; $28,000; $54,000; $72,000; $68,500; $69,000; $42,000; $54,000; $120,000; $40,500

In the example above, you just saw the calculation of the median, first quartile, and third quartile. These three values are part of the five number summary. The other two values are the minimum value (or min) and the maximum value (or max). The five number summary is used to create a box plot.

Try It 2.15
Find the interquartile range for the following two data sets and compare them.

Test Scores for Class A:
69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94
Test Scores for Class B:
90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100


Example 2.16
Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were as follows:

Amount of Sleep per School Night (Hours) Frequency Relative Frequency Cumulative Relative Frequency
4 2 .04 .04
5 5 .10 .14
6 7 .14 .28
7 12 .24 .52
8 14 .28 .80
9 7 .14 .94
10 3 .06 1.00

Table 2.24

Find the 28th percentile. Notice the .28 in the Cumulative Relative Frequency column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median. Look again at the Cumulative Relative Frequency column and find .52. The median is the 50th percentile or the second quartile. Fifty percent of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and 11 of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can eyeball this answer. If you look at the Cumulative Relative Frequency column, you find .52 and .80. When you have all the fours, fives, sixes, and sevens, you have 52 percent of the data. When you include all the 8s, you have 80 percent of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75 percent of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. There are 37 values below the third quartile and 12 values above.

Try It 2.16
Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.

Amount of Time Spent on Route (Hours) Frequency Relative Frequency Cumulative Relative Frequency
2 12 .30 .30
3 14 .35 .65
4 10 .25 .90
5 4 .10 1.00

Table 2.25

Example 2.17
Using Table 2.24:

  1. Find the 80th percentile.
  2. Find the 90th percentile.
  3. Find the first quartile. What is another name for the first quartile?
Solution
Using the data from the frequency table, we have the following:

  1. The 80th percentile is between the last eight and the first nine in the table (between the 40th and 41st values). Therefore, we need to take the mean of the 40th and 41st values. The 80th percentile = \dfrac{8+9}{2}=8.5.
  2. The 90th percentile will be the 45th data value (location is 0.90(50) = 45), and the 45th data value is nine.
  3. Q1 is also the 25th percentile. The 25th percentile location calculation: P_{25} = .25(50) = 12.5 ≈ 13, the 13th data value. Thus, the 25th percentile is six.
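The percentile locations used in Examples 2.16 and 2.17 can be sketched in Python. The rule coded here is inferred from the worked examples rather than stated as a formula in the text: compute the location as the percentage times n; if it is a whole number, average that value and the next one; otherwise round the location up. The function name `percentile_by_location` is hypothetical.

```python
import math

# Table 2.24: hours of sleep -> frequency
sleep_frequency = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

# Expand the table into the 50 ordered data values it summarizes.
ordered = [hours for hours, count in sorted(sleep_frequency.items())
           for _ in range(count)]

def percentile_by_location(ordered, p):
    """Find the p-th percentile (0 < p < 100) from the location p/100 * n."""
    if not 0 < p < 100:
        raise ValueError("this sketch only handles 0 < p < 100")
    n = len(ordered)
    location = p * n / 100
    if location.is_integer():
        i = int(location)
        # The percentile falls between the i-th and (i+1)-th values.
        return (ordered[i - 1] + ordered[i]) / 2
    # Otherwise round the location up to the next whole position.
    return ordered[math.ceil(location) - 1]

for p in (25, 28, 50, 75, 80, 90):
    print(p, percentile_by_location(ordered, p))
```

For the 90th percentile, the 45th and 46th values are both nine, so averaging them gives the same answer as reading off the 45th value.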


Try It 2.17
Refer to Table 2.25. Find the third quartile. What is another name for the third quartile?

Collaborative Exercise
Your instructor or a member of the class will ask everyone in class how many sweaters he or she owns. Answer the following questions:

  1. How many students were surveyed?
  2. What kind of sampling did you do?
  3. Construct two different histograms. For each, starting value = ________ and ending value = ________.
  4. Find the median, first quartile, and third quartile.
  5. Construct a table of the data to find the following:
    1. The 10th percentile
    2. The 70th percentile
    3. The percentage of students who own fewer than four sweaters

A Formula for Finding the kth Percentile

If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.

k = the kth percentile. It may or may not be part of the data.

i = the index (ranking or position of a data value)

n = the total number of data

  • Order the data from smallest to largest.
  • Calculate i=\dfrac{k}{100}(n+1).
  • If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.
  • If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set. The formula and calculation are easier to understand in an example.

Example 2.18
Listed are 29 ages for Academy Award-winning best actors in order from smallest to largest:

18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77

  1. Find the 70th percentile.
  2. Find the 83rd percentile.
Solution


  1. k = 70
    i = the index
    n = 29

    i = \dfrac{k}{100} (n + 1) = (\dfrac{70}{100})(29 + 1) = 21. This equation tells us that i, or the position of the data value in the data set, is 21. So, we will count over to the 21st position, which shows a data value of 64.

  2. k = 83
    i = the index
    n = 29

    i = \dfrac{k}{100} (n + 1) = (\dfrac{83}{100})(29 + 1) = 24.9, which is not an integer. Round it down to 24 and up to 25. The age in the 24th position is 71, and the age in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.
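The formula i = \dfrac{k}{100}(n + 1) can be sketched in Python as shown here. The function name `kth_percentile` is hypothetical, and the error raised when the index falls outside the data is an assumption, since the text does not cover that case.

```python
import math

def kth_percentile(ordered, k):
    """k-th percentile of already-sorted data using i = k/100 * (n + 1)."""
    n = len(ordered)
    # Multiplying before dividing keeps whole-number indexes exact
    # when k and n are integers.
    i = k * (n + 1) / 100
    if not 1 <= i <= n:
        raise ValueError("index falls outside the data")
    if i.is_integer():
        return ordered[int(i) - 1]      # positions in the text are 1-based
    below = ordered[math.floor(i) - 1]  # value at i rounded down
    above = ordered[math.ceil(i) - 1]   # value at i rounded up
    return (below + above) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(kth_percentile(ages, 70))
print(kth_percentile(ages, 83))
```

Because several percentile formulas are in common use, library functions may return slightly different values from this one.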

Try It 2.18
Listed are 29 ages for Academy Award-winning best actors in order from smallest to largest:

18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77

Calculate the 20th percentile and the 55th percentile.


NOTE
You can calculate percentiles using calculators and computers. There are a variety of online calculators.


A Formula for Finding the Percentile of a Value in a Data Set

  • Order the data from smallest to largest.
  • x = the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
  • y = the number of data values equal to the data value for which you want to find the percentile.
  • n = the total number of data.
  • Calculate \dfrac{x+.5y}{n} (100). Then round to the nearest integer.

Example 2.19
Listed are 29 ages for Academy Award-winning best actors in order from smallest to largest:

18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77

  1. Find the percentile for 58.
  2. Find the percentile for 25.
Solution
  1. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58.
    x = 18 and y = 1. \dfrac{x+.5y}{n}(100) = \dfrac{18+.5(1)}{29}(100) ≈ 63.79. Fifty-eight is the 64th percentile.

  2. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25.
    x = 3 and y = 1. \dfrac{x+.5y}{n}(100) = \dfrac{3+.5(1)}{29}(100) ≈ 12.07. Twenty-five is the 12th percentile.
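The formula \dfrac{x+.5y}{n}(100) translates directly into a short Python sketch. The function name `percentile_of_value` is hypothetical. Note that Python's built-in `round` sends exact halves to the nearest even integer, which may differ from the rounding convention you use by hand.

```python
def percentile_of_value(ordered, value):
    """Percentile rank of `value` using (x + 0.5*y) / n * 100."""
    x = sum(1 for v in ordered if v < value)   # data values below `value`
    y = sum(1 for v in ordered if v == value)  # data values equal to `value`
    n = len(ordered)
    return round((x + 0.5 * y) / n * 100)

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile_of_value(ages, 58))
print(percentile_of_value(ages, 25))
```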

Try It 2.19
Listed are 30 ages for Academy Award-winning best actors in order from smallest to largest:

18, 21, 22, 25, 26, 27, 29, 30, 31, 31, 33, 36, 37, 41, 42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77
Find the percentiles for 47 and 31.


Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. Percentages of data values are less than or equal to the pth percentile. For example, 15 percent of data values are less than or equal to the 15th percentile.

  • Low percentiles always correspond to lower data values.
  • High percentiles always correspond to higher data values.
A percentile may or may not correspond to a value judgment about whether it is good or bad. The interpretation of whether a certain percentile is good or bad depends on the context of the situation to which the data apply. In some situations, a low percentile would be considered good; in other contexts a high percentile might be considered good. In many situations, there is no value judgment that applies. A high percentile on a standardized test is considered good, while a lower percentile on body mass index might be considered good. A percentile associated with a person's height doesn't carry any value judgment.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

Guideline

When writing the interpretation of a percentile in the context of the given data, make sure the sentence contains the following information:
  • Information about the context of the situation being considered
  • The data value (value of the variable) that represents the percentile
  • The percentage of individuals or items with data values below the percentile
  • The percentage of individuals or items with data values above the percentile

Example 2.20
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

Solution
  • Twenty-five percent of students finished the exam in 35 minutes or less.
  • Seventy-five percent of students finished the exam in 35 minutes or more.
  • A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. If you take too long, you might not be able to finish.

Try It 2.20
For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.

Example 2.21
On a 20-question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

Solution
  • Seventy percent of students answered 16 or fewer questions correctly.
  • Thirty percent of students answered 16 or more questions correctly.
  • A higher percentile could be considered good, as answering more questions correctly is desirable.


Try It 2.21
On a 60-point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.


Example 2.22
At a high school, it was found that the 30th percentile of number of hours that students spend studying per week is seven hours. Interpret the 30th percentile in the context of this situation.

Solution
  • Thirty percent of students study seven or fewer hours per week.
  • Seventy percent of students study seven or more hours per week.
  • In this situation, a higher or lower percentile does not carry an obvious value judgment of good or bad; the amount of time a student spends studying depends on that student's circumstances.

Try It 2.22
During a season, the 40th percentile for points scored per player in a game is eight. Interpret the 40th percentile in the context of this situation.


Example 2.23
A middle school is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown:

0 minutes, 40 minutes, 60 minutes, 30 minutes, 60 minutes,

10 minutes, 45 minutes, 30 minutes, 300 minutes, 90 minutes,

30 minutes, 120 minutes, 60 minutes, 0 minutes, 20 minutes

Find the five values that make up the five number summary.

Min = 0

Q1 = 20

Med = 40

Q3 = 60

Max = 300

Listing the data in ascending order gives the following:

Number line showing the ordered data: 0, 0, 10, 20, 30, 30, 30, 40, 45, 60, 60, 60, 90, 120, 300
Figure 2.12

The minimum value is 0.

The maximum value is 300.

Since there are an odd number of data values, the median is the middle value of this data set as it is arranged in ascending order, or 40.

The first quartile is the median of the lower half of the scores and does not include the median. The lower half has seven data values; the median of the lower half will equal the middle value of the lower half, or 20.

The third quartile is the median of the upper half of the scores and does not include the median. The upper half also has seven data values; so the median of the upper half will equal the middle value of the upper half, or 60.

If you were the principal, would you be justified in purchasing new fitness equipment? Since 75 percent of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.

However, the principal needs to be careful. The value 300 appears to be a potential outlier.

Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120.

The value 300 is greater than 120, so it is a potential outlier. If we delete it and recalculate the five values from the remaining 14 data values, we get the following:
  • Min = 0
  • Q1 = 20
  • Med = 35
  • Q3 = 60
  • Max = 120
We still have 75 percent of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample, and the principal should survey more students to be sure of his survey results.
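To check the effect of the potential outlier, the five number summary can be computed with and without the value 300. This is a minimal sketch using the median-of-halves method from this section; the function name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # lower half, median excluded if n is odd
    upper = ordered[n // 2 + n % 2:]     # upper half, median excluded if n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

with_outlier = five_number_summary(minutes)
without_outlier = five_number_summary([m for m in minutes if m != 300])

print(with_outlier)
print(without_outlier)
```

Removing 300 changes the maximum and shifts the median slightly, while Q1 and Q3 stay the same, which illustrates how little the quartiles depend on a single extreme value.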

Box Plots

Box plots, also called box-and-whisker plots or box-whisker plots, give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data. As mentioned previously, a box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We use these values to compare how close other data values are to them.

To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box, and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The whiskers extend from the ends of the box to the smallest and largest data values. A box plot easily shows the range of a data set, which is the difference between the largest and smallest data values (or the difference between the maximum and minimum). Unless the median, first quartile, and third quartile are the same value, the median will lie inside the box or between the first and third quartiles. The box plot gives a good, quick picture of the data.


NOTE

You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers are not extending to the minimum and maximum values.

Consider, again, this data set:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5

The first quartile is two, the median is seven, and the third quartile is nine. The smallest value is one, and the largest value is 11.5. The following image shows the constructed box plot.


NOTE

See the calculator instructions on the TI website or in the appendix.

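Horizontal box plot: the first whisker extends from the smallest value, 1, to the first quartile, 2; the box runs from the first quartile, 2, to the third quartile, 9, with the median, 7, marked inside; the second whisker extends from 9 to the largest value, 11.5.
Figure 2.13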


The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line.

NOTE
It is important to start a box plot with a scaled number line. Otherwise, the box plot may not be useful.


Example 2.24

The following data are the heights of 40 students in a statistics class:

59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77.

Construct a box plot with the following properties. Calculator instructions for finding the five number summary follow this example:

  • Minimum value = 59
  • Maximum value = 77
  • Q1: First quartile = 64.5
  • Q2: Second quartile or median = 66
  • Q3: Third quartile = 70

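Horizontal box plot with the first whisker extending from the smallest value, 59, to Q1, 64.5; the box running from Q1 to Q3, 70, with the median, 66, marked inside; and the second whisker extending from Q3 to the largest value, 77.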

Figure 2.14

  1. Each quarter has approximately 25 percent of the data.
  2. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (third quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread, and the fourth quarter has the largest spread.
  3. Range = maximum value – minimum value = 77 – 59 = 18.
  4. Interquartile Range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.
  5. The interval 59–65 has more than 25 percent of the data, so it has more data in it than the interval 66–70, which has 25 percent of the data.
  6. The middle 50 percent (middle half) of the data has a range of 5.5 inches.
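The arithmetic in items 2 through 4 can be reproduced from the five number summary alone. The following Python sketch uses the values stated in this example; the dictionary keys and variable names are assumptions made for illustration.

```python
# Five number summary stated in Example 2.24 (heights in inches).
minimum, q1, med, q3, maximum = 59, 64.5, 66, 70, 77

# Spread of each quarter of the data.
quarter_spreads = {
    "first": q1 - minimum,
    "second": med - q1,
    "third": q3 - med,
    "fourth": maximum - q3,
}

data_range = maximum - minimum
iqr = q3 - q1

narrowest = min(quarter_spreads, key=quarter_spreads.get)
widest = max(quarter_spreads, key=quarter_spreads.get)

print(quarter_spreads)
print(data_range, iqr, narrowest, widest)
```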


Using the TI-83, 83+, 84, 84+ Calculator

To find the minimum, maximum, and quartiles:

Enter data into the list editor (Press STAT 1:EDIT). If you need to clear the list, arrow up to the name L1, press CLEAR, and then arrow down.

Put the data values into the list L1.

Press STAT and arrow to CALC. Press 1:1-VarStats. Enter L1.

Press ENTER.

Use the down and up arrow keys to scroll.

Smallest value = 59.

Largest value = 77.

Q1: First quartile = 64.5.

Q2: Second quartile or median = 66.

Q3: Third quartile = 70.

To construct the box plot:

Press 4:Plotsoff. Press ENTER.

Arrow down and then use the right arrow key to go to the fifth picture, which is the box plot. Press ENTER.

Arrow down to Xlist: Press 2nd 1 for L1.

Arrow down to Freq: Press ALPHA. Press 1.

Press Zoom. Press 9: ZoomStat.

Press TRACE and use the arrow keys to examine the box plot.


Try It 2.24

The following data are the number of pages in 40 books on a shelf. Construct a box plot using a graphing calculator and state the interquartile range.

136, 140, 178, 190, 205, 215, 217, 218, 232, 234, 240, 255, 270, 275, 290, 301, 303, 315, 317, 318, 326, 333, 343, 349, 360, 369, 377, 388, 391, 392, 398, 400, 402, 405, 408, 422, 429, 450, 475, 512

For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both one, the median and the third quartile were both five, and the largest value was seven, the box plot would look like the following:

Horizontal box plot in which the box begins at the smallest value and Q1, 1, and ends at Q3 and the median, 5; no separate median line is drawn, and the whisker extends from 5 to the largest value, 7.

Figure 2.15

In this case, at least 25 percent of the values are equal to one. Twenty-five percent of the values are between one and five, inclusive. At least 25 percent of the values are equal to five. The top 25 percent of the values fall between five and seven, inclusive.


Example 2.25

Test scores for Mr. Ramirez's class held during the day are as follows:

99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90.

Test scores for Ms. Park's class held during the evening are as follows:

98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5.

  1. Find the smallest and largest values, the median, and the first and third quartile for Mr. Ramirez's class.
  2. Find the smallest and largest values, the median, and the first and third quartile for Ms. Park's class.
  3. For each data set, what percentage of the data is between the smallest value and the first quartile? the first quartile and the median? the median and the third quartile? the third quartile and the largest value? What percentage of the data is between the first quartile and the largest value?
  4. Create a box plot for each set of data. Use one number line for both box plots.
  5. Which box plot has the widest spread for the middle 50 percent of the data, that is, the data between the first and third quartiles? What does this mean for that set of data in comparison to the other set of data?
Solution

  1. Min = 32
    Q1 = 56
    M = 74.5
    Q3 = 82.5
    Max = 99

  2. Min = 25.5
    Q1 = 78
    M = 81
    Q3 = 89
    Max = 98

  3. Mr. Ramirez's class: There are six data values ranging from 32 to 56: 30 percent. There are six data values ranging from 56 to 74.5: 30 percent. There are five data values ranging from 74.5 to 82.5: 25 percent. There are five data values ranging from 82.5 to 99: 25 percent. There are 16 data values between the first quartile, 56, and the largest value, 99: 75 percent. Ms. Park’s class: There are six data values ranging from 25.5 to 78: 27 percent. There are five data values ranging from 78 to the first instance of 81: 23 percent. There are six data values ranging from the second instance of 81 to 89: 27 percent. There are five data values ranging from 90 to 98: 23 percent. There are 17 values between the first quartile, 78, and the largest value, 98: 77 percent.

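  4. Two box plots over a number line from 0 to 100. The top plot (Mr. Ramirez's class) shows a whisker from 32 to 56, a box from 56 to 82.5 with the median at 74.5, and a whisker from 82.5 to 99. The bottom plot (Ms. Park's class) shows a whisker from 25.5 to 78, a box from 78 to 89 with the median at 81, and a whisker from 89 to 98.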

    Figure 2.16
  5. The first data set has the wider spread for the middle 50 percent of the data. The IQR for the first data set is greater than the IQR for the second set. This means that there is more variability in the middle 50 percent of the first data set.


Try It 2.25

The following data set shows the heights in inches for the boys in a class of 40 students:

66, 66, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 71, 72, 72, 72, 73, 73, 74.
The following data set shows the heights in inches for the girls in a class of 40 students:

61, 61, 62, 62, 63, 63, 63, 65, 65, 65, 66, 66, 66, 67, 68, 68, 68, 69, 69, 69.
Construct a box plot using a graphing calculator for each data set, and state which box plot has the wider spread for the middle 50 percent of the data.


Example 2.26

Graph a box-and-whisker plot for the following data values shown:

10, 10, 10, 15, 35, 75, 90, 95, 100, 175, 420, 490, 515, 515, 790

The five numbers used to create a box-and-whisker plot are as follows:

  • Min: 10
  • Q1: 15
  • Med: 95
  • Q3: 490
  • Max: 790

The following graph shows the box-and-whisker plot.

Box-and-whisker plot showing the values 10, 15, 95, 490, and 790 on a number line: whiskers from 10 to 15 and from 490 to 790, and a box from 15 to 490 with the median marked at 95.

Figure 2.17


Try It 2.26

Follow the steps you used to graph a box-and-whisker plot for the data values shown:

0, 5, 5, 15, 30, 30, 45, 50, 50, 60, 75, 110, 140, 240, 330

Measures of the Center of the Data

The center of a data set is also a way of describing location. The two most widely used measures of the center of the data are the mean (average) and the median. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.


NOTE

The words mean and average are often used interchangeably. The substitution of one word for the other is common practice. The technical term is arithmetic mean, and average is technically a center location. However, in practice among nonstatisticians, average is commonly accepted for arithmetic mean.

When each value in the data set is not unique, the mean can be calculated by multiplying each distinct value by its frequency and then dividing the sum by the total number of data values. The letter used to represent the sample mean is an x with a bar over it (pronounced "x bar"): \overline x. The sample mean is a statistic.

The Greek letter μ (pronounced "mew") represents the population mean. The population mean is a parameter. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.

To see that both ways of calculating the mean are the same, consider the following sample:

1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4

\overline x=\dfrac{1+1+1+2+2+3+4+4+4+4+4}{11}=2.7

\overline x=\dfrac{3(1)+2(2)+1(3)+5(4)}{11}=2.7.

In the second calculation, the frequencies are 3, 2, 1, and 5: the value 1 occurs three times, 2 occurs twice, 3 occurs once, and 4 occurs five times.
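Both ways of calculating the mean can be checked with a short Python sketch. The variable names are assumptions; `collections.Counter` is used here only as a convenient way to tally the frequency of each distinct value.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Method 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Method 2: multiply each distinct value by its frequency, add the
# products, and divide by the total number of data values.
frequencies = Counter(sample)  # value -> number of occurrences
mean_weighted = (sum(value * freq for value, freq in frequencies.items())
                 / sum(frequencies.values()))

print(dict(frequencies))
print(round(mean_direct, 1), round(mean_weighted, 1))
```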

You can quickly find the location of the median by using the expression \dfrac{n+1}{2}.

The letter n is the total number of data values in the sample. As discussed earlier, if n is an odd number, the median is the middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal to the two middle values added together and divided by two after the data have been ordered. For example, if the total number of data values is 97, then \dfrac{n+1}{2} = \dfrac{97+1}{2} = 49. The median is the 49th value in the ordered data. If the total number of data values is 100, then \dfrac{n+1}{2}=\dfrac{100+1}{2}= 50.5. The median occurs midway between the 50th and 51st values. The location of the median and the value of the median are not the same. The uppercase letter M is often used to represent the median. The next example illustrates the location of the median and the value of the median.


Example 2.27

Data indicating the number of months a patient with a specific disease lives after taking a new antibody drug are as follows (smallest to largest):

3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47

Calculate the mean and the median.

Solution

The calculation for the mean is
\overline x=[3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+(17)(2)+18+21+(22)(2)+(24)(2)+25+(26)(2)+(27)(2)+(29)(2)+31+32+(33)(2)+(34)(2)+35+37+40+(44)(2)+47]/40=23.6

To find the median, M, first use the formula for the location. The location is

\dfrac{n+1}{2}=\dfrac{40+1}{2}=20.5.


Start from the smallest value and count up; the median is located between the 20th and 21st values (the two 24s):

3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47

M=\dfrac{24+24}{2}=24
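The difference between the location of the median and its value can be seen in a short Python sketch. The variable names are assumptions, and the `statistics` functions serve as a cross-check on the hand calculation.

```python
import math
from statistics import mean, median

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22,
          22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35,
          37, 40, 44, 44, 47]

n = len(months)
location = (n + 1) / 2  # position of the median, not its value

# A location such as 20.5 means the median lies midway between the
# 20th and 21st ordered values (list positions are 1-based in the text).
lower_pos, upper_pos = math.floor(location), math.ceil(location)
median_by_location = (months[lower_pos - 1] + months[upper_pos - 1]) / 2

print(n, location, median_by_location)
print(mean(months), median(months))
```

When n is odd, the location is a whole number, both positions coincide, and the average is simply the middle value.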


Using the TI-83, 83+, 84, 84+ Calculator


To find the mean and the median:

Clear list L1. Press STAT 4:ClrList. Enter 2nd 1 for list L1. Press ENTER.

Enter data into the list editor. Press STAT 1:EDIT.

Put the data values into list L1.

Press STAT and arrow to CALC. Press 1:1-VarStats. Press 2nd 1 for L1 and then ENTER.

Press the down and up arrow keys to scroll.

\overline x = 23.6, M = 24


Try It 2.27

The following data show the number of months patients typically wait on a transplant list before getting surgery. The data are ordered from smallest to largest. Calculate the mean and median.

3, 4, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14, 15, 15, 17, 17, 18, 19, 19, 19, 21, 21, 22, 22, 23, 24, 24, 24, 24


Example 2.28

Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the center: the mean or the median?

Solution

\overline x=\dfrac{5,000,000+49(30,000)}{50}=129,400

M = 30,000

There are 49 people who earn $30,000 and one person who earns $5,000,000.

The median is a better measure of the center than the mean because 49 of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
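A few lines of Python show how strongly the single large income pulls the mean while leaving the median unchanged. The variable names are assumptions made for this illustration.

```python
from statistics import mean, median

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

town_mean = mean(incomes)
town_median = median(incomes)

# The same measures with the extreme value left out, for comparison.
others = [income for income in incomes if income != 5_000_000]
mean_without_outlier = mean(others)

print(town_mean, town_median, mean_without_outlier)
```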


Try It 2.28

In a sample of 60 households, one house is worth $2,500,000. Half of the rest are worth $280,000, and all the others are worth $315,000. Which is the better measure of the center: the mean or the median?

Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.


Example 2.29

Statistics exam scores for 20 students are as follows:

50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93

Find the mode.

Solution

The most frequent score is 72, which occurs five times. Mode = 72.


Try It 2.29

The number of books checked out from the library by 25 students are as follows:

0, 0, 0, 1, 2, 3, 3, 4, 4, 5, 5, 7, 7, 7, 7, 8, 8, 8, 9, 10, 10, 11, 11, 12, 12
Find the mode.


Example 2.30

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.

When is the mode the best measure of the center? Consider a weight loss program that advertises a mean weight loss of six pounds the first week of the program. The mode might indicate that most people lose two pounds the first week, making the program less appealing.

NOTE

The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data set is red, red, red, green, green, yellow, purple, black, blue, the mode is red.

Statistical software will easily calculate the mean, the median, and the mode. Some graphing calculators can also make these calculations. In the real world, people make these calculations using software.
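As one example of such software, Python's standard library can find the mode directly. The sketch below uses `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency, so it handles bimodal and qualitative data as well. The variable names are assumptions.

```python
from statistics import multimode

exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81,
               83, 84, 84, 84, 90, 93]                 # Example 2.29
real_estate_scores = [430, 430, 480, 480, 495]         # Example 2.30
colors = ["red", "red", "red", "green", "green",
          "yellow", "purple", "black", "blue"]          # qualitative data

print(multimode(exam_scores))
print(multimode(real_estate_scores))
print(multimode(colors))
```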


Try It 2.30

Five credit scores are 680, 680, 700, 720, 720. The data set is bimodal because the scores 680 and 720 each occur twice. Consider the annual earnings of workers at a factory. The mode is $25,000 and occurs 150 times out of 301. The median is $50,000, and the mean is $47,500. What would be the best measure of the center?


The Law of Large Numbers and the Mean

The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean \overline x of the sample is very likely to get closer and closer to µ. This law is discussed in more detail later in the text.
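
The idea can be illustrated with a small simulation. The sketch below is not from the text: it assumes a hypothetical population consisting of rolls of a fair six-sided die, whose population mean is 3.5, and it fixes a random seed only so that the run is repeatable. The sample mean of a large sample tends to land much closer to 3.5 than the sample mean of a small one, although any single run can vary.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so repeated runs give the same rolls

# Assumed population: rolls of a fair die, so mu = (1 + 2 + ... + 6) / 6 = 3.5
population_mean = 3.5

sample_means = {}
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    sample_means[n] = fmean(rolls)
    gap = abs(sample_means[n] - population_mean)
    print(f"n = {n:>7}: sample mean = {sample_means[n]:.4f}, distance from mu = {gap:.4f}")
```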


Sampling Distributions and Statistic of a Sampling Distribution

You can think of a sampling distribution as a relative frequency distribution with a great many samples. See Chapter 1: Sampling and Data for a review of relative frequency. Suppose 30 randomly selected students were asked the number of movies they watched the previous week. The results are in the relative frequency table shown below.

Number of Movies Relative Frequency
0 \dfrac{5}{30}
1 \dfrac{15}{30}
2 \dfrac{6}{30}
3 \dfrac{3}{30}
4 \dfrac{1}{30}

Table 2.26


A relative frequency distribution includes the relative frequencies of a number of samples.

Recall that a statistic is a number calculated from a sample. Statistic examples include the mean, the median, and the mode as well as others. The sample mean \overline x is an example of a statistic that estimates the population mean μ.


Calculating the Mean of Grouped Frequency Tables

When only grouped data is available, you do not know the individual data values (we know only intervals and interval frequencies); therefore, you cannot compute an exact mean for the data set. What we must do is estimate the actual mean by calculating the mean of a frequency table. A frequency table is a data representation in which grouped data is displayed along with the corresponding frequencies. To calculate the mean from a grouped frequency table, we can apply the basic definition of mean: \text{mean} =\dfrac{\text{data sum}}{\text{number of data values}}. We simply need to modify the definition to fit within the restrictions of a frequency table.

Since we do not know the individual data values, we can instead find the midpoint of each interval. The midpoint is \dfrac{\text{lower boundary+upper boundary}}{2}. We can now modify the mean definition to be \text{Mean of Frequency Table}=\dfrac{∑fm}{∑f}, where f = the frequency of the interval, m = the midpoint of the interval, and the symbol ∑ is read as "sigma" and means to sum up. So this formula says that we will sum the products of each midpoint and the corresponding frequency and divide by the sum of all of the frequencies.


Example 2.31

A frequency table displaying the results of Professor Blount's last statistics test is shown. Find the best estimate of the class mean.

Grade Interval Number of Students
50–56.5 1
56.5–62.5 0
62.5–68.5 4
68.5–74.5 4
74.5–80.5 2
80.5–86.5 3
86.5–92.5 4
92.5–98.5 1

Table 2.27

Solution
  • Find the midpoints for all intervals.
Grade Interval Midpoint
50–56.5 53.25
56.5–62.5 59.5
62.5–68.5 65.5
68.5–74.5 71.5
74.5–80.5 77.5
80.5–86.5 83.5
86.5–92.5 89.5
92.5–98.5 95.5

Table 2.28

  • Calculate the sum of the product of each interval frequency and midpoint: ∑fm = 53.25(1)+59.5(0)+65.5(4)+71.5(4)+77.5(2)+83.5(3)+89.5(4)+95.5(1)=1460.25
  • μ=\dfrac{∑fm}{∑f}=\dfrac{1460.25}{19}=76.86
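
The same steps can be sketched in Python. This is a minimal illustration that uses the intervals and frequencies from Table 2.27; the variable names are chosen here for readability and are not part of the original example.

```python
# Grade intervals (lower boundary, upper boundary) and frequencies from Table 2.27
intervals = [(50, 56.5), (56.5, 62.5), (62.5, 68.5), (68.5, 74.5),
             (74.5, 80.5), (80.5, 86.5), (86.5, 92.5), (92.5, 98.5)]
frequencies = [1, 0, 4, 4, 2, 3, 4, 1]

# Step 1: midpoint of each interval = (lower boundary + upper boundary) / 2
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Step 2: sum of frequency times midpoint, and sum of frequencies
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
sum_f = sum(frequencies)

# Step 3: estimated mean of the grouped data
grouped_mean = sum_fm / sum_f

print(midpoints)
print(sum_fm, sum_f, round(grouped_mean, 2))
```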

Try It 2.31
Maris conducted a study on the effect that playing video games has on memory recall. As part of her study, she compiled the following data:

Hours Teenagers Spend on Video Games Number of Teenagers
0–3.5 3
3.5–7.5 7
7.5–11.5 12
11.5–15.5 7
15.5–19.5 9

Table 2.29

What is the best estimate for the mean number of hours spent playing video games?

Skewness and the Mean, Median, and Mode

Consider the following data set:

4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10

This data set can be represented by the following histogram. Each interval has width 1, and each value is located in the middle of an interval.

Histogram of the data set above, with 7 adjacent bars and x-axis intervals of width 1.

Figure 2.18

The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each seven for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.

The histogram for the data: 4, 5, 6, 6, 6, 7, 7, 7, 7, 8 is not symmetrical. The right-hand side seems chopped off compared to the left-hand side. A distribution of this type is called skewed to the left because it is pulled out to the left. A skewed left distribution has more high values.

Histogram of the data set above, with 5 adjacent bars and x-axis intervals of width 1.

Figure 2.19

The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so. The mean is pulled toward the tail in a skewed distribution.

The histogram for the data: 6, 7, 7, 7, 7, 8, 8, 8, 9, 10 is also not symmetrical. It is skewed to the right. A skewed right distribution has more low values.

Histogram of the data set above, with 5 adjacent bars and x-axis intervals of width 1.

Figure 2.20

The mean is 7.7, the median is 7.5, and the mode is seven. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.

To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.
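
The three data sets discussed above can be checked with Python's statistics module. The sketch below simply recomputes the mean, median, and mode for each; the dictionary labels are illustrative names for the data sets in this section.

```python
from statistics import mean, median, mode

datasets = {
    "symmetric":    [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, store (mean, median, mode).
summary = {name: (mean(values), median(values), mode(values))
           for name, values in datasets.items()}

for name, (m, med, mo) in summary.items():
    print(f"{name:>12}: mean = {m}, median = {med}, mode = {mo}")
```

In the skewed-left data the mean is the smallest of the three measures, and in the skewed-right data it is the largest, which matches the summary above.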

Skewness and symmetry become important when we discuss probability distributions in later chapters.


Example 2.32

Statistics are used to compare and sometimes identify authors. The following lists show a simple random sample that compares the letter counts for three authors.

Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2

Davis: 3, 3, 3, 4, 1, 4, 3, 2, 3, 1

Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3

  1. Make a dot plot for the three authors and compare the shapes.
  2. Calculate the mean for each.
  3. Calculate the median for each.
  4. Describe any pattern you notice between the shape and the measures of center.


Solution
  1. Dot plots for the three authors:

    Dot plot of Terry's data on a number line from 1 to 10.
    Figure 2.21 Terry's distribution has a right (positive) skew.

    Dot plot of Davis's data on a number line from 1 to 10.
    Figure 2.22 Davis's distribution has a left (negative) skew.

    Dot plot of Maris's data on a number line from 1 to 10.
    Figure 2.23 Maris's distribution is symmetrically shaped.

  2. Terry's mean is 3.7, Davis's mean is 2.7, and Maris's mean is 4.6.

  3. Terry's median is three, Davis's median is three, and Maris's median is four. It would be helpful to manually calculate these descriptive statistics, using the given data sets, and then compare them to the graphs.

  4. It appears that the median is always closest to the high point (the mode), while the mean tends to be farther out on the tail. In a symmetrical distribution, the mean and the median are both centrally located close to the high point of the distribution.


Try It 2.32

Discuss the mean, median, and mode for each of the following problems. Is there a pattern between the shape and measure of the center?

a.

This dot plot matches the supplied data. The plot uses a number line from 0 to 14. It shows two x's over 0, four x's over 1,


Figure 2.24

b.

The Ages at Which Former U.S. Presidents Died
4 6 9
5 3 6 7 7 7 8
6 0 0 3 3 4 4 5 6 7 7 7 8
7 0 1 1 2 3 4 7 8 8 9
8 0 1 3 5 8
9 0 0 3 3
Key: 8|0 means 80.

Table 2.30

c.
This is a histogram titled Hours Spent Playing Video Games on Weekends. The x-axis shows the number of hours spent playing vi
Figure 2.25

Measures of the Spread of the Data

An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.


The standard deviation

  • provides a numerical measure of the overall amount of variation in a data set and
  • can be used to determine whether a particular data value is close to or far from the mean.

The standard deviation provides a measure of the overall variation in a data set.

The standard deviation is always positive or zero. The standard deviation is small when all the data are concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

Suppose that we are studying the amount of time customers wait in line at the checkout at Supermarket A and Supermarket B. The average wait time at both supermarkets is five minutes. At Supermarket A, the standard deviation for the wait time is two minutes; at Supermarket B, the standard deviation for the wait time is four minutes.

Because Supermarket B has a higher standard deviation, we know that there is more variation in the wait times at Supermarket B. Overall, wait times at Supermarket B are more spread out from the average whereas wait times at Supermarket A are more concentrated near the average.


The standard deviation can be used to determine whether a data value is close to or far from the mean.

Suppose that both Rosa and Binh shop at Supermarket A. Rosa waits at the checkout counter for seven minutes, and Binh waits for one minute. At Supermarket A, the mean waiting time is five minutes, and the standard deviation is two minutes. A z-score is a standardized score that lets us compare data sets. It tells us how many standard deviations a data value is from the mean and is calculated as the difference between a particular score and the population mean, divided by the population standard deviation.

We can use the given information to create the table below.

Supermarket Population Standard Deviation, σ Individual Score, x Population Mean, μ
Supermarket A 2 minutes 7, 1 5
Supermarket B 4 minutes 5

Table 2.31

Since Rosa and Binh only shop at Supermarket A, we can ignore the row for Supermarket B.

We need the values from the first row to determine the number of standard deviations above or below the mean each individual wait time is; we can do so by calculating two different z-scores.

Rosa waited for seven minutes, so the z-score representing this deviation from the population mean may be calculated as

z=\dfrac{x−μ}{σ}=\dfrac{7−5}{2}=1.

The z-score of one tells us that Rosa's wait time is one standard deviation above the mean wait time of five minutes.

Binh waited for one minute, so the z-score representing this deviation from the population mean may be calculated as

z=\dfrac{x−μ}{σ}=\dfrac{1−5}{2}=−2.

The z-score of −2 tells us that Binh's wait time is two standard deviations below the mean wait time of five minutes.
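
The two z-score calculations can be reproduced with a short Python function. This is a minimal sketch; the function name z_score is an illustrative choice, not a library function.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

# Supermarket A: population mean and standard deviation of wait times, in minutes
mu, sigma = 5, 2

z_rosa = z_score(7, mu, sigma)   # Rosa waited seven minutes
z_binh = z_score(1, mu, sigma)   # Binh waited one minute

print(z_rosa, z_binh)
```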

A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average. Considering data to be far from the mean if they are more than two standard deviations away is more of an approximate rule of thumb than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is farther away than two standard deviations. You will learn more about this in later chapters.

The number line may help you understand standard deviation. If we were to put five and seven on a number line, seven is to the right of five. We say, then, that seven is one standard deviation to the right of five because 5 + (1)(2) = 7.

If one were also part of the data set, then one is two standard deviations to the left of five because 5 + (–2)(2) = 1.

This shows a number line in intervals of 1 from 0 to 7.

Figure 2.26

  • In general, a value = mean + (#ofSTDEVs)(standard deviation)
  • where #ofSTDEVs = the number of standard deviations
  • #ofSTDEVs does not need to be an integer
  • One is two standard deviations less than the mean of five because 1 = 5 + (–2)(2).

The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a population as follows:

  • Sample: x = \overline x + \text{(#ofSTDEVs)(s)}
  • Population: x = μ + \text{(#ofSTDEVs)(σ)}

The lowercase letter s represents the sample standard deviation and the Greek letter σ (lower case) represents the population standard deviation.

The symbol  \overline x is the sample mean, and the Greek symbol μ is the population mean.


Calculating the Standard Deviation

If x is a number, then the difference x – mean is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols, a deviation is x – μ. For sample data, in symbols, a deviation is x –\overline x.

The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar but not identical. Therefore, the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lowercase letter s represents the sample standard deviation and the Greek letter σ (lowercase sigma) represents the population standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate of σ.

To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the squares of the deviations (the x –\overline x values for a sample or the x – μ values for a population). The symbol σ^2 represents the population variance; the population standard deviation σ is the square root of the population variance. The symbol s^2 represents the sample variance; the sample standard deviation s is the square root of the sample variance. You can think of the standard deviation as a special average of the deviations.

If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n – 1, one less than the number of items in the sample.

Formulas for the Sample Standard Deviation
  • s=\sqrt{\dfrac{Σ(x−\overline x)^2}{n−1}}  or s=\sqrt{\dfrac{Σf(x−\overline x)^2}{n−1}}
  • For the sample standard deviation, the denominator is n − 1; that is, the sample size minus 1.

Formulas for the Population Standard Deviation
  • σ = \sqrt{\dfrac{Σ(x−μ)^2}{N}} or σ = \sqrt{\dfrac{Σf(x–μ)^2}{N}}
  • For the population standard deviation, the denominator is N, the number of items in the population.

In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f is one. If a value appears three times in the data set or population, f is three.
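
To see how the two formulas differ only in the denominator, the sketch below implements both directly and compares them with the standard library functions statistics.stdev (sample) and statistics.pstdev (population). The data values are made up for illustration and do not come from the text.

```python
from math import sqrt
from statistics import pstdev, stdev

def sample_sd(values):
    """Sample standard deviation: divide the sum of squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """Population standard deviation: divide the sum of squared deviations by N."""
    N = len(values)
    mu = sum(values) / N
    return sqrt(sum((x - mu) ** 2 for x in values) / N)

# Hypothetical data, chosen only to keep the arithmetic simple
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_sd(data), stdev(data))        # the two sample results should agree
print(population_sd(data), pstdev(data))   # the two population results should agree
```

For the same data, the sample standard deviation is slightly larger than the population standard deviation because its denominator is smaller.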


Types of Variability in Samples

When researchers study a population, they often use a sample, either for convenience or because it is not possible to access the entire population. Variability is the term used to describe the differences that may occur in these outcomes. Common types of variability include the following:

  • Observational or measurement variability
  • Natural variability
  • Induced variability
  • Sample variability
Here are some examples to describe each type of variability:

Example 1: Measurement variability
Measurement variability occurs when there are differences in the instruments used to measure or in the people using those instruments. If we are gathering data on how long it takes for a ball to drop from a height by having students measure the time of the drop with a stopwatch, we may experience measurement variability if the two stopwatches used were made by different manufacturers. For example, one stopwatch measures to the nearest second, whereas the other one measures to the nearest tenth of a second. We also may experience measurement variability because two different people are gathering the data. Their reaction times in pressing the button on the stopwatch may differ; thus, the outcomes will vary accordingly. The differences in outcomes may be affected by measurement variability.

Example 2: Natural variability
Natural variability arises from the differences that naturally occur because members of a population differ from each other. For example, if we have two identical corn plants and we expose both plants to the same amount of water and sunlight, they may still grow at different rates simply because they are two different corn plants. The difference in outcomes may be explained by natural variability.

Example 3: Induced variability
Induced variability is the counterpart to natural variability. This occurs because we have artificially induced an element of variation that, by definition, was not present naturally. For example, we assign people to two different groups to study memory, and we induce a variable in one group by limiting the amount of sleep they get. The difference in outcomes may be affected by induced variability.

Example 4: Sample variability
Sample variability occurs when multiple random samples are taken from the same population. For example, if I conduct four surveys of 50 people randomly selected from a given population, the differences in outcomes may be affected by sample variability.


Sampling Variability of a Statistic

The statistic of a sampling distribution was discussed in Descriptive Statistics: Measures of the Center of the Data. How much the statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of the mean is an example of a standard error. The standard error is the standard deviation of the sampling distribution. In other words, it measures how much the statistic typically varies from sample to sample under repeated sampling. You will cover the standard error of the mean later, in the chapter The Central Limit Theorem. The standard error of the mean is written \dfrac{σ}{\sqrt{n}}, where σ is the standard deviation of the population and n is the size of the sample.
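
As a small illustration of the formula \dfrac{σ}{\sqrt{n}}, the sketch below evaluates the standard error of the mean for a few sample sizes. The value σ = 2 and the sample sizes are assumptions chosen for this illustration only, and the function name standard_error is hypothetical.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation over sqrt(n)."""
    return sigma / sqrt(n)

sigma = 2  # assumed population standard deviation

# The standard error shrinks as the sample size grows.
errors = {n: standard_error(sigma, n) for n in (4, 16, 100)}

for n, se in errors.items():
    print(f"n = {n:>3}: standard error = {se}")
```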

NOTE
In practice, use a calculator or computer software to calculate the standard deviation. If you are using a TI-83, 83+, or 84+ calculator, you need to select the appropriate standard deviation σx or sx from the summary statistics. We will concentrate on using and interpreting the information that the standard deviation gives us. However, you should study the following step-by-step example to help you understand how the standard deviation measures variation from the mean. The calculator instructions appear at the end of this example.


Example 2.33
In a fifth-grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth-grade students. The ages are rounded to the nearest half year.

9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5

\overline x=\dfrac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20}=10.525

The average age is 10.53 years, rounded to two places.

The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square root of the variance. We will explain the parts of the table after calculating s.

Data Frequency Deviations Deviations^2 (Frequency)(Deviations^2)
x f (x – \overline x) (x – \overline x)^2 (f)(x – \overline x)^2
9 1 9 – 10.525 = –1.525 (–1.525)^2 = 2.325625 1 × 2.325625 = 2.325625
9.5 2 9.5 – 10.525 = –1.025 (–1.025)^2 = 1.050625 2 × 1.050625 = 2.101250
10 4 10 – 10.525 = –.525 (–.525)^2 = .275625 4 × .275625 = 1.1025
10.5 4 10.5 – 10.525 = –.025 (–.025)^2 = .000625 4 × .000625 = .0025
11 6 11 – 10.525 = .475 (.475)^2 = .225625 6 × .225625 = 1.35375
11.5 3 11.5 – 10.525 = .975 (.975)^2 = .950625 3 × .950625 = 2.851875

The total is 9.7375.

Table 2.32

The last column simply multiplies each squared deviation by the frequency for the corresponding data value.

The sample variance, s^2, is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 – 1):

s^2=\dfrac{9.7375}{20−1}=.5125

The sample standard deviation s is equal to the square root of the sample variance:

s=\sqrt{.5125}=.715891, which is rounded to two decimal places, s = .72.

Typically, you do the calculation for the standard deviation on your calculator or computer. The intermediate results are not rounded. This is done for accuracy.
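
The table-based calculation can be sketched in Python as follows. The code mirrors the columns of the table (deviation, squared deviation, frequency times squared deviation) and keeps intermediate results unrounded; the variable names are illustrative.

```python
from math import sqrt

# Distinct ages and their frequencies from Example 2.33
ages = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)                                            # sample size
x_bar = sum(f * x for x, f in zip(ages, freqs)) / n       # sample mean

# Last column of the table: frequency times squared deviation, summed
sum_sq = sum(f * (x - x_bar) ** 2 for x, f in zip(ages, freqs))

variance = sum_sq / (n - 1)   # sample variance: divide by n - 1
s = sqrt(variance)            # sample standard deviation

print(n, x_bar, sum_sq, variance, s)
```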

  • For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation). Verify the mean and standard deviation on a calculator or computer. Note that these formulas are derived by algebraically manipulating the z-score formulas, given either parameters or statistics.
  • For a sample: x = \overline x + \text{(#ofSTDEVs)(s)}
  • For a population: x = μ + \text{(#ofSTDEVs)(σ)}
  • For this example, use x =\overline x + \text{(#ofSTDEVs)(s)} because the data is from a sample
  1. Verify the mean and standard deviation on your calculator or computer.
  2. Find the value that is one standard deviation above the mean. Find ( \overline x+ 1s).
  3. Find the value that is two standard deviations below the mean. Find (\overline x – 2s).
  4. Find the values that are 1.5 standard deviations from (below and above) the mean.

Solution
  1. Using the TI-83, 83+, 84, 84+ Calculator

    • Clear lists L1 and L2. Press STAT 4:ClrList. Enter 2nd 1 for L1, the comma (,), and 2nd 2 for L2.
    • Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the name. Press CLEAR and arrow down.
    • Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list L2. Use the arrow keys to move around.
    • Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the comma. Press ENTER.
    • \overline x = 10.525.
    • Use Sx because this is sample data (not a population): Sx=.715891.

  2. (\overline x+ 1s) = 10.53 + (1)(.72) = 11.25

  3. (\overline x– 2s) = 10.53 – (2)(.72) = 9.09


  4. The values that are 1.5 standard deviations below and above the mean:

    • (\overline x – 1.5s) = 10.53 – (1.5)(.72) = 9.45
    • (\overline x + 1.5s) = 10.53 + (1.5)(.72) = 11.61
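
The same values can be computed in Python from the raw data. This sketch uses the unrounded mean and standard deviation, so some results differ slightly in the second decimal place from the hand calculations above, which start from the rounded values 10.53 and .72. The helper name value_at is an illustrative choice.

```python
from statistics import mean, stdev

# Ages of the 20 fifth-grade students from Example 2.33
ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
        11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

x_bar = mean(ages)   # sample mean
s = stdev(ages)      # sample standard deviation (divides by n - 1)

def value_at(k):
    """Value that lies k sample standard deviations from the sample mean."""
    return x_bar + k * s

for k in (1, -2, -1.5, 1.5):
    print(f"k = {k:>4}: value = {value_at(k):.4f}")
```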

Try It 2.33
On a baseball team, the ages of each of the players are as follows:

21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32, 33, 33, 34, 35, 36, 36, 36, 36, 38, 38, 38, 40

Use your calculator or computer to find the mean and standard deviation. Then find the value that is two standard deviations above the mean.

Explanation of the standard deviation calculation shown in the table

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11, which is indicated by the deviations .975 and .475. A positive deviation occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data value is less than the mean. The deviation is –1.525 for the data value nine. If you add the deviations, the sum is always zero. We can sum the products of the frequencies and deviations to show that the sum of the deviations is always zero.

1(−1.525)+2(−1.025)+4(−.525)+4(−.025)+6(.475)+3(.975)=0
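
This sum can be verified exactly in Python. The sketch below uses the standard library's Fraction type so that no floating-point rounding enters the calculation; that choice is made here for illustration only.

```python
from fractions import Fraction

# Distinct ages (as exact fractions) and their frequencies from Example 2.33
ages = [Fraction(a) for a in ("9", "9.5", "10", "10.5", "11", "11.5")]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)
x_bar = sum(f * x for x, f in zip(ages, freqs)) / n   # exactly 421/40 = 10.525

# Sum of (frequency times deviation) over all distinct values
total_deviation = sum(f * (x - x_bar) for x, f in zip(ages, freqs))

print(x_bar, total_deviation)
```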

For Example 2.33, there are n = 20 deviations. Because the deviations always sum to zero, you cannot simply add them to measure the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.

The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.

Notice that instead of dividing by n = 20, the calculation divided by n – 1 = 20 – 1 = 19 because the data is a sample. For the sample variance, we divide by the sample size minus one (n – 1). Why not divide by n? The answer has to do with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (n – 1) gives a better estimate of the population variance.

NOTE
Your concentration should be on what the standard deviation tells us about the data. The standard deviation is a number that measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.

The standard deviation, s or σ, is either zero or larger than zero. Describing the data with reference to the spread is called variability. The variability in data depends on the method by which the outcomes are obtained, for example, by measuring or by random sampling. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when all the data are concentrated close to the mean and larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ very large.

The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better feel for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation can be very helpful, but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be confusing, always graph your data. Display your data in a histogram or a box plot.

Example 2.34
Use the following data (first exam scores) from Susan Dean's spring precalculus class:

33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100

  1. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative frequencies to three decimal places.
  2. Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:
    • The sample mean
    • The sample standard deviation
    • The median
    • The first quartile
    • The third quartile
    • IQR
  3. Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the histogram, and the chart.

Solution
  1. See Table 2.33.
  2. Entering the data values into a list in your graphing calculator and then selecting Stat, Calc, and 1-Var Stats will produce the one-variable statistics you need.
  3. The x-axis goes from 32.5 to 100.5; the y-axis goes from –2.4 to 15 for the histogram. The number of intervals is 5, so the width of an interval is (100.5 – 32.5) divided by 5, equal to 13.6. Endpoints of the intervals are as follows: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 = 59.7, 59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending value; no data values fall on an interval boundary.

    A hybrid image displaying both a histogram and box plot described in detail in the answer solution above.

    Figure 2.27

The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the exam scores in the lower 50 percent is greater (73 – 33 = 40) than the spread in the upper 50 percent (100 – 73 = 27). The histogram, box plot, and chart all reflect this. There are a substantial number of A and B grades (80s, 90s, and 100). The histogram clearly shows this. The box plot shows us that the middle 50 percent of the exam scores (IQR = 29) are Ds, Cs, and Bs. The box plot also shows us that the lower 25 percent of the exam scores are Ds and Fs.

Data Frequency Relative Frequency Cumulative Relative Frequency
33 1 .032 .032
42 1 .032 .064
49 2 .065 .129
53 1 .032 .161
55 2 .065 .226
61 1 .032 .258
63 1 .032 .290
67 1 .032 .322
68 2 .065 .387
69 2 .065 .452
72 1 .032 .484
73 1 .032 .516
74 1 .032 .548
78 1 .032 .580
80 1 .032 .612
83 1 .032 .644
88 3 .097 .741
90 1 .032 .773
92 1 .032 .805
94 4 .129 .934
96 1 .032 .966
100 1 .032 .998 (Why isn't this value 1?)

Table 2.33
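
The one-variable statistics requested in Example 2.34 can also be computed in Python. This sketch uses the standard library's statistics module (the quantiles function requires Python 3.8 or later). Quartile conventions differ between tools; the "exclusive" method is used here as an assumption because it reproduces the median of 73 and the IQR of 29 quoted above, and other software may report slightly different quartiles.

```python
from statistics import mean, median, quantiles, stdev

# First exam scores from Example 2.34
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

x_bar = mean(scores)     # sample mean
s = stdev(scores)        # sample standard deviation
med = median(scores)     # median

# Cut points that split the data into four equal parts: Q1, median, Q3
q1, _, q3 = quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print(f"mean = {x_bar:.1f}, s = {s:.1f}, median = {med}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```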

Try It 2.34
The following data show the different types of pet food that stores in the area carry:

6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12

Calculate the sample mean and the sample standard deviation to one decimal place using a TI-83+ or TI-84 calculator.


Standard Deviation of Grouped Frequency Tables

Recall that for grouped data we do not know individual data values, so we cannot describe the typical value of the data with precision. In other words, we cannot find the exact mean, median, or mode. We can, however, determine the best estimate of the measures of center by finding the mean of the grouped data with the formula \text{Mean of Frequency Table}=\dfrac{∑fm}{∑f}, where f= interval frequencies and m = interval midpoints.

Just as we could not find the exact mean, neither can we find the exact standard deviation. Remember that standard deviation describes numerically the expected deviation a data value has from the mean. In simple English, the standard deviation allows us to compare how unusual individual data are when compared to the mean.

Example 2.35
Find the standard deviation for the data in Table 2.34.

Class Frequency, f Midpoint, m m^2 \overline x fm^2 Standard Deviation
0–2 1 1 1 7.58 1 3.5
3–5 6 4 16 7.58 96 3.5
6–8 10 7 49 7.58 490 3.5
9–11 7 10 100 7.58 700 3.5
12–14 0 13 169 7.58 0 3.5
15–17 2 16 256 7.58 512 3.5

Table 2.34

For this data set, we have the mean, \overline x = 7.58, and the standard deviation, s_x = 3.5. This means that a randomly selected data value would be expected to be 3.5 units from the mean. If we look at the first class, we see that the class midpoint is equal to one. This is almost two full standard deviations from the mean since 7.58 – 3.5 – 3.5 = .58. The formula for calculating the standard deviation is not complicated: s_x=\sqrt{\dfrac{∑f(m−\overline x)^2}{n−1}}, where s_x = sample standard deviation, \overline x = sample mean, f = interval frequency, m = interval midpoint, and n = ∑f. The calculations, however, are tedious. It is usually best to use technology when performing the calculations.
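
A minimal Python sketch of the grouped calculation is shown below. It uses the class midpoints and frequencies from Table 2.34 and treats the data as a sample, so the sum of squared deviations is divided by n − 1; the variable names are illustrative.

```python
from math import sqrt

# Class midpoints and frequencies from Table 2.34
midpoints = [1, 4, 7, 10, 13, 16]
freqs = [1, 6, 10, 7, 0, 2]

n = sum(freqs)                                                # total number of data values
x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n      # estimated mean

# Estimated sample standard deviation from the grouped data
s_x = sqrt(sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1))

print(n, round(x_bar, 2), round(s_x, 1))
```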

Try It 2.35
Find the standard deviation for the data from the previous example:

Class Frequency, f
0–2 1
3–5 6
6–8 10
9–11 7
12–14 0
15–17 2

Table 2.35

First, press the STAT key and select 1:Edit.

Calculator screen showing the EDIT, CALC, and TESTS menus, with the EDIT options listed.
Figure 2.28

Input the midpoint values into L1 and the frequencies into L2.

Calculator list editor showing columns L1, L2, and L3, with the midpoints 1, 4, 7, 10, 13, and 16 in L1 and the frequencies in L2.
Figure 2.29

Select STAT, CALC, and 1: 1-Var Stats.

Calculator screen showing the EDIT, CALC, and TESTS menus with CALC highlighted and 1:1-Var Stats among the options.
Figure 2.30

Press 2nd, then 1 (for L1), then the comma key, then 2nd, then 2 (for L2), and then ENTER.

Calculator screen showing the 1-Var Stats output. The first line reads \overline x = 7.576923077.
Figure 2.31

You will see displayed both a population standard deviation, σ_x, and the sample standard deviation, s_x.
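
The same two values can be reproduced in Python. The sketch below assumes the standard library `statistics` module and mimics entering midpoints in L1 and frequencies in L2 by repeating each midpoint as many times as its frequency. The list name `expanded` is our own.

```python
import statistics

# Midpoints and frequencies from the frequency table (L1 and L2 on the calculator)
midpoints = [1, 4, 7, 10, 13, 16]
frequencies = [1, 6, 10, 7, 0, 2]

# Repeat each midpoint f times to stand in for the unknown raw data
expanded = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

sample_sd = statistics.stdev(expanded)       # divides by n - 1, matches s_x
population_sd = statistics.pstdev(expanded)  # divides by n, matches sigma_x

print(round(sample_sd, 4), round(population_sd, 4))
```

The sample standard deviation is slightly larger than the population standard deviation because it divides by n − 1 rather than n.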

Comparing Values from Different Data Sets

As explained before, a z-score allows us to compare statistics from different data sets. If the data sets have different means and standard deviations, then comparing the data values directly can be misleading.

  • For each data value, calculate how many standard deviations away from its mean the value is.
  • In symbols, the formulas for calculating z-scores become the following.

Sample         z=\dfrac{x −\overline x}{s}
Population         z=\dfrac{x − μ}{σ}

Table 2.36

As shown in the table, when only a sample mean and sample standard deviation are given, the top formula is used. When the population mean and population standard deviation are given, the bottom formula is used.

Example 2.36
Two students, John and Ali, from different high schools, wanted to find out who had the highest GPA when compared to his school. Which student had the highest GPA when compared to his school?

Student    GPA     School Mean GPA    School Standard Deviation
John       2.85    3.0                0.7
Ali        77      80                 10

Table 2.37

Solution
For each student, determine how many standard deviations (# of STDEVs) his GPA is away from the average for his school. Pay careful attention to signs when comparing and interpreting the answer.

z=\text{# of STDEVs}=\dfrac{\text{value − mean}}{\text{standard deviation}}=\dfrac{x−μ}{σ}

For John, z=\text{# of STDEVs}=\dfrac{2.85−3.0}{0.7}=−0.21

For Ali, z=\text{# of STDEVs}=\dfrac{77−80}{10}=−0.3

John has the better GPA when compared to his school because his GPA is 0.21 standard deviations below his school's mean, while Ali's GPA is 0.3 standard deviations below his school's mean.

John's z-score of −0.21 is higher than Ali's z-score of −0.3. For GPA, higher values are better, so we conclude that John has the better GPA when compared to his school. The z-score representing John's score does not fall as far below the mean as the z-score representing Ali's score.
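
The comparison above can be checked with a few lines of Python. This is a small sketch. The helper name `z_score` is our own, and the inputs are the values from the table in this example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std_dev

# Values from Example 2.36; each student is compared to his own school
john_z = z_score(2.85, 3.0, 0.7)
ali_z = z_score(77, 80, 10)

print(round(john_z, 2), round(ali_z, 2))  # prints: -0.21 -0.3

# For GPA, the higher (less negative) z-score is the better relative result
print("John" if john_z > ali_z else "Ali")  # prints: John
```

The two schools use different grading scales, so the raw GPAs cannot be compared directly. The z-scores put both on the same scale.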

Try It 2.36
Two swimmers, Angie and Beth, from different teams, wanted to find out who had the fastest time for the 50-meter freestyle when compared to her team. Which swimmer had the fastest time when compared to her team?

Swimmer    Time (seconds)    Team Mean Time    Team Standard Deviation
Angie      26.2              27.2              0.8
Beth       27.3              30.1              1.4

Table 2.38

The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data.

For any data set, no matter what the distribution of the data is, the following are true:
  • At least 75 percent of the data is within two standard deviations of the mean.
  • At least 89 percent of the data is within three standard deviations of the mean.
  • At least 95 percent of the data is within 4.5 standard deviations of the mean.
  • This is known as Chebyshev's Rule.
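
The three percentages listed above all come from one expression. Chebyshev's inequality is usually stated as: at least 1 − 1/k² of the data lies within k standard deviations of the mean, for any k greater than 1. The sketch below evaluates that bound. The function name is our own.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4.5):
    print(k, round(chebyshev_lower_bound(k) * 100, 1))
# prints:
# 2 75.0
# 3 88.9
# 4.5 95.1
```

These are guaranteed minimums, not typical values. Most real data sets have a larger share of values inside each interval.
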
A bell-shaped distribution is one that is normal and symmetric, meaning the curve can be folded along a line of symmetry drawn through the median, and the left and right sides of the curve would match. With a bell-shaped distribution, the mean, median, and mode are all located at the same place.


For data having a distribution that is bell-shaped and symmetric, the following are true:
  • Approximately 68 percent of the data is within one standard deviation of the mean.
  • Approximately 95 percent of the data is within two standard deviations of the mean.
  • More than 99 percent of the data is within three standard deviations of the mean.
  • This is known as the Empirical Rule.
  • It is important to note that this rule applies only when the shape of the distribution of the data is bell-shaped and symmetric; we will learn more about this when studying the Normal or Gaussian probability distribution in later chapters.
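
For an exactly normal distribution, the Empirical Rule percentages can be computed rather than memorized. The sketch below relies on a standard identity: the fraction of a normal distribution within k standard deviations of the mean equals erf(k/√2). It uses `math.erf` from the Python standard library, and the function name is our own.

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_fraction_within(k) * 100, 2))
# prints:
# 1 68.27
# 2 95.45
# 3 99.73
```

Real data that is only roughly bell-shaped will match these percentages approximately, which is why the rule is stated with "approximately".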

Descriptive Statistics

Stats Lab

Descriptive Statistics

Student Learning Outcomes

  • The student will construct a histogram and a box plot.
  • The student will calculate univariate statistics.
  • The student will examine the graphs to interpret what the data imply.


Collect the Data

Record the number of pairs of shoes you own.

  1. Randomly survey 30 classmates about the number of pairs of shoes they own. Record their values.
  2. Construct a histogram. Make five to six intervals. Sketch the graph using a ruler and pencil and scale the axes.

    A blank graph template for use with this problem.

    Figure 2.32
  3. Calculate the following values:
    1.  \overline x= _____
    2. s = _____
  4. Are the data discrete or continuous? How do you know?
  5. In complete sentences, describe the shape of the histogram.
  6. Are there any potential outliers? List the value(s) that could be outliers. Use a formula to check the end values to determine if they are potential outliers.
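
Once the survey is done, the histogram counts, mean, and standard deviation can be checked in Python. The sketch below uses a made-up list of 30 values purely as a placeholder for your own survey results. The starting point of 1.5 and the interval width of 6 are also assumptions chosen to give five intervals for this placeholder data.

```python
import statistics

# HYPOTHETICAL survey results (30 classmates); replace with your own data
shoes = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7,
         7, 7, 8, 8, 8, 9, 9, 10, 10, 12, 12, 14, 15, 20, 30]

mean = statistics.mean(shoes)
s = statistics.stdev(shoes)  # sample standard deviation
print(round(mean, 2), round(s, 2))

# Five intervals of width 6, starting half a unit below the smallest value
start, width, n_bins = 1.5, 6, 5
counts = [0] * n_bins
for x in shoes:
    counts[int((x - start) // width)] += 1

# Text histogram: one asterisk per classmate in each interval
for i, count in enumerate(counts):
    left = start + i * width
    print(f"{left:4.1f} to {left + width:4.1f}: {'*' * count}")
```

Starting the first interval at a half-unit boundary keeps every whole-number response strictly inside one interval.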


Analyze the Data
  1. Determine the following values:
    1. Min = _____
    2. M = _____
    3. Max = _____
    4. Q1 = _____
    5. Q3 = _____
    6. IQR = _____
  2. Construct a box plot of the data.
  3. What does the shape of the box plot imply about the concentration of data? Use complete sentences.
  4. Using the box plot, how can you determine if there are potential outliers?
  5. How does the standard deviation help you to determine concentration of the data and whether there are potential outliers?
  6. What does the IQR represent in this problem?
  7. Show your work to find the value that is 1.5 standard deviations
    1. above the mean.
    2. below the mean.
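
The values requested in this part of the lab can be checked with a short Python sketch. The data list is made up and stands in for your own survey. Quartiles are found as the medians of the lower and upper halves of the sorted data, and potential outliers are flagged with the common 1.5 × IQR rule. Other software may use slightly different quartile conventions, so small differences are possible.

```python
import statistics

# HYPOTHETICAL survey results (30 classmates); replace with your own data
shoes = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7,
         7, 7, 8, 8, 8, 9, 9, 10, 10, 12, 12, 14, 15, 20, 30]

data = sorted(shoes)
n = len(data)
half = n // 2

median = statistics.median(data)
q1 = statistics.median(data[:half])          # median of the lower half
q3 = statistics.median(data[half + n % 2:])  # median of the upper half
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Values 1.5 standard deviations above and below the mean
mean = statistics.mean(data)
s = statistics.stdev(data)
above = mean + 1.5 * s
below = mean - 1.5 * s

print(data[0], q1, median, q3, data[-1], iqr)
print(outliers)
print(round(below, 2), round(above, 2))
```

For this placeholder data the five-number summary is 2, 5, 7, 10, 30, and the values 20 and 30 fall above the upper fence of 17.5.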