Graphing

Site: Saylor Academy
Course: MA121: Introduction to Statistics
Book: Graphing

Description

Read these sections and complete the questions at the end of each section. First, we'll look at the available methods to portray distributions of quantitative variables. Then, we'll introduce the stem and leaf plot and how to capture the frequency of your data. We'll also discuss box plots, which are useful for identifying outliers and comparing distributions, and bar charts for quantitative variables. Finally, we'll talk about line graphs, which are based on bar graphs.

Quantitative Variables

  1. Stem and Leaf Displays
  2. Histograms
  3. Frequency Polygons
  4. Box Plots
  5. Box Plot Demonstration
  6. Bar Charts
  7. Line Graphs
  8. Dot Plots

As discussed in the section on variables in Chapter 1, quantitative variables are variables measured on a numeric scale. Height, weight, response time, subjective rating of pain, temperature, and score on an exam are all examples of quantitative variables. Quantitative variables are distinguished from categorical (sometimes called qualitative) variables such as favorite color, religion, city of birth, and favorite sport in which there is no ordering or measuring involved.

There are many types of graphs that can be used to portray distributions of quantitative variables. The upcoming sections cover the following types of graphs: (1) stem and leaf displays, (2) histograms, (3) frequency polygons, (4) box plots, (5) bar charts, (6) line graphs, (7) scatter plots (discussed in a different chapter), and (8) dot plots. Some graph types such as stem and leaf displays are best-suited for small to moderate amounts of data, whereas others such as histograms are best-suited for large amounts of data. Graph types such as box plots are good at depicting differences between distributions. Scatter plots are used to show the relationship between two variables.


Source: David M. Lane, https://onlinestatbook.com/2/graphing_distributions/quantitative.html
This work is in the Public Domain.

Stem and Leaf Displays

Learning Objectives

  1. Create and interpret basic stem and leaf displays
  2. Create and interpret back-to-back stem and leaf displays
  3. Judge whether a stem and leaf display is appropriate for a given data set

A stem and leaf display is a graphical method of displaying data. It is particularly useful when your data are not too numerous. In this section, we will explain how to construct and interpret this kind of graph.

As usual, an example will get us started. Consider Table 1, which shows the number of touchdown passes (TD passes) thrown by each of the 31 teams in the National Football League in the 2000 season.

Table 1. Number of touchdown passes.

37, 33, 33, 32, 29, 28, 28, 23,
22, 22, 22, 21, 21, 21, 20, 20,
19, 19, 18, 18, 18, 18, 16, 15,
14, 14, 14, 12, 12, 9, 6

A stem and leaf display of the data is shown in Figure 1. The left portion of Figure 1 contains the stems. They are the numbers 3, 2, 1, and 0, arranged as a column to the left of the bars. Think of these numbers as 10's digits. A stem of 3, for example, can be used to represent the 10's digit in any of the numbers from 30 to 39. The numbers to the right of the bar are leaves, and they represent the 1's digits. Every leaf in the graph therefore stands for the result of adding the leaf to 10 times its stem.

3|2337
2|001112223889
1|2244456888899
0|69


Figure 1. Stem and leaf display of the number of touchdown passes.

To make this clear, let us examine Figure 1 more closely. In the top row, the four leaves to the right of stem 3 are 2, 3, 3, and 7. Combined with the stem, these leaves represent the numbers 32, 33, 33, and 37, which are the numbers of TD passes for the first four teams in Table 1. The next row has a stem of 2 and 12 leaves. Together, they represent 12 data points, namely, two occurrences of 20 TD passes, three occurrences of 21 TD passes, three occurrences of 22 TD passes, one occurrence of 23 TD passes, two occurrences of 28 TD passes, and one occurrence of 29 TD passes. We leave it to you to figure out what the third row represents. The fourth row has a stem of 0 and two leaves. It stands for the last two entries in Table 1, namely 9 TD passes and 6 TD passes. (The latter two numbers may be thought of as 09 and 06).

One purpose of a stem and leaf display is to clarify the shape of the distribution. You can see many facts about TD passes more easily in Figure 1 than in Table 1. For example, by looking at the stems and the shape of the plot, you can tell that most of the teams had between 10 and 29 passing TDs, with a few having more and a few having less. The precise numbers of TD passes can be determined by examining the leaves.
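The construction described above can be sketched in a few lines of R. This is a minimal illustration, not the method used to draw Figure 1, and the variable names (`td`, `stems`, `leaves`, `rows`) are our own. Base R also offers `stem()`, which prints a similar display but may order the stems and choose the scale differently.

```r
# Touchdown passes from Table 1
td <- c(37, 33, 33, 32, 29, 28, 28, 23,
        22, 22, 22, 21, 21, 21, 20, 20,
        19, 19, 18, 18, 18, 18, 16, 15,
        14, 14, 14, 12, 12, 9, 6)

stems  <- td %/% 10   # 10's digit
leaves <- td %% 10    # 1's digit

# Collect the sorted leaves for each stem into one string per row
rows <- sapply(split(leaves, stems),
               function(x) paste(sort(x), collapse = ""))

# Print the rows with the largest stem on top, as in Figure 1
for (s in rev(names(rows))) {
  cat(s, "|", rows[[s]], "\n", sep = "")
}
```

Each printed row can be read back as data by adding the leaf to 10 times its stem.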

We can make our figure even more revealing by splitting each stem into two parts. Figure 2 shows how to do this. The top row is reserved for numbers from 35 to 39 and holds only the 37 TD passes made by the first team in Table 1. The second row is reserved for the numbers from 30 to 34 and holds the 32, 33, and 33 TD passes made by the next three teams in the table. You can see for yourself what the other rows represent.

3 | 7
3 | 233
2 | 889
2 | 001112223
1 | 56888899
1 | 22444
0 | 69


Figure 2. Stem and leaf display with the stems split in two.

Figure 2 is more revealing than Figure 1 because the latter figure lumps too many values into a single row. Whether you should split stems in a display depends on the exact form of your data. If rows get too long with single stems, you might try splitting them into two or more parts.
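Splitting stems amounts to grouping the values into bins of width 5 instead of width 10. The sketch below, which uses our own variable names and is only one of several ways to do this, reproduces the rows of Figure 2 from the Table 1 data.

```r
td <- c(37, 33, 33, 32, 29, 28, 28, 23,
        22, 22, 22, 21, 21, 21, 20, 20,
        19, 19, 18, 18, 18, 18, 16, 15,
        14, 14, 14, 12, 12, 9, 6)

# Each half-stem covers five values: 35-39 -> 7, 30-34 -> 6, and so on
half <- td %/% 5

rows <- sapply(split(td %% 10, half),
               function(x) paste(sort(x), collapse = ""))

for (h in rev(names(rows))) {
  stem <- as.integer(h) %/% 2     # recover the 10's digit for the label
  cat(stem, " | ", rows[[h]], "\n", sep = "")
}
```

Note that `split()` drops half-stems that contain no data (here, the values 0 to 4), which is why only seven rows are printed.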

There is a variation of stem and leaf displays that is useful for comparing distributions. The two distributions are placed back to back along a common column of stems. The result is a "back-to-back stem and leaf graph". Figure 3 shows such a graph. It compares the numbers of TD passes in the 1998 and 2000 seasons. The stems are in the middle, the leaves to the left are for the 1998 data, and the leaves to the right are for the 2000 data. For example, the second-to-last row shows that in 1998 there were teams with 11, 12, and 13 TD passes, and in 2000 there were two teams with 12 and three teams with 14 TD passes.

       11 | 4 |
          | 3 | 7
      332 | 3 | 233
     8865 | 2 | 889
 44331110 | 2 | 001112223
987776665 | 1 | 56888899
      321 | 1 | 22444
        7 | 0 | 69


Figure 3. Back-to-back stem and leaf display. The left side shows the 1998 TD data and the right side shows the 2000 TD data.

Figure 3 helps us see that the two seasons were similar, but that only in 1998 did any teams throw more than 40 TD passes.

There are two things about the football data that make them easy to graph with stems and leaves. First, the data are limited to whole numbers that can be represented with a one-digit stem and a one-digit leaf. Second, all the numbers are positive. If the data include numbers with three or more digits, or contain decimals, they can be rounded to two-digit accuracy. Negative values are also easily handled. Let us look at another example.

Table 2 shows data from the case study Weapons and Aggression. Each value is the mean difference over a series of trials between the times it took an experimental subject to name aggressive words (like "punch") under two conditions. In one condition, the words were preceded by a non-weapon word such as "bug". In the second condition, the same words were preceded by a weapon word such as "gun" or "knife". The issue addressed by the experiment was whether a preceding weapon word would speed up (or prime) pronunciation of the aggressive word compared to a non-weapon priming word. A positive difference implies greater priming of the aggressive word by the weapon word. Negative differences imply that the priming by the weapon word was less than for a neutral word.

Table 2. The effects of priming (thousandths of a second).

43.2, 42.9, 35.6, 25.6, 25.4, 23.6,
20.5, 19.9, 14.4, 12.7, 11.3, 10.2,
10.0, 9.1, 7.5, 5.4, 4.7, 3.8, 2.1, 1.2,
-0.2, -6.3, -6.7, -8.8, -10.4, -10.5,
-14.9, -14.9, -15.0, -18.5, -27.4

You see that the numbers range from 43.2 to -27.4. The first value indicates that one subject was 43.2 milliseconds faster pronouncing aggressive words when they were preceded by weapon words than when preceded by neutral words. The value -27.4 indicates that another subject was 27.4 milliseconds slower pronouncing aggressive words when they were preceded by weapon words.

The data are displayed with stems and leaves in Figure 4. Since stem and leaf displays can only portray two whole digits (one for the stem and one for the leaf), the numbers are first rounded. Thus, the value 43.2 is rounded to 43 and represented with a stem of 4 and a leaf of 3. Similarly, 42.9 is rounded to 43. To represent negative numbers, we simply use negative stems. For example, the bottom row of the figure represents the number -27. The second-to-last row represents the numbers -10, -10, -15, etc. Once again, we have rounded the original values from Table 2.

 4 | 33
 3 | 6
 2 | 00456
 1 | 00134
 0 | 1245589
-0 | 0679
-1 | 005559
-2 | 7


Figure 4. Stem and leaf display with negative numbers and rounding.

Observe that the figure contains a row headed by "0" and another headed by "-0". The stem of 0 is for numbers between 0 and 9, whereas the stem of -0 is for numbers between 0 and -9. For example, the fifth row of the figure holds the numbers 1, 2, 4, 5, 5, 8, 9 and the sixth row holds 0, -6, -7, and -9. Values that are exactly 0 before rounding should be split as evenly as possible between the "0" and "-0" rows. In Table 2, none of the values are 0 before rounding. The "0" that appears in the "-0" row comes from the original value of -0.2 in the table.
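The rounding and the "0" versus "-0" stems can be sketched in R as follows. The variable names are ours, and the row order is written out by hand for this particular data set. One caveat: R's `round()` resolves exact halves to the even digit, so the tie at -18.5 yields a leaf of 8 here rather than the 9 shown in Figure 4. All other rows agree with the figure.

```r
# Priming effects from Table 2 (thousandths of a second)
priming <- c(43.2, 42.9, 35.6, 25.6, 25.4, 23.6,
             20.5, 19.9, 14.4, 12.7, 11.3, 10.2,
             10.0, 9.1, 7.5, 5.4, 4.7, 3.8, 2.1, 1.2,
             -0.2, -6.3, -6.7, -8.8, -10.4, -10.5,
             -14.9, -14.9, -15.0, -18.5, -27.4)

r <- abs(round(priming))    # rounded magnitude

# Take the sign from the ORIGINAL value so that -0.2 lands on the "-0" stem
stem <- paste0(ifelse(priming < 0, "-", ""), r %/% 10)
leaf <- r %% 10

rows <- sapply(split(leaf, stem),
               function(x) paste(sort(x), collapse = ""))

# Print from the largest positive stem down to the most negative stem
for (s in c("4", "3", "2", "1", "0", "-0", "-1", "-2")) {
  cat(format(s, width = 2, justify = "right"), " | ", rows[[s]], "\n", sep = "")
}
```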

Although stem and leaf displays are unwieldy for large data sets, they are often useful for data sets with up to 200 observations. Figure 5 portrays the distribution of populations of 185 US cities in 1998. To be included, a city had to have between 100,000 and 500,000 residents.

4 | 899
4 | 6
4 | 4455
4 | 333
4 | 01
3 | 99
3 | 677777
3 | 55
3 | 223
3 | 111
2 | 8899
2 | 666667
2 | 444455
2 | 22333
2 | 000000
1 | 88888888888899999999999
1 | 666666777777
1 | 444444444444555555555555
1 | 2222222222222222222333333333
1 | 000000000000000111111111111111111111111111


Figure 5. Stem and leaf display of populations of 185 US cities with populations between 100,000 and 500,000 in 1998.

Since a stem and leaf plot shows only two-place accuracy, we had to round the numbers to the nearest 10,000. For example, the largest number (493,559) was rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9. The fourth highest number (463,201) was rounded to 460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of 100,000 and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1, 2-3, 4-5, 6-7, and 8-9.
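As a small worked example, the two populations quoted above can be converted to stems and leaves in R. This is a sketch with our own variable names; `part` is a hypothetical index (0 to 4) for the five-way split of each stem.

```r
pop <- c(493559, 463201)                  # the two populations quoted above

rounded <- round(pop, -4)                 # round to the nearest 10,000
stem <- rounded %/% 100000                # units of 100,000
leaf <- (rounded %% 100000) %/% 10000     # units of 10,000
part <- leaf %/% 2                        # 0: leaves 0-1, 1: 2-3, ..., 4: 8-9

print(data.frame(pop, rounded, stem, leaf, part))
```

Both cities get a stem of 4. The first gets a leaf of 9 and falls in the 8-9 part; the second gets a leaf of 6 and falls in the 6-7 part.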

Whether your data can be suitably represented by a stem and leaf graph depends on whether they can be rounded without loss of important information. Also, their extreme values must fit into two successive digits, as the data in Figure 5 fit into the 10,000 and 100,000 places (for leaves and stems, respectively). Deciding what kind of graph is best suited to displaying your data thus requires good judgment. Statistics is not just recipes!

Questions

Question 1 out of 7.

A stem and leaf display is a good method of displaying large amounts of data.

  • True
  • False


Question 2 out of 7.

The highest value in the dataset is:

13 | 2
12 | 26
11 | 03457
10 | 011222445556788
9 | 001112334445555666777789
8 | 000011223333344444555666788999999
7 | 000011111366777888
6 | 00012223444455556677789999
5 | 1355778899
4 | 4789
3 | 456689
2 | 59
1 | 9

Multiply stems by 10.0.


Question 3 out of 7.
The number of scores with a stem of 4 is:

13 | 2
12 | 26
11 | 03457
10 | 011222445556788
9 | 001112334445555666777789
8 | 000011223333344444555666788999999
7 | 000011111366777888
6 | 00012223444455556677789999
5 | 1355778899
4 | 4789
3 | 456689
2 | 59
1 | 9

Multiply stems by 10.0.


Question 4 out of 7.
The largest value is:

 6 | 0
5 |
4 | 6
3 | 2244
2 | 1688
1 | 23667
0 | 457
-0 | 1234679
-1 | 011
-2 | 2568

Multiply stems by 1.0.

Question 5 out of 7.
The smallest value is:

 6 | 0
5 |
4 | 6
3 | 2244
2 | 1688
1 | 23667
0 | 457
-0 | 1234679
-1 | 011
-2 | 2568

Multiply stems by 1.0.


Question 6 out of 7.

Stem and leaf displays are good for comparing two groups.

  • True
  • False


Question 7 out of 7.

Stem and leaf displays are good for comparing three groups.

  • True
  • False

Answers


  1. False: Stem and leaf displays can be unwieldy with large amounts of data because every single data value is shown in the figure.

  2. 132: The highest value is 132. You multiply the stem by 10 and add the leaf. For the highest value, this is (10)(13)+2.

  3. 4: There are four scores with a stem of four: 44, 47, 48, and 49.

  4. 6: The stems are multiplied by one, so the highest is (1)(6)+0.0.

  5. -2.8: The stems are multiplied by one, so the lowest is (1)(-2) - 0.8. Note that with negative stems, the leaves are subtracted.

  6. True: Back-to-back stem and leaf displays are good for comparing two groups.

  7. False: Stem and leaf displays are not well suited for comparing three or more groups.

Histograms

Learning Objectives

  1. Create a grouped frequency distribution
  2. Create a histogram based on a grouped frequency distribution
  3. Determine an appropriate bin width

A histogram is a graphical method for displaying the shape of a distribution. It is particularly useful when there are a large number of observations. We begin with an example consisting of the scores of 642 students on a psychology test. The test consists of 197 items, each graded as "correct" or "incorrect". The students' scores ranged from 46 to 167.

The first step is to create a frequency table. Unfortunately, a simple frequency table would be too big, containing over 100 rows. To simplify the table, we group scores together as shown in Table 1.

 

Table 1. Grouped Frequency Distribution of Psychology Test Scores

Interval's Lower Limit    Interval's Upper Limit    Class Frequency
         39.5                      49.5                     3
         49.5                      59.5                    10
         59.5                      69.5                    53
         69.5                      79.5                   107
         79.5                      89.5                   147
         89.5                      99.5                   130
         99.5                     109.5                    78
        109.5                     119.5                    59
        119.5                     129.5                    36
        129.5                     139.5                    11
        139.5                     149.5                     6
        149.5                     159.5                     1
        159.5                     169.5                     1

 

To create this table, the range of scores was broken into intervals, called class intervals. The first interval is from 39.5 to 49.5, the second from 49.5 to 59.5, etc. Next, the number of scores falling into each interval was counted to obtain the class frequencies. There are three scores in the first interval, 10 in the second, etc.

Class intervals of width 10 provide enough detail about the distribution to be revealing without making the graph too "choppy". More information on choosing the widths of class intervals is presented later in this section. Placing the limits of the class intervals midway between two numbers (e.g., 49.5) ensures that every score will fall in an interval rather than on the boundary between intervals.
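A grouped frequency distribution of this kind can be sketched in R with `cut()` and `table()`. The psychology test scores themselves are not reproduced in this section, so the `scores` vector below is hypothetical and serves only to illustrate the mechanics.

```r
# Hypothetical whole-number scores, for illustration only
scores <- c(46, 52, 58, 61, 64, 67, 70, 73, 75, 79,
            80, 84, 88, 91, 99, 103)

# Limits midway between whole numbers, so no score can sit on a boundary
breaks <- seq(39.5, 109.5, by = 10)

# Count how many scores fall into each class interval
freq <- table(cut(scores, breaks = breaks, dig.lab = 4))
print(freq)

# The same intervals can be passed to hist() to draw the bars
hist(scores, breaks = breaks, xlab = "Score", main = "Hypothetical scores")
```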

In a histogram, the class frequencies are represented by bars. The height of each bar corresponds to its class frequency. A histogram of these data is shown in Figure 1.

Figure 1. Histogram of scores on a psychology test.


The histogram makes it plain that most of the scores are in the middle of the distribution, with fewer scores in the extremes. You can also see that the distribution is not symmetric: the scores extend to the right farther than they do to the left. The distribution is therefore said to be skewed. (We'll have more to say about shapes of distributions in the chapter "Summarizing Distributions.")

In our example, the observations are whole numbers. Histograms can also be used when the scores are measured on a more continuous scale such as the length of time (in milliseconds) required to perform a task. In this case, there is no need to worry about fence-sitters since they are improbable. (It would be quite a coincidence for a task to require exactly 7 seconds, measured to the nearest thousandth of a second). We are therefore free to choose whole numbers as boundaries for our class intervals, for example, 4000, 5000, etc. The class frequency is then the number of observations that are greater than or equal to the lower bound, and strictly less than the upper bound. For example, one interval might hold times from 4000 to 4999 milliseconds. Using whole numbers as boundaries avoids a cluttered appearance, and is the practice of many computer programs that create histograms. Note also that some computer programs label the middle of each interval rather than the end points.

Histograms can be based on relative frequencies instead of actual frequencies. Histograms based on relative frequencies show the proportion of scores in each interval rather than the number of scores. In this case, the Y-axis runs from 0 to 1 (or somewhere in between if there are no extreme proportions). You can change a histogram based on frequencies to one based on relative frequencies by (a) dividing each class frequency by the total number of observations, and then (b) plotting the quotients on the Y-axis (labeled as proportion).
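Steps (a) and (b) can be sketched in R using the class frequencies from Table 1. The variable names and the use of `barplot()` with touching bars are our own choices, not the way the original figure was made.

```r
# Class frequencies from Table 1 (intervals 39.5-49.5 through 159.5-169.5)
counts    <- c(3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1)
midpoints <- seq(44.5, 164.5, by = 10)

# (a) divide each class frequency by the total number of observations
rel <- counts / sum(counts)

# (b) plot the quotients on the Y-axis, labeled as a proportion
barplot(rel, names.arg = midpoints, space = 0,
        xlab = "Test score (interval midpoint)", ylab = "Proportion")
```

The proportions sum to 1, and the tallest bar (147 of 642 scores) has a height of about 0.23.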

There is more to be said about the widths of the class intervals, sometimes called bin widths. Your choice of bin width determines the number of class intervals. This decision, along with the choice of starting point for the first interval, affects the shape of the histogram. There are some "rules of thumb" that can help you choose an appropriate width. (But keep in mind that none of the rules is perfect.) Sturges' rule is to set the number of intervals as close as possible to $1 + \log_2(N)$, where $\log_2(N)$ is the base 2 log of the number of observations. The formula can also be written as $1 + 3.3 \log_{10}(N)$, where $\log_{10}(N)$ is the log base 10 of the number of observations. According to Sturges' rule, 1000 observations would be graphed with 11 class intervals since 10 is the closest integer to $\log_2(1000)$. We prefer the Rice rule, which is to set the number of intervals to twice the cube root of the number of observations. In the case of 1000 observations, the Rice rule yields 20 intervals instead of the 11 recommended by Sturges' rule. For the psychology test example used above, Sturges' rule recommends 10 intervals while the Rice rule recommends 17. In the end, we compromised and chose 13 intervals for Figure 1 to create a histogram that seemed clearest. The best advice is to experiment with different choices of width, and to choose a histogram according to how well it communicates the shape of the distribution.
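Both rules are easy to check numerically. The helper functions `sturges()` and `rice()` below are our own and round to the nearest integer, following the wording above. Base R's `nclass.Sturges()` rounds up instead, so it can return one more interval than this sketch.

```r
# Sturges' rule: number of intervals closest to 1 + log2(N)
sturges <- function(n) round(1 + log2(n))

# Rice rule: twice the cube root of N, rounded to the nearest integer
rice <- function(n) round(2 * n^(1/3))

n <- c(1000, 642)   # the two sample sizes discussed in the text
print(data.frame(n, sturges = sturges(n), rice = rice(n)))
```

For 1000 observations this gives 11 and 20 intervals, and for the 642 test scores it gives 10 and 17, in agreement with the figures quoted above.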

To provide experience in constructing histograms, we have developed an interactive demonstration. The demonstration reveals the consequences of different choices of bin width and of lower boundary for the first interval.

Questions

Question 1 out of 2.

When discussing a histogram, what does "bin width" mean?

  • The result of subtracting the smallest observation from the largest.
  • The average number of scores falling into each interval.
  • The amount of "skew" shown by the distribution.
  • The result of dividing the range of scores by the number of classes.


Question 2 out of 2.

If you apply Sturges' rule to a data set including 4000 observations, how many intervals should your histogram have?

  • The top of the box.
  • The base 2 logarithm of 4096 (which is close to 4000) is 13, so Sturges' rule recommends 13 + 1 or 14 intervals.
  • We should have 11 intervals because this is 1 less than the base 2 logarithm of 4000.
  • According to Sturges' rule, there should be 13 intervals because the base 2 logarithm of 4096 (which is close to 4000) is 12.

Answers

  1. The result of dividing the range of scores by the number of classes.
    Bin width is another name for the width of each class interval.

  2. According to Sturges' rule, there should be 13 intervals because the base 2 logarithm of 4096 (which is close to 4000) is 12.
    Thirteen intervals. Sturges' rule sets the number of intervals as close as possible to 1 + Log2(N), where Log2(N) is the base 2 log of the number of observations.

Frequency Polygons

Learning Objectives

  1. Create and interpret frequency polygons
  2. Create and interpret cumulative frequency polygons
  3. Create and interpret overlaid frequency polygons

Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms, but are especially helpful for comparing sets of data. Frequency polygons are also a good choice for displaying cumulative frequency distributions.

To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides.

The frequency polygon for the 642 psychology test scores shown in Figure 1 was constructed from the frequency table shown in Table 1.

Table 1. Frequency Distribution of Psychology Test Scores.

Lower Limit    Upper Limit    Count    Cumulative Count
    29.5           39.5          0              0
    39.5           49.5          3              3
    49.5           59.5         10             13
    59.5           69.5         53             66
    69.5           79.5        107            173
    79.5           89.5        147            320
    89.5           99.5        130            450
    99.5          109.5         78            528
   109.5          119.5         59            587
   119.5          129.5         36            623
   129.5          139.5         11            634
   139.5          149.5          6            640
   149.5          159.5          1            641
   159.5          169.5          1            642
   169.5          179.5          0            642

The first label on the X-axis is 35. This represents an interval extending from 29.5 to 39.5. Since the lowest test score is 46, this interval has a frequency of 0. The point labeled 45 represents the interval from 39.5 to 49.5. There are three scores in this interval. There are 147 scores in the interval that surrounds 85.

You can easily discern the shape of the distribution from Figure 1. Most of the scores are between 65 and 115. It is clear that the distribution is not symmetric inasmuch as good scores (to the right) trail off more gradually than poor scores (to the left). In the terminology of Chapter 3 (where we will study shapes of distributions more systematically), the distribution is skewed.

Figure 1. Frequency polygon for the psychology test scores.

A cumulative frequency polygon for the same test scores is shown in Figure 2. The graph is the same as before except that the Y value for each point is the number of students in the corresponding class interval plus all numbers in lower intervals. For example, there are no scores in the interval labeled "35," three in the interval "45," and 10 in the interval "55". Therefore, the Y value corresponding to "55" is 13. Since 642 students took the test, the cumulative frequency for the last interval is 642.

Figure 2. Cumulative frequency polygon for the psychology test scores.
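Because the counts are given in Table 1, both polygons can be sketched in R without the raw data file. The sketch below uses our own variable names and takes the axis labels 35, 45, ..., 175 from the description above.

```r
# Counts from Table 1, including the empty intervals at both ends
counts <- c(0, 3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1, 0)
labels <- seq(35, 175, by = 10)     # X-axis label for each class interval

# Running total: each interval's count plus all counts in lower intervals
cum <- cumsum(counts)

# Frequency polygon (Figure 1 style): points joined by lines
plot(labels, counts, type = "b", pch = 16,
     xlab = "Test Score", ylab = "Frequency")

# Cumulative frequency polygon (Figure 2 style)
plot(labels, cum, type = "b", pch = 16,
     xlab = "Test Score", ylab = "Cumulative Frequency")
```

The third cumulative value is 13 (0 + 3 + 10) and the last is 642, matching the Cumulative Count column of Table 1.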

Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets. Figure 3 provides an example. The data come from a task in which the goal is to move a computer cursor to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The two distributions (one for each target) are plotted together in Figure 3. The figure shows that, although there is some overlap in times, it generally took longer to move the cursor to the small target than to the large one.

Figure 3. Overlaid frequency polygons.

It is also possible to plot two cumulative frequency distributions in the same graph. This is illustrated in Figure 4 using the same data from the cursor task. The difference in distributions for the two targets is again evident.

Figure 4. Overlaid cumulative frequency polygons.

Note that the graphs on this page were not created in R. However, the R code shown here produces very similar graphs. Make sure to put the data files in the default directory.


R code written by David Scott



# Figure 1

tests = read.csv(file = 'psych_scores.csv')
bk = seq(40,170,10) # bin edges used for counting
tk = seq(35,175,10) # x positions: one per class interval, plus an empty interval on each side
# right=FALSE counts whole-number scores 40-49, 50-59, ..., which matches Table 1
nuk = c( 0, hist( tests[[1]], breaks=bk, right=FALSE, plot=FALSE )$counts, 0 )
main="Frequency polygon for the psychology test scores"
plot(tk,nuk,type="l",col=4,xlab="Test Score",ylab="Frequency",lwd=2,main=main,ylim=c(0,160))
points(tk,nuk,pch=16,col=4,cex=1.5); abline(h=seq(0,160,20),lwd=.5)

# Figure 2
tests = read.csv(file = 'psych_scores.csv')
cum.nuk = cumsum(nuk)
main="Cumulative frequency polygon for the psychology test scores"
plot(tk,cum.nuk,type="l",col=4,xlab="Test Score",ylab="Cumulative Frequency", lwd=2,main=main,ylim=c(0,700))
points(tk,cum.nuk,pch=16,col=4,cex=1.5); abline(h=seq(0,700,100),lwd=.5)

# Figure 3
target = read.csv(file = 'target_size.csv')
bk = seq(400,1100,100) # bin edges used for counting
tk = seq(350,1150,100) # x positions: one per class interval, plus an empty interval on each side
dat = target[[2]] # first 20 values: small target; last 20: large target
nuk1 = c( 0, hist( dat[ 1:20], breaks=bk, plot=FALSE )$counts, 0 )
nuk2 = c( 0, hist( dat[21:40], breaks=bk, plot=FALSE )$counts, 0 )
main="Overlaid Frequency polygons"
plot(tk,nuk1,type="l",col=2,xlab="Time (msec)",ylab="Frequency",lwd=2,main=main,ylim=c(0,10))
points(tk,nuk1,pch=16,col=2,cex=2); abline(h=seq(0,10,2.5),lwd=.5,lty=2)
lines(tk,nuk2,col=4); points(tk,nuk2,pch=16,cex=2,col=4)
text(1000,4,"small target",cex=1.5)
text(720,8,"large target",cex=1.5)


# Figure 4
target = read.csv(file = 'target_size.csv')
cum.nuk1 = cumsum(nuk1)
cum.nuk2 = cumsum(nuk2)
main="Overlaid cumulative frequency polygons"
plot(tk,cum.nuk1,type="l",col=2,xlab="Time (msec)", ylab="Cumulative Frequency", lwd=2,main=main,ylim=c(0,20))
points(tk,cum.nuk1,pch=16,col=2,cex=2); abline(h=seq(0,20,5))
lines(tk,cum.nuk2,col=4,lwd=2); points(tk,cum.nuk2,pch=16,col=4,cex=2)
text(850,12,"small target",cex=1.5)
text(450,18,"large target",cex=1.5)

Questions

Question 1 out of 3.

A frequency polygon is very similar to a

  • histogram
  • stem and leaf display
  • listing of raw data


Question 2 out of 3.

Check all that apply. Frequency polygons are better than histograms for:

  • showing the shape of the distribution
  • comparing distributions
  • revealing the exact values in a distribution


Question 3 out of 3.

Response times were generally shorter to the


  • large target
  • small target

Answers

  1. histogram
    Frequency polygons do not list the raw data, as stem and leaf plots do. Frequency polygons are very similar to histograms, except histograms have bars and frequency polygons have dots and lines connecting the frequencies of each class interval.

  2. comparing distributions
    Frequency polygons are better at comparing distributions because two frequency polygons can be displayed in the same graph without obscuring each other. Both histograms and frequency polygons show the shape of the distribution. Neither necessarily reveals the exact values in a distribution.

  3. large target
    Almost all of the times for the large target are below 750 msec., whereas there are many longer times for the small target.

Box Plots

Learning Objectives

  1. Define basic terms including hinges, H-spread, step, adjacent value, outside value, and far out value
  2. Create a box plot
  3. Create parallel box plots
  4. Determine whether a box plot is appropriate for a given data set

We have already discussed techniques for visually representing data (see histograms and frequency polygons). In this section, we present another important graph called a box plot. Box plots are useful for identifying outliers and for comparing distributions. We will explain box plots with the help of data from an in-class experiment. As part of the "Stroop Interference Case Study," students in introductory statistics were presented with a page containing 30 colored rectangles. Their task was to name the colors as quickly as possible. Their times (in seconds) were recorded. We'll compare the scores for the 16 men and 31 women who participated in the experiment by making separate box plots for each gender. Such a display is said to involve parallel box plots.

There are several steps in constructing a box plot. The first relies on the 25th, 50th, and 75th percentiles in the distribution of scores. Figure 1 shows how these three statistics are used. For each gender, we draw a box extending from the 25th percentile to the 75th percentile. The 50th percentile is drawn inside the box. Therefore, the bottom of each box is the 25th percentile, the top is the 75th percentile, and the line in the middle is the 50th percentile.

The data for the women in our sample are shown in Table 1.

Table 1. Women's times.

14, 15, 16, 16, 17, 17, 17, 17,
17, 18, 18, 18, 18, 18, 18, 19,
19, 19, 20, 20, 20, 20, 20, 20,
21, 21, 22, 23, 24, 24, 29


For these data, the 25th percentile is 17, the 50th percentile is 19, and the 75th percentile is 20. For the men (whose data are not shown), the 25th percentile is 19, the 50th percentile is 22.5, and the 75th percentile is 25.5.

Figure 1. The first step in creating box plots.


Before proceeding, the terminology in Table 2 is helpful.

Table 2. Box plot terms and values for women's times.

Name                 Formula                                                        Value
Upper Hinge          75th Percentile                                                20
Lower Hinge          25th Percentile                                                17
H-Spread             Upper Hinge - Lower Hinge                                      3
Step                 1.5 x H-Spread                                                 4.5
Upper Inner Fence    Upper Hinge + 1 Step                                           24.5
Lower Inner Fence    Lower Hinge - 1 Step                                           12.5
Upper Outer Fence    Upper Hinge + 2 Steps                                          29
Lower Outer Fence    Lower Hinge - 2 Steps                                          8
Upper Adjacent       Largest value below Upper Inner Fence                          24
Lower Adjacent       Smallest value above Lower Inner Fence                         14
Outside Value        A value beyond an Inner Fence but not beyond an Outer Fence    29
Far Out Value        A value beyond an Outer Fence                                  None
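The values in Table 2 can be recomputed from the women's times in Table 1. The sketch below uses our own variable names and takes the hinges from base R's `fivenum()`, which for these data coincide with the 25th and 75th percentiles. For other data sets the two definitions can differ slightly.

```r
# Women's times from Table 1 (seconds)
women <- c(14, 15, 16, 16, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18,
           19, 19, 19, 20, 20, 20, 20, 20, 20, 21, 21, 22, 23, 24, 24, 29)

hinges      <- fivenum(women)[c(2, 4)]   # lower and upper hinge
lower_hinge <- hinges[1]
upper_hinge <- hinges[2]

h_spread <- upper_hinge - lower_hinge
step     <- 1.5 * h_spread

inner <- c(lower_hinge - step,     upper_hinge + step)       # inner fences
outer <- c(lower_hinge - 2 * step, upper_hinge + 2 * step)   # outer fences

# Adjacent values: the most extreme observations inside the inner fences
lower_adjacent <- min(women[women > inner[1]])
upper_adjacent <- max(women[women < inner[2]])

# Outside values: beyond an inner fence but not beyond an outer fence
outside <- women[(women < inner[1] & women >= outer[1]) |
                 (women > inner[2] & women <= outer[2])]

# Far out values: beyond an outer fence
far_out <- women[women < outer[1] | women > outer[2]]

print(c(h_spread = h_spread, step = step,
        lower_adjacent = lower_adjacent, upper_adjacent = upper_adjacent))
print(inner)
print(outer)
print(outside)
```

The single time of 29 seconds equals the upper outer fence, so it counts as an outside value rather than a far out value.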


Continuing with the box plots, we put "whiskers" above and below each box to give additional information about the spread of the data. Whiskers are vertical lines that end in a horizontal stroke. Whiskers are drawn from the upper and lower hinges to the upper and lower adjacent values (24 and 14 for the women's data).

Figure 2. The box plots with the whiskers drawn.

 

Although we don't draw whiskers all the way to outside or far out values, we still wish to represent them in our box plots. This is achieved by adding additional marks beyond the whiskers. Specifically, outside values are indicated by small "o's" and far out values are indicated by asterisks (*). In our data, there are no far out values and just one outside value. This outside value of 29 is for the women and is shown in Figure 3.

Figure 3. The box plots with the outside value shown.

There is one more mark to include in box plots (although sometimes it is omitted). We indicate the mean score for a group by inserting a plus sign. Figure 4 shows the result of adding means to our box plots.

Figure 4. The completed box plots.
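A box plot of the women's times, with the mean added as a plus sign, can be sketched in base R as follows. The men's data are not shown in this section, so only one box is drawn. Note that `boxplot()` draws every point beyond the whiskers with the same symbol and does not distinguish outside values from far out values.

```r
# Women's times from Table 1 (seconds)
women <- c(14, 15, 16, 16, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18,
           19, 19, 19, 20, 20, 20, 20, 20, 20, 21, 21, 22, 23, 24, 24, 29)

# Summary used by boxplot(): whisker ends, hinges, and median,
# with whiskers limited to 1.5 H-spreads from the hinges (the default)
s <- boxplot.stats(women)
print(s$stats)   # lower adjacent, lower hinge, median, upper hinge, upper adjacent
print(s$out)     # points drawn individually beyond the whiskers

boxplot(women, ylab = "Time (seconds)")
points(1, mean(women), pch = 3)   # mark the mean with a plus sign
```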

Figure 4 provides a revealing summary of the data. Since half the scores in a distribution are between the hinges (recall that the hinges are the 25th and 75th percentiles), we see that half the women's times are between 17 and 20 seconds, whereas half the men's times are between 19 and 25.5. We also see that women generally named the colors faster than the men did, although one woman was slower than almost all of the men. Figure 5 shows the box plot for the women's data with detailed labels.

Figure 5. The box plot for the women's data with detailed labels.

Box plots provide basic information about a distribution. For example, a distribution with a positive skew would have a longer whisker in the positive direction than in the negative direction. A larger mean than median would also indicate a positive skew. Box plots are good at portraying extreme values and are especially good at showing differences between distributions. However, many of the details of a distribution are not revealed in a box plot, and to examine these details one should create a histogram and/or a stem and leaf display.


Here are some other examples of box plots:

Time to move the mouse over a target

The data come from a task in which the goal is to move a computer mouse to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The box plots of the two distributions are shown below. You can see that although there is some overlap in times, it generally took longer to move the mouse to the small target than to the large one.


Draft lottery

In 1969 the war in Vietnam was at its height. An agency called the Selective Service was charged with finding a fair procedure to determine which young men would be conscripted ("drafted") into the U.S. military. The procedure was supposed to be fair in the sense of not favoring any culturally or economically defined subgroup of American men. It was decided that choosing "draftees" solely on the basis of a person’s birth date would be fair. A birthday lottery was thus devised. Pieces of paper representing the 366 days of the year (including February 29) were placed in plastic capsules, poured into a rotating drum, and then selected one at a time. The lower the draft number, the sooner the person would be drafted. Men with high enough numbers were not drafted at all.

The first number selected was 258, which meant that someone born on the 258th day of the year (September 14th) would be among the first to be drafted. The second number was 115, so someone born on the 115th day (April 24th) was among the second group to be drafted. All 366 birth dates were assigned draft numbers in this way.

To create box plots, we divided the 366 days of the year into thirds. The first third goes from January 1 to May 1, the second from May 2 to August 31, and the last from September 1 to December 31. The three groups of birth dates yield three groups of draft numbers. The draft number for each birthday is the order it was picked in the drawing. The figure below contains box plots of the three sets of draft numbers. As you can see, people born later in the year tended to have lower draft numbers.
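
The grouping into thirds can be sketched in R as follows. The day numbering assumes a leap year (so May 1 is day 122 and August 31 is day 244). The draft numbers here are HYPOTHETICAL: a random permutation stands in for the actual 1969 results, which are not reproduced in this sketch.

```r
# Day of the year, 1..366 (leap-year numbering, February 29 is day 60)
day <- 1:366

# Thirds as described in the text: Jan 1-May 1, May 2-Aug 31, Sep 1-Dec 31
third <- cut(day, breaks = c(0, 122, 244, 366),
             labels = c("Jan 1-May 1", "May 2-Aug 31", "Sep 1-Dec 31"))

# HYPOTHETICAL draft numbers: a random ordering of 1..366,
# used only as a stand-in for the real lottery results
set.seed(1)
draft <- sample(366)

# One box per third of the year
boxplot(draft ~ third, xlab = "Birth date", ylab = "Draft number")
```

Each third contains 122 days. Because the stand-in numbers are a random permutation, the three boxes would be expected to look roughly alike, unlike the pattern in the actual lottery described above.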



Variations on box plots

Statistical analysis programs may offer options on how box plots are created. For example, the box plots in Figure 6 are constructed from our data but differ from the previous box plots in several ways.

  1. Outliers are not marked.
  2. The means are indicated by green lines rather than plus signs.
  3. The mean of all scores is indicated by a gray line.
  4. Individual scores are represented by dots. Since the scores have been rounded to the nearest second, any given dot might represent more than one score.
  5. The box for the women is wider than the box for the men because the widths of the boxes are proportional to the number of subjects of each gender (31 women and 16 men).

Figure 6. Box plots showing the individual scores and the means.

Each dot in Figure 6 represents a group of subjects with the same score (rounded to the nearest second). An alternative graphing technique is to jitter the points. This means spreading out different dots at the same horizontal position, one dot for each subject. The exact horizontal position of a dot is determined randomly (under the constraint that different dots don't overlap exactly). Spreading out the dots helps you to see multiple occurrences of a given score. However, depending on the dot size and the screen resolution, some points may be obscured even if the points are jittered. Figure 7 shows what jittering looks like.

Figure 7. Box plots with the individual scores jittered.
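
A jittered display along these lines can be sketched in R with `stripchart()`. The scores below are HYPOTHETICAL (randomly generated and rounded to whole seconds); only the group sizes, 31 women and 16 men, come from the text.

```r
# HYPOTHETICAL rounded times (illustrative only); group sizes follow the text
set.seed(2)
time <- c(round(rnorm(31, mean = 19, sd = 3)),
          round(rnorm(16, mean = 22, sd = 4)))
gender <- factor(rep(c("F", "M"), times = c(31, 16)))

# Draw the boxes without outlier marks, since every score is shown as a dot
boxplot(time ~ gender, outline = FALSE, ylim = range(time),
        xlab = "Gender", ylab = "Time")

# Overlay one dot per subject, with a small random horizontal offset
stripchart(time ~ gender, vertical = TRUE, method = "jitter",
           jitter = 0.15, pch = 16, add = TRUE)
```

The `jitter` argument controls how far the dots are spread. A larger value separates tied scores more clearly but moves the dots farther from the center of their group.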

Different styles of box plots are best for different situations, and there are no firm rules for which to use. When exploring your data, you should try several ways of visualizing them. Which graphs you include in your report should depend on how well different graphs reveal the aspects of the data you consider most important.


R code

Note that the graph on this page was not created in R. However, the R code shown here produces a very similar graph. Make sure to put the data file in R's working directory.

# Figure 4
# Data file for Figures 3
stroop = read.csv(file = "stroop.csv")
# Recode gender from 1/2 to "F"/"M"
g = stroop$gender
g1 = replace(g, g == 1, "F")
g2 = replace(g1, g1 == 2, "M")
# One box per gender
boxplot(stroop$colors ~ g2, xlab = "Gender", ylab = "Time")
# Mark each group's mean on its box
means = tapply(stroop$colors, g2, mean)
points(means, col = "red", pch = 18)

Questions

Question 1 out of 6.

What is the upper hinge?


  • B
  • C
  • D
  • F


Question 2 out of 6.
What is the median?

  • D
  • E
  • F
  • G


Question 3 out of 6.
C is the


  • upper adjacent value
  • balance line
  • H-spread
  • outside line


Question 4 out of 6.
The H-spread is:


  • C-H
  • D-G
  • E-F


Question 5 out of 6.
Which of the following is/are true?


  • The median is higher than the mean.
  • There is one far out value.
  • The lowest value is H.
  • There is an outside value, but no far out value.
  • The highest value is C.


Question 6 out of 6.

Box plots are preferable to stem and leaf displays when

  • there is a large amount of data.
  • it is important to show individual data values.
  • three or more groups are to be compared.
  • the distribution is skewed.

Answers

  1. D: The upper hinge is the 75th percentile. It is the top of the box.

  2. F: The median is represented by a line through the middle of the box.

  3. Upper adjacent value. It is the largest value below the upper inner fence which is one step above the upper hinge.

  4. D-G: The H-spread is the difference between the upper hinge (D) and the lower hinge (G).

  5. The lowest value is H. There is an outside value, but no far out value.

  6. The two correct choices are (1) there is a large amount of data and (3) three or more groups are to be compared. Stem and leaf displays show individual values and skew well.

Bar Charts

Learning Objectives

  1. Create and interpret bar charts
  2. Judge whether a bar chart or another graph such as a box plot would be more appropriate

In the section on qualitative variables, we saw how bar charts could be used to illustrate the frequencies of different categories. For example, the bar chart shown in Figure 1 shows how many purchasers of iMac computers were previous Macintosh users, previous Windows users, and new computer purchasers.

Figure 1. iMac buyers as a function of previous computer ownership.

In this section, we show how bar charts can be used to present other kinds of quantitative information, not just frequency counts. The bar chart in Figure 2 shows the percent increases in the Dow Jones, Standard & Poor's 500 (S & P), and Nasdaq stock indexes from May 24th 2000 to May 24th 2001. Notice that both the S & P and the Nasdaq had "negative increases," which means that they decreased in value. In this bar chart, the Y-axis shows not frequency but a signed quantity, the percentage increase.

Figure 2. Percent increase in three stock indexes from May 24th 2000 to May 24th 2001.

Bar charts are particularly effective for showing change over time. Figure 3, for example, shows the percent increase in the Consumer Price Index (CPI) over four three-month periods. The fluctuation in inflation is apparent in the graph.

Figure 3. Percent change in the CPI over time. Each bar represents percent increase for the three months ending at the date indicated.

Bar charts are often used to compare the means of different experimental conditions. Figure 4 shows the mean time it took one of us (DL) to move the mouse to either a small target or a large target. On average, more time was required for small targets than for large ones.

 

Figure 4. Bar chart showing the means for the two conditions.


Although bar charts can display means, we do not recommend them for this purpose. Box plots should be used instead since they provide more information than bar charts without taking up more space. For example, a box plot of the mouse-movement data is shown in Figure 5. You can see that Figure 5 reveals more about the distribution of movement times than does Figure 4.

Figure 5. Box plots of times to move the mouse to the small and large targets.
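
The contrast between the two displays can be sketched in R by drawing both from the same data. The movement times below are HYPOTHETICAL (randomly generated, in arbitrary units) and are not DL's data; only the design, 20 trials per target size, follows the text.

```r
# HYPOTHETICAL movement times (illustrative only), 20 trials per target size
set.seed(3)
time <- c(rnorm(20, mean = 800, sd = 150),
          rnorm(20, mean = 600, sd = 100))
target <- factor(rep(c("Small", "Large"), each = 20),
                 levels = c("Small", "Large"))

# Mean time for each condition
means <- tapply(time, target, mean)

op <- par(mfrow = c(1, 2))  # two panels side by side

# Bar chart: shows only the two means
barplot(means, xlab = "Target", ylab = "Mean time")

# Box plot: shows hinges, medians, whiskers, and any outside values,
# with plus signs added for the means
boxplot(time ~ target, xlab = "Target", ylab = "Time")
points(seq_along(means), means, pch = 3)

par(op)  # restore the previous plot layout
```

Placing the panels side by side makes the point concrete: the bar chart conveys two numbers, whereas the box plot conveys the spread and shape of each distribution in the same amount of space.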


The section on qualitative variables presented earlier in this chapter discussed the use of bar charts for comparing distributions. Some common graphical mistakes were also noted. The earlier discussion applies equally well to the use of bar charts to display quantitative variables.

 

R code
Note that the graph on this page was not created in R. However, the R code shown here produces a very similar graph.
# Figure 3
cpi = c(3.8,2.8,4.2,2.5)
M <- c("July 2000", "October 2000", "January 2001","April 2001")
barplot(cpi,names.arg=M,xlab="Date",ylab="CPI % Increase",col="blue")

Questions

Question 1 out of 2.

Bar charts can only be used for qualitative variables.

  • True
  • False


Question 2 out of 2.

Although bar charts are often used to show means, this text recommends that _________ be used instead.

  • stem and leaf plots
  • frequency tables
  • histograms
  • box plots

Answers

  1. False
    Although bar charts can be used for qualitative variables, they can also portray quantitative variables.

  2. box plots
    Box plots contain more information and take up no more space. They are therefore preferable to bar charts.

Line Graphs

Learning Objectives

  1. Create and interpret line graphs
  2. Judge whether a line graph would be appropriate for a given data set

A line graph is a bar graph with the tops of the bars represented by points joined by lines (the rest of the bar is suppressed). For example, Figure 1 was presented in the section on bar charts and shows changes in the Consumer Price Index (CPI) over time.

Figure 1. A bar chart of the percent change in the CPI over time. Each bar represents percent increase for the three months ending at the date indicated.

A line graph of these same data is shown in Figure 2. Although the figures are similar, the line graph emphasizes the change from period to period.

Figure 2. A line graph of the percent change in the CPI over time. Each point represents percent increase for the three months ending at the date indicated.
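
A line graph of this kind can be sketched in R by plotting points joined by lines instead of bars. The CPI values and date labels are the ones used in the R code of the bar charts section; the variable names `dates` and `period_change` are chosen here for illustration.

```r
# CPI percent increases and their dates, as in the bar chart code
cpi <- c(3.8, 2.8, 4.2, 2.5)
dates <- c("July 2000", "October 2000", "January 2001", "April 2001")

# type = "o" draws points joined by lines; xaxt = "n" suppresses the
# default numeric X-axis so the dates can be used as labels
plot(cpi, type = "o", xaxt = "n", xlab = "Date", ylab = "CPI % Increase",
     ylim = c(0, max(cpi)))
axis(1, at = seq_along(cpi), labels = dates)

# The change from each period to the next, which the line segments emphasize
period_change <- diff(cpi)
```

The slopes of the line segments correspond to the values in `period_change`, which is why the line graph draws attention to period-to-period change.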


Line graphs are appropriate only when both the X- and Y-axes display ordered (rather than qualitative) variables. Although bar graphs can also be used in this situation, line graphs are generally better at comparing changes over time. Figure 3, for example, shows percent increases and decreases in five components of the Consumer Price Index (CPI). The figure makes it easy to see that medical costs had a steadier progression than the other components. Although you could create an analogous bar chart, its interpretation would not be as easy.

Figure 3. A line graph of the percent change in five components of the CPI over time.

Let us stress that it is misleading to use a line graph when the X-axis contains merely qualitative variables. Figure 4 inappropriately shows a line graph of the card game data from Yahoo, discussed in the section on qualitative variables. The defect in Figure 4 is that it gives the false impression that the games are naturally ordered in a numerical way.

Figure 4. A line graph, inappropriately used, depicting the number of people playing different card games on Sunday and Wednesday.

R code

Note that the graphs on this page were not created in R. However, the R code shown here produces a very similar graph.

# Figure 3
food = c(4.1, 2.4, 2.6, 3.6)
housing = c(4.9, 4.3, 6.7, 2.1)
medical = c(4.2, 4.5, 4.8, 5.3)
rec = c(3.3, 0.8, 1.2, 3.5)
tran = c(4.8, 0, 2.3, 1.6)

# Draw housing first (without the default X-axis), then add the other components
plot(housing, type = "o", xaxt = "n", col = "purple", xlab = "Date", ylab = "CPI % Increase", ylim = c(0, 7))
lines(food, type = "o", col = "blue")
lines(medical, type = "o", col = "green")
lines(rec, type = "o", col = "red")
lines(tran, type = "o", col = "black")

# Legend colors are listed in the same order as the labels so they match the lines
legend("topleft",
       legend = c("housing", "food", "medical", "recreation", "transportation"),
       col = c("purple", "blue", "green", "red", "black"),
       lty = 1, lwd = 2)
axis(1, at = 1:4, labels = c("July 2000", "October 2000", "January 2001", "April 2001"))

Questions

Question 1 out of 2.

Line graphs are most similar to

  • bar charts.
  • histograms.
  • stem and leaf displays.
  • frequency polygons.


Question 2 out of 2.

Line graphs should be avoided when

  • there are more than 10 values on the X-axis.
  • the variable on the X-axis is a qualitative variable.
  • data from more than 3 groups are compared.

Answers

  1. bar charts
    A line graph is a bar graph with the tops of the bars represented by points joined by lines.

  2. the variable on the X-axis is a qualitative variable.
    Line graphs should not be used with a qualitative variable because line graphs are designed to show trends.