Introduction to Statistical Analysis
Site: Saylor Academy
Course: BUS610: Advanced Business Intelligence and Analytics
Book: Introduction to Statistical Analysis
Date: Monday, 19 May 2025, 3:41 PM
Description
Read this overview of statistical analysis. While it is a form of analysis used by intelligence analysts, what differentiates us is the types of data we often find ourselves using. Sometimes it is detailed internal data, already neatly cleaned and standardized, but more often it is qualitative data that we have to normalize ourselves.
Table of contents
- 1. Why Do We Need to Study Statistical Analysis as Part of a Business Program?
- 2. Measuring Data
- 3. Measures of Spread and Data
- 3.1 Variance and Standard Deviation
- 3.2 Descriptive Statistics
- 3.2.1 Display Data
- 3.2.2 Measures of the Location of the Data
- 3.2.3 Measures of the Center of the Data
- 3.2.4 Sigma Notation and Calculating the Arithmetic Mean
- 3.2.5 Geometric Mean
- 3.2.6 Skewness and the Mean, Median, and Mode
- 3.2.7 Measures of the Spread of the Data
- 4. Spreadsheet Exercises: Measures of Central Tendency and Spread
- 5. Spreadsheet Exercises: Graphs of Histograms and Frequency Tables
1. Why Do We Need to Study Statistical Analysis as Part of a Business Program?
You may be wondering why you should take a course in Statistical Analysis as part of a business program. After all, you are not pursuing this major in order to become a mathematician! Won't there be other people you can rely on to help you make sense of numbers? Maybe you will be lucky enough to have someone working with you who can do this for you.
In the business world, you will be faced with many decisions to make every day. In many cases, these decisions will not be made in a vacuum; you will have information available to you that is relevant to the decision. In the future, you may find yourself in a position to make a key decision for your business. But will you know how to use the information available? Will you know the right questions to ask?
Consider the following scenarios:
- You manage a portfolio of investments and are presented with a research report that shows that over the past five years, technology stocks have outperformed utility stocks by 13.7% to 10.1%. Is this a significantly higher return or just the results of random chance? Should you think about investing in technology stocks? What questions would you ask about this research? In this course, you will learn how to compare two values against each other and determine if they differ in any significant way.
- You manage a line of consumer products and want to change your marketing campaign. You decide that you want to test a new campaign against the existing one to see if the new one is any better. How would you design the test? What sort of data would you collect? How much better does the new campaign need to be to justify the change in marketing plans? In this course, you will learn how to design research and how to determine if the results are significant.
- An online research survey shows you that 57% of your competitor's customers would switch to your product if you offered a coupon. Should you create a coupon? What factors would you consider? Can you trust an online survey? In this course, you will learn how to read survey results and ask the right questions about them.
- You are the manager in a company that finances automobile purchases. What variables should you use to predict the price of next year's cars? Variables such as weight, horsepower, towing capacity, and miles per gallon have all been shown to have an impact on the price. Which one is the most likely predictor of price? Is this even possible? In this course, you will learn how to determine how variables relate to each other and how to make predictions based on them.
- Your company's data warehouse collects sales data and reports it to all the company's directors and vice-presidents on a daily basis. At the next director's meeting, it will be your job to make some recommendations based on these numbers. How will you analyze the data? In this course, you will learn how to read and analyze data sets.
These are just some of the many skills you will learn in this course. While you may not consider yourself to be an expert in Statistical Analysis when you are done, you will have the skills you need to answer these questions and the many others that you will be faced with in your daily business activities.
Source: Saylor Academy. This work is licensed under a Creative Commons Attribution 3.0 License.
2. Measuring Data
2.1 Definitions of Statistics, Probability, and Key Terms
The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.
In this course, you will learn how to organize and summarize data.
Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by graphing and by using numbers (for example, finding an average). After you have studied probability and probability distributions, you will use formal methods for drawing conclusions from "good" data. The formal methods are called inferential statistics. Statistical inference uses probability to determine how confident we can be that our conclusions are correct.
Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful examination of the data. You will encounter what will seem to be too many mathematical formulas for interpreting data. The goal of statistics is not to perform numerous calculations using the formulas, but to gain an understanding of your data. The calculations can be done using a calculator or a computer. The understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more confident in the decisions you make in life.
Probability
Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring. For example, if you toss a fair coin four times, the outcomes may not be two heads and two tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails. The expected theoretical probability of heads in any one toss is \(\frac{1}{2}\) or 0.5. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when there are many repetitions. After reading about the English statistician Karl Pearson, who tossed a coin 24,000 times with a result of 12,012 heads, one of the authors tossed a coin 2,000 times. The results were 996 heads. The fraction \(\frac{996}{2000}\) is equal to 0.498, which is very close to 0.5, the expected probability.
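This long-run regularity is easy to reproduce with a short simulation. The sketch below uses Python's `random` module; the function name is ours, and the exact fraction depends on the seed, so treat the numbers as illustrative rather than as the text's own results.

```python
import random

def heads_fraction(tosses: int, seed: int = 42) -> float:
    """Toss a fair coin `tosses` times and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

# A handful of tosses is erratic; many tosses settle near 0.5.
few = heads_fraction(4)
many = heads_fraction(24_000)
```

Running this with different seeds mirrors the Pearson anecdote: the four-toss fraction bounces around, while the 24,000-toss fraction stays very close to 0.5.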
The theory of probability began with the study of games of chance such as poker. Predictions take the form of probabilities. To predict the likelihood of an earthquake, of rain, or whether you will get an A in this course, we use probabilities. Doctors use probability to determine the chance of a vaccination causing the disease the vaccination is supposed to prevent. A stockbroker uses probability to determine the rate of return on a client's investments. You might use probability to decide to buy a lottery ticket or not. In your study of statistics, you will use the power of mathematics through probability calculations to analyze and interpret your data.
Key Terms
In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. If you wished to compute the overall grade point average at your school, it would make sense to select a sample of students who attend the school. The data collected from the sample would be the students' grade point averages. In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated drinks take samples to determine if a 16-ounce can contains 16 ounces of carbonated drink.
From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample. For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic. The statistic is an estimate of a population parameter, in this case the mean. A parameter is a numerical characteristic of the whole population that can be estimated by a statistic. Since we considered all math classes to be the population, the average number of points earned per student over all the math classes is an example of a parameter.
One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population. The sample must contain the characteristics of the population in order to be a representative sample. We are interested in both the sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the sample statistic to test the validity of the established population parameter.
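A small simulation makes the statistic-versus-parameter distinction concrete. The population below is hypothetical, generated purely for illustration: the population mean is the fixed parameter, and the mean of a random sample is a statistic that estimates it.

```python
import random
from statistics import mean

# Hypothetical population: end-of-term points for every student in all math classes.
rng = random.Random(0)
population = [rng.gauss(75, 10) for _ in range(10_000)]
parameter = mean(population)          # population mean: a fixed characteristic

sample = rng.sample(population, 100)  # one "math class" drawn at random
statistic = mean(sample)              # sample mean: an estimate of the parameter
```

Re-running the sampling step gives a slightly different statistic each time, while the parameter never changes; that gap between the two is exactly what inferential statistics quantifies.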
A variable, or random variable, usually notated by capital letters such as \(x\) and \(y\), is a characteristic or measurement that can be determined for each member of a population. Variables may be numerical or categorical. Numerical variables take on values with equal units such as weight in pounds and time in hours. Categorical variables place the person or thing into a category. If we let \(x\) equal the number of points earned by one math student at the end of a term, then \(x\) is a numerical variable. If we let \(y\) be a person's party affiliation, then some examples of \(y\) include Republican, Democratic, and Independent. \(y\) is a categorical variable. We could do some math with values of \(x\) (calculate the average number of points earned, for example), but it makes no sense to do math with values of \(y\) (calculating an average party affiliation makes no sense).
Data are the actual values of the variable. They may be numbers or they may be words. Datum is a single value.
Two words that come up often in statistics are mean and proportion. If you were to take three exams in your math classes and obtain scores of 86, 75, and 92, you would calculate your mean score by adding the three exam scores and dividing by three (your mean score would be 84.3 to one decimal place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men students is \(\frac{22}{40}\) and the proportion of women students is \(\frac{18}{40}\). Mean and proportion are discussed in more detail in later chapters.
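Both calculations are one-liners in Python, using the numbers from the paragraph above:

```python
from statistics import mean

exam_scores = [86, 75, 92]
mean_score = round(mean(exam_scores), 1)   # (86 + 75 + 92) / 3 = 84.3 to one decimal place

men, women, total = 22, 18, 40
proportion_men = men / total               # 22/40 = 0.55
proportion_women = women / total           # 18/40 = 0.45
```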
NOTE: The words "mean" and "average" are often used interchangeably. The substitution of one word for the other is common practice. The technical term is "arithmetic mean," and "average" is technically a center location. However, in practice among non-statisticians, "average" is commonly accepted for "arithmetic mean".
EXAMPLE 1.1
Determine what the key terms refer to in the following study. We want to know the average (mean) amount of money first year college students spend at ABC College on school supplies that do not include books. We randomly surveyed 100 first year students
at the college. Three of those students spent $150, $200, and $225, respectively.
Solution 1.1
The population is all first year students attending ABC College this term.
The sample could be all students enrolled in one section of a beginning statistics course at ABC College (although this sample may not represent the entire population).
The parameter is the average (mean) amount of money spent (excluding books) by first year college students at ABC College this term: the population mean.
The statistic is the average (mean) amount of money spent (excluding books) by first year college students in the sample.
The variable could be the amount of money spent (excluding books) by one first year student. Let \(x=\) the amount of money spent (excluding books) by one first year student attending ABC College.
The data are the dollar amounts spent by the first year students. Examples of the data are $150, $200, and $225.
TRY IT 1.1
Determine what the key terms refer to in the following study. We want to know the average (mean) amount of money spent on school uniforms each year by families with children at Knoll Academy. We randomly survey 100 families with children in the school.
Three of the families spent $65, $75, and $95, respectively.
EXAMPLE 1.2
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPA's of students who graduated last year. Fill in the letter of the phrase that best describes each of the items
below.
1. Population ____ 2. Statistic ____ 3. Parameter ____ 4. Sample ____ 5. Variable ____ 6. Data ____
a. all students who attended the college last year
b. the cumulative GPA of one student who graduated from the college last year
c. 3.65, 2.80, 1.50, 3.90
d. a group of students who graduated from the college last year, randomly selected
e. the average cumulative GPA of students who graduated from the college last year
f. all students who graduated from the college last year
g. the average cumulative GPA of students in the study who graduated from the college last year
Solution 1.2
1. f; 2. g; 3. e; 4. d; 5. b; 6. c
EXAMPLE 1.3
Determine what the key terms refer to in the following study. As part of a study designed to test the safety of automobiles, the National Transportation Safety Board collected and reviewed data about the effects of an automobile crash on test dummies.
Here is the criterion they used:
Speed at which cars crashed | Location of "driver" (i.e., dummies) |
---|---|
35 miles/hour | Front Seat |
Table 1.1
Cars with dummies in the front seats were crashed into a wall at a speed of 35 miles per hour. We want to know the proportion of dummies in the driver's seat that would have had head injuries, if they had been actual drivers. We start with a simple random
sample of 75 cars.
Solution 1.3
The population is all cars containing dummies in the front seat.
The sample is the 75 cars, selected by a simple random sample.
The parameter is the proportion of driver dummies (if they had been real people) who would have suffered head injuries in the population.
The statistic is the proportion of driver dummies (if they had been real people) who would have suffered head injuries in the sample.
The variable \(x =\) the number of driver dummies (if they had been real people) who would have suffered head injuries.
The data are either: yes, had head injury, or no, did not.
EXAMPLE 1.4
Determine what the key terms refer to in the following study.
An insurance company would like to determine the proportion of all medical doctors who have been involved in one or more malpractice lawsuits. The company selects 500 doctors at
random from a professional directory and determines the number in the sample who have been involved in a malpractice lawsuit.
Solution 1.4
The population is all medical doctors listed in the professional directory.
The parameter is the proportion of medical doctors who have been involved in one or more malpractice suits in the population.
The sample is the 500 doctors selected at random from the professional directory.
The statistic is the proportion of medical doctors who have been involved in one or more malpractice suits in the sample.
The variable \(x=\) the number of medical doctors who have been involved in one or more malpractice suits.
The data are either: yes, was involved in one or more malpractice lawsuits, or no, was not.
2.2 Kinds of Data in Statistics
Watch the first video from 1:03:00 to the end, and the second video until 54:00. These videos explain the foundations of statistics, what data is, and the various types of data we'll be exploring.
- Video 1: Philip Stark's "Statistics 21 - Lecture 2"
- Video 2: Philip Stark's "Statistics 21 - Lecture 3"

3. Measures of Spread and Data
3.1 Variance and Standard Deviation
- Video 1: Variance of a Population
- Video 2: Sample Variance
- Video 3: Standard Deviation
Source: Khan Academy. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
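The population and sample formulas the videos cover differ only in the divisor: population variance divides by \(N\), while sample variance divides by \(n - 1\) to correct the bias of estimating from a sample. A quick sketch with Python's `statistics` module (the data set here is made up for illustration, not taken from the videos):

```python
from statistics import pvariance, pstdev, variance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data; mean is 5

# Population variance: average squared deviation from the mean (divide by N).
pop_var = pvariance(data)   # 32 / 8 = 4.0
pop_sd = pstdev(data)       # sqrt(4.0) = 2.0

# Sample variance: divide by n - 1 instead of n.
samp_var = variance(data)   # 32 / 7 ≈ 4.571
samp_sd = stdev(data)
```

Note that the sample versions are always a bit larger than the population versions for the same numbers, which is the point of the \(n - 1\) correction.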
3.2 Descriptive Statistics
Introduction
Figure 2.1 When you have large amounts of data, you will need to organize it in a way that makes sense. These ballots from an election are rolled together with similar ballots to keep them organized.
Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the house prices, so you
might ask your real estate agent to give you a sample data set of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at the median price and the variation of prices. The median and variation are just
two ways that you will learn to describe data. Your agent might also provide you with a graph of the data.
In this chapter, you will study numerical and graphical ways to describe and display your data. This area of statistics is called "Descriptive
Statistics". You will learn how to calculate, and even more importantly, how to interpret these measurements and graphs.
A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. A graph
can be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and to enable readers to compare facts and
figures quickly. Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.
Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar graph, the histogram,
the stem-and-leaf plot, the frequency polygon (a type of broken line graph), the pie chart, and the box plot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs, and bar graphs, as well as frequency polygons, and time series
graphs. Our emphasis will be on histograms and box plots.
Source: Openstax, https://openstax.org/books/introductory-business-statistics/pages/2-introduction
3.2.1 Display Data
Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs
One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. For example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.
Example 2.1
For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to largest):
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100
Stem | Leaf |
---|---|
3 | 3 |
4 | 2 9 9 |
5 | 3 5 5 |
6 | 1 3 7 8 8 9 9 |
7 | 2 3 4 8 |
8 | 0 3 8 8 8 |
9 | 0 2 4 4 4 4 6 |
10 | 0 |
Table 2.1 Stem-and-Leaf Graph
Try It 2.1
For the Park City basketball team, scores for the last 30 games were as follows (smallest to largest):
32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53; 54; 56; 57; 57; 60; 61
Construct a stem plot for the data.
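One way to build such a stemplot programmatically is to split each value into its stem and leaf with integer division. This sketch handles whole-number data only; for decimal data such as Example 2.2, scale by 10 first so the leaf becomes the tenths digit. The function name is ours.

```python
from collections import defaultdict

def stemplot(data):
    """Return {stem: sorted leaves}, where the leaf is the final digit of each value."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 48 -> stem 4, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

scores = [32, 32, 33, 34, 38, 40, 42, 42, 43, 44, 46, 47, 47, 48, 48, 48, 49,
          50, 50, 51, 52, 52, 52, 53, 54, 56, 57, 57, 60, 61]
for stem, leaves in sorted(stemplot(scores).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Printed this way, each row is one stem followed by its leaves in increasing order, matching the layout of Table 2.1.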
The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.
Example 2.2
The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot using the data:
1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3
Do the data seem to have any concentration of values?
NOTE
The leaves are to the right of the decimal.
Solution 2.2
The value 12.3 may be an outlier. Values appear to concentrate at three and four kilometers.
Stem | Leaf |
---|---|
1 | 1 5 |
2 | 3 5 7 |
3 | 2 3 3 5 8 |
4 | 0 2 5 5 7 8 |
5 | 5 6 |
6 | 5 7 |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | 3 |
Table 2.2
Try It 2.2
The following data show the distances (in miles) from the homes of off-campus statistics students to the college. Create a stem plot using the data and identify any outliers:
0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8; 4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0
Example 2.3
A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the stems. Table 2.4
and Table 2.5 show the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using this data.
Ages at Inauguration | Stem | Ages at Death |
---|---|---|
9 9 8 7 7 7 6 3 2 | 4 | 6 9 |
8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 2 1 1 1 1 1 0 | 5 | 3 6 6 7 7 8 |
9 8 5 4 4 2 1 1 1 0 | 6 | 0 0 3 3 4 4 5 6 7 7 7 8 |
| 7 | 0 0 1 1 1 4 7 8 8 9 |
| 8 | 0 1 3 5 8 |
| 9 | 0 0 3 3 |
Table 2.3
President | Age | President | Age | President | Age |
---|---|---|---|---|---|
Washington | 57 | Lincoln | 52 | Hoover | 54 |
J. Adams | 61 | A. Johnson | 56 | F. Roosevelt | 51 |
Jefferson | 57 | Grant | 46 | Truman | 60 |
Madison | 57 | Hayes | 54 | Eisenhower | 62 |
Monroe | 58 | Garfield | 49 | Kennedy | 43 |
J. Q. Adams | 57 | Arthur | 51 | L. Johnson | 55 |
Jackson | 61 | Cleveland | 47 | Nixon | 56 |
Van Buren | 54 | B. Harrison | 55 | Ford | 61 |
W. H. Harrison | 68 | Cleveland | 55 | Carter | 52 |
Tyler | 51 | McKinley | 54 | Reagan | 69 |
Polk | 49 | T. Roosevelt | 42 | G.H.W. Bush | 64 |
Taylor | 64 | Taft | 51 | Clinton | 47 |
Fillmore | 50 | Wilson | 56 | G. W. Bush | 54 |
Pierce | 48 | Harding | 55 | Obama | 47 |
Buchanan | 65 | Coolidge | 51 | | |
Table 2.4 Presidential Ages at Inauguration
President | Age | President | Age | President | Age |
---|---|---|---|---|---|
Washington | 67 | Lincoln | 56 | Hoover | 90 |
J. Adams | 90 | A. Johnson | 66 | F. Roosevelt | 63 |
Jefferson | 83 | Grant | 63 | Truman | 88 |
Madison | 85 | Hayes | 70 | Eisenhower | 78 |
Monroe | 73 | Garfield | 49 | Kennedy | 46 |
J. Q. Adams | 80 | Arthur | 56 | L. Johnson | 64 |
Jackson | 78 | Cleveland | 71 | Nixon | 81 |
Van Buren | 79 | B. Harrison | 67 | Ford | 93 |
W. H. Harrison | 68 | Cleveland | 71 | Reagan | 93 |
Tyler | 71 | McKinley | 58 | ||
Polk | 53 | T. Roosevelt | 60 | ||
Taylor | 65 | Taft | 72 | ||
Fillmore | 74 | Wilson | 67 | ||
Pierce | 64 | Harding | 57 | ||
Buchanan | 77 | Coolidge | 60 | | |
Table 2.5 Presidential Ages at Death
Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in Example 2.4, the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points. The frequency points are connected using line segments.
Example 2.4
Construct a line graph using the frequency table below.
Number of times teenager is reminded | Frequency |
---|---|
0 | 2 |
1 | 5 |
2 | 8 |
3 | 14 |
4 | 7 |
5 | 4 |

Try It 2.4
Number of times in shop | Frequency |
---|---|
0 | 7 |
1 | 10 |
2 | 14 |
3 | 9 |
Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph shown in Example 2.5 has age groups represented on the x-axis and proportions on the y-axis.
Example 2.5
Construct a bar graph using the data below.
Age groups | Number of Facebook users | Proportion (%) of Facebook users |
---|---|---|
13–25 | 65,082,280 | 45% |
26–44 | 53,300,200 | 36% |
45–64 | 27,885,100 | 19% |

Try It 2.5
Age groups | Number of people | Proportion of population |
---|---|---|
Children | 67,059 | 19% |
Working-age adults | 152,198 | 43% |
Retirees | 131,662 | 38% |
Table 2.9
Example 2.6
Race/ethnicity | AP examinee population | Overall student population |
---|---|---|
1 = Asian, Asian American or Pacific Islander | 10.3% | 5.7% |
2 = Black or African American | 9.0% | 14.7% |
3 = Hispanic or Latino | 17.0% | 17.6% |
4 = American Indian or Alaska Native | 0.6% | 1.1% |
5 = White | 57.1% | 59.2% |
6 = Not reported/other | 6.0% | 1.7% |
Table 2.10

Try It 2.6
District | Registered voter population | Overall city population |
---|---|---|
1 | 15.5% | 19.4% |
2 | 12.2% | 15.6% |
3 | 9.8% | 9.0% |
4 | 17.4% | 18.5% |
5 | 22.8% | 20.7% |
6 | 22.3% | 16.8% |
Table 2.11
Example 2.7
| Dogs | Cats | Fish | Total |
---|---|---|---|---|
Men | 4 | 2 | 2 | 8 |
Women | 4 | 6 | 2 | 12 |
Total | 8 | 8 | 4 | 20 |
Table 2.12
Given these data, calculate the conditional distributions for the subpopulation of men who own each pet type.
Men who own dogs = 4/8 = 0.5
Men who own cats = 2/8 = 0.25
Men who own fish = 2/8 = 0.25
Note: The sum of all of the conditional distributions must equal one. In this case, 0.5 + 0.25 + 0.25 = 1; therefore, the solution "checks".
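The same check can be written in a few lines of Python, using the men's counts from Table 2.12:

```python
# Pet-ownership counts for the men's row of Table 2.12.
men = {"Dogs": 4, "Cats": 2, "Fish": 2}

total_men = sum(men.values())  # 8
conditional = {pet: count / total_men for pet, count in men.items()}
# {'Dogs': 0.5, 'Cats': 0.25, 'Fish': 0.25}

# A conditional distribution must sum to one.
assert abs(sum(conditional.values()) - 1.0) < 1e-9
```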
Histograms, Frequency Polygons, and Time Series Graphs
For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more.
A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data.
The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. (Remember, frequency is defined as the number of times an answer occurs.) If:
- \(f\) = frequency
- \(n\) = total number of data values (or the sum of the individual frequencies), and
- \(RF\) = relative frequency,
then \(RF = \frac{f}{n}\).
For example, if three students in Mr. Ahab's English class of 40 students received scores from 90% to 100%, then \(f = 3\), \(n = 40\), and \(RF = \frac{f}{n} = \frac{3}{40} = 0.075\). So 7.5% of the students received 90–100%. 90–100% are quantitative measures.
To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The number of bars needs to be chosen. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
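The starting-point convention above amounts to subtracting half a unit in the next decimal place from the smallest value. It can be written as a small helper (the function name is ours, not the text's), checked against the text's four examples:

```python
def convenient_start(smallest: float, max_decimal_places: int) -> float:
    """Starting point one decimal place more precise than the data:
    subtract half a unit in the (d+1)-th decimal place from the smallest value."""
    offset = 5 * 10 ** -(max_decimal_places + 1)
    return round(smallest - offset, max_decimal_places + 1)

# The examples from the text:
assert convenient_start(6.1, 1) == 6.05    # 6.1 - 0.05
assert convenient_start(1.5, 2) == 1.495   # 1.5 - 0.005
assert convenient_start(1.0, 3) == 0.9995  # 1.0 - 0.0005
assert convenient_start(2, 0) == 1.5       # 2 - 0.5 (integer data)
```

Because the result carries one more decimal place than any data value, no observation can ever land exactly on this boundary.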
Example 2.8
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data, since height is measured.
60; 60.5; 61; 61; 61.5
63.5; 63.5; 63.5
64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5
66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5
68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5
70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71
72; 72; 72; 72.5; 72.5; 73; 73.5
74
The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point.
60 – 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.
The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you choose eight bars.
\(\frac{74.05 - 59.95}{8} = 1.7625\)
We will round up to two and make each bar or class interval two units wide. Rounding up to two is one way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that is followed by some for the width of a bar or class interval is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 and round to 12 bars or intervals.
The boundaries are:
- 59.95
- 59.95 + 2 = 61.95
- 61.95 + 2 = 63.95
- 63.95 + 2 = 65.95
- 65.95 + 2 = 67.95
- 67.95 + 2 = 69.95
- 69.95 + 2 = 71.95
- 71.95 + 2 = 73.95
- 73.95 + 2 = 75.95
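The width-and-boundary arithmetic above can be sketched in a few lines of Python (variable names are ours):

```python
import math

start, end, num_bars = 59.95, 74.05, 8
raw_width = (end - start) / num_bars   # 1.7625
width = math.ceil(raw_width)           # round up to 2 so no value lands on a boundary

# num_bars + 1 boundaries, from 59.95 up to 75.95 in steps of 2.
boundaries = [round(start + i * width, 2) for i in range(num_bars + 1)]
```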
The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

Try It 2.8
The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe size is measured. Construct a histogram and calculate the width of each bar or class interval. Suppose you choose six bars.
9; 9; 9.5; 9.5; 10; 10; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5
11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5
12; 12; 12; 12; 12; 12; 12; 12.5; 12.5; 12.5; 12.5; 14
Example 2.9
Create a histogram for the following data: the number of books bought by 50 part-time college students at ABC College. The number of books is discrete data, since books are counted.
1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1
2; 2; 2; 2; 2; 2; 2; 2; 2; 2
3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3
4; 4; 4; 4; 4; 4
5; 5; 5; 5; 5
6; 6
Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy four books. Five students buy five books. Two students buy six books.
Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and the ending value is 6.5.
Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6, and the starting point is 0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from _______ to _______, the 5 in the middle of the interval from _______ to _______, and the _______ in the middle of the interval from _______ to _______ .
- the 4 in the middle of the interval from 3.5 to 4.5
- the 5 in the middle of the interval from 4.5 to 5.5
- the 6 in the middle of the interval from 5.5 to 6.5
Calculate the number of bars as follows:
\(\frac{6.5-0.5}{\text{number of bars}}=1\)
where 1 is the width of a bar. Therefore, bars = 6.
The following histogram displays the number of books on the x-axis and the frequency on the y-axis.

Example 2.10
Using this data set, construct a histogram.
Number of hours my classmates spent playing video games on weekends | | | | |
---|---|---|---|---|
9.95 | 10 | 2.25 | 16.75 | 0 |
19.5 | 22.5 | 7.5 | 15 | 12.75 |
5.5 | 11 | 10 | 20.75 | 17.5 |
23 | 21.9 | 24 | 23.75 | 18 |
20 | 15 | 22.9 | 18.8 | 20.5 |
Table 2.13

Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
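The left-boundary convention just described can be made concrete with a short Python sketch. The boundaries and sample values below are illustrative, not taken from Table 2.13:

```python
def bin_index(value, boundaries):
    """Return the index of the class interval containing value.

    Each interval includes its left boundary and excludes its right
    boundary, matching the convention described in the text.
    """
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= value < boundaries[i + 1]:
            return i
    raise ValueError("value falls outside all intervals")

boundaries = [0, 5, 10, 15, 20, 25]

# 5 sits exactly on a boundary: it counts in [5, 10), not in [0, 5)
assert bin_index(5, boundaries) == 1
assert bin_index(4.9, boundaries) == 0
```

Other researchers could just as legitimately use the opposite convention (right boundary included); the point is to pick one rule and apply it consistently.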
Frequency Polygons
Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons. To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points. After all the points are plotted, draw line segments to connect them.
Example 2.11
A frequency polygon was constructed from the frequency table below.
Frequency distribution for calculus final test scores | | | |
---|---|---|---|
Lower bound | Upper bound | Frequency | Cumulative frequency |
49.5 | 59.5 | 5 | 5 |
59.5 | 69.5 | 10 | 15 |
69.5 | 79.5 | 30 | 45 |
79.5 | 89.5 | 40 | 85 |
89.5 | 99.5 | 15 | 100 |
Table 2.14

The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Since the lowest test score is 54.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 54.5 represents the next interval, or the first "real" interval from the table, and contains five scores. This reasoning is followed for each of the remaining intervals with the point 104.5 representing the interval from 99.5 to 109.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.
Try It 2.11
Construct a frequency polygon for the ages at inauguration shown in Table 2.15.
Age at inauguration | Frequency |
---|---|
41.5–46.5 | 4 |
46.5–51.5 | 11 |
51.5–56.5 | 14 |
56.5–61.5 | 9 |
61.5–66.5 | 4 |
66.5–71.5 | 2 |
Table 2.15
Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.
Example 2.12
We will construct an overlay frequency polygon comparing the scores from Example 2.11 with the students' final numeric grade.
Frequency distribution for calculus final test scores | | | |
---|---|---|---|
Lower bound | Upper bound | Frequency | Cumulative frequency |
49.5 | 59.5 | 5 | 5 |
59.5 | 69.5 | 10 | 15 |
69.5 | 79.5 | 30 | 45 |
79.5 | 89.5 | 40 | 85 |
89.5 | 99.5 | 15 | 100 |
Table 2.16
Frequency distribution for calculus final grades | |||
---|---|---|---|
Lower bound | Upper bound | Frequency | Cumulative frequency |
49.5 | 59.5 | 10 | 10 |
59.5 | 69.5 | 10 | 20 |
69.5 | 79.5 | 30 | 50 |
79.5 | 89.5 | 45 | 95 |
89.5 | 99.5 | 5 | 100 |
Table 2.17

Constructing a Time Series Graph
Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note the temperature and write this down in a log. A variety of statistical studies could be done with these data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.
One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we don't have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.
To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.
Example 2.13
Construct a time series graph for the monthly data shown in the two tables below.
Year | Jan | Feb | Mar | Apr | May | Jun | Jul |
---|---|---|---|---|---|---|---|
2003 | 181.7 | 183.1 | 184.2 | 183.8 | 183.5 | 183.7 | 183.9 |
2004 | 185.2 | 186.2 | 187.4 | 188.0 | 189.1 | 189.7 | 189.4 |
2005 | 190.7 | 191.8 | 193.3 | 194.6 | 194.4 | 194.5 | 195.4 |
2006 | 198.3 | 198.7 | 199.8 | 201.5 | 202.5 | 202.9 | 203.5 |
2007 | 202.416 | 203.499 | 205.352 | 206.686 | 207.949 | 208.352 | 208.299 |
2008 | 211.080 | 211.693 | 213.528 | 214.823 | 216.632 | 218.815 | 219.964 |
2009 | 211.143 | 212.193 | 212.709 | 213.240 | 213.856 | 215.693 | 215.351 |
2010 | 216.687 | 216.741 | 217.631 | 218.009 | 218.178 | 217.965 | 218.011 |
2011 | 220.223 | 221.309 | 223.467 | 224.906 | 225.964 | 225.722 | 225.922 |
2012 | 226.665 | 227.663 | 229.392 | 230.085 | 229.815 | 229.478 | 229.104 |
Table 2.18
Year | Aug | Sep | Oct | Nov | Dec | Annual |
---|---|---|---|---|---|---|
2003 | 184.6 | 185.2 | 185.0 | 184.5 | 184.3 | 184.0 |
2004 | 189.5 | 189.9 | 190.9 | 191.0 | 190.3 | 188.9 |
2005 | 196.4 | 198.8 | 199.2 | 197.6 | 196.8 | 195.3 |
2006 | 203.9 | 202.9 | 201.8 | 201.5 | 201.8 | 201.6 |
2007 | 207.917 | 208.490 | 208.936 | 210.177 | 210.036 | 207.342 |
2008 | 219.086 | 218.783 | 216.573 | 212.425 | 210.228 | 215.303 |
2009 | 215.834 | 215.969 | 216.177 | 216.330 | 215.949 | 214.537 |
2010 | 218.312 | 218.439 | 218.711 | 218.803 | 219.179 | 218.056 |
2011 | 226.545 | 226.889 | 226.421 | 226.230 | 225.672 | 224.939 |
2012 | 230.379 | 231.407 | 231.317 | 230.221 | 229.601 | 229.594 |
Table 2.19

Try It 2.13
Using the table below, construct a time series graph of CO2 emissions for the United States.
CO2 emissions | |||
---|---|---|---|
Year | Ukraine | United Kingdom | United States |
2003 | 352,259 | 540,640 | 5,681,664 |
2004 | 343,121 | 540,409 | 5,790,761 |
2005 | 339,029 | 541,990 | 5,826,394 |
2006 | 327,797 | 542,045 | 5,737,615 |
2007 | 328,357 | 528,631 | 5,828,697 |
2008 | 323,657 | 522,247 | 5,656,839 |
2009 | 272,176 | 474,579 | 5,299,563 |
Table 2.20
Uses of a Time Series Graph
Time series graphs are important tools in various applications of statistics. When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.
How NOT to Lie with Statistics
It is important to remember that the very reason we develop a variety of methods to present data is to develop insights into what the observations represent. We want to get a "sense" of the data. Are the observations all very much alike, or are they spread across a wide range of values? Are they bunched at one end of the spectrum, or are they distributed evenly? We are trying to get a visual picture of the numerical data. Shortly we will develop formal mathematical measures of the data, but our visual graphical presentation can say much. It can, unfortunately, also say much that is distracting, confusing, and simply wrong in terms of the impression the visual leaves. Many years ago, Darrell Huff wrote the book How to Lie with Statistics. It has been through more than 25 printings and sold more than one and one-half million copies. His perspective was a harsh one and used many actual examples that were designed to mislead. He wanted to make people aware of such deception, but perhaps more importantly, to educate so that others do not make the same errors inadvertently.
Again, the goal is to enlighten with visuals that tell the story of the data. Pie charts have a number of common problems when used to convey the message of the data. Too many pieces of the pie overwhelm the reader; more than perhaps five or six categories defeats the purpose of a pie chart, which is, after all, to show the relative importance of each piece compared with the others. If there are more components than this, then perhaps an alternative approach would be better, or perhaps some can be consolidated into an "other" category. Pie charts cannot show changes over time, although we see this attempted all too often. In federal, state, and city finance documents, pie charts are often presented to show the components of revenue available to the governing body for appropriation: income tax, sales tax, motor vehicle taxes, and so on.
In and of itself this is interesting information and can be nicely done with a pie chart. The error occurs when two years are set side-by-side. Because the total revenues change year to year, but the size of the pie is fixed, no real information is provided and the relative size of each piece of the pie cannot be meaningfully compared.
Histograms can be very helpful in understanding the data. Properly presented, they can be a quick visual way to present probabilities of different categories by the simple visual of comparing relative areas in each category. Here the error, purposeful or not, is to vary the width of the categories. This of course makes comparison to the other categories impossible. It does embellish the importance of the category with the expanded width because it has a greater area, inappropriately, and thus visually "says" that that category has a higher probability of occurrence.
Time series graphs perhaps are the most abused. A plot of some variable across time should never be presented on axes that change part way across the page either in the vertical or horizontal dimension. Perhaps the time frame is changed from years to months. Perhaps this is to save space or because monthly data was not available for early years. In either case this confounds the presentation and destroys any value of the graph. If this is not done to purposefully confuse the reader, then it certainly is either lazy or sloppy work.
Changing the units of measurement of the axis can smooth out a drop or accentuate one. If you want to show large changes, then measure the variable in small units, pennies rather than thousands of dollars. And of course, to continue the fraud, be sure that the axis does not begin at the origin, (0, 0). If it begins at (0, 0), then it becomes apparent that the axis has been manipulated.
Perhaps you have a client who is concerned with the volatility of the portfolio you manage. An easy way to present the data is to use long time periods on the time series graph. Use months or, better, quarters rather than daily or weekly data. If that doesn't get the volatility down, then spread the time axis relative to the rate-of-return or portfolio-valuation axis. If you want to show "quick" dramatic growth, then shrink the time axis. Any positive growth will show visually "high" growth rates. Do note that if the growth is negative, then this trick will show the portfolio is collapsing at a dramatic rate.
Again, the goal of descriptive statistics is to convey meaningful visuals that tell the story of the data. Purposeful manipulation is fraud and unethical at the worst, but even at its best, making these types of errors will lead to confusion in the analysis.
3.2.2 Measures of the Location of the Data
The common measures of location are quartiles and percentiles.
Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3,
is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.
To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles
divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or
less than your score and 10% of the test scores are the same or greater than your test score.
Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges
and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates
into a score of at least 1220.
Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one
particular data value is not significant.
The median is a number that measures the "center" of the data. You can think of the median as the "middle value," but it does not actually have to be one of the observed values. It
is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data.
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the
two values together and divide by two.
\(\frac{6.8+7.2}{2} =7\)
The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the
quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea,
consider the same data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.
1; 1; 2; 2; 4; 6; 6.8
The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values is the same as or less than two, and three-fourths of the values are more than two.
The upper
half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.
The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the
ordered data set are greater than nine. The third quartile is part of the data set in this example.
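The "median of the halves" procedure above can be sketched in Python. Note that this is only one of several quartile conventions; spreadsheet and statistics software often interpolate differently and may give slightly different answers for the same data:

```python
from statistics import median

def quartiles(data):
    """Q1, median, Q3 using the median-of-halves method from the text."""
    s = sorted(data)
    n = len(s)
    lower = s[: n // 2]              # values below the median position
    upper = s[n // 2 + n % 2 :]      # values above it (skip the median if n is odd)
    return median(lower), median(s), median(upper)

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, m, q3 = quartiles(data)          # 2, 7.0, 9, as in the worked example
```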
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference
between the third quartile (Q3) and the first quartile (Q1).
IQR = Q3 – Q1
The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is less than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile. Potential outliers always require further investigation.
NOTE
A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.
Example 2.14
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000
Solution 2.14
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000
M = \(488,800\)
Q1 = \(\frac{230,500 + 387,000}{2} = 308,750\)
Q3 = \(\frac{639,000 + 659,000}{2} = 649,000\)
IQR = \(649,000 – 308,750 = 340,250\)
\((1.5)(IQR) = (1.5)(340,250) = 510,375\)
\(Q _1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625\)
\(Q_3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375\)
No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
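As a check, Example 2.14 can be reproduced with a short Python sketch using the same median-of-halves quartiles and the 1.5 × IQR fences. It is written for an odd-length data set like this one (13 prices):

```python
from statistics import median

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

s = sorted(prices)
n = len(s)                       # 13, so the median is the 7th value
q1 = median(s[: n // 2])         # median of the lower six values: 308,750
q3 = median(s[n // 2 + 1 :])     # median of the upper six values: 649,000
iqr = q3 - q1                    # 340,250

low_fence = q1 - 1.5 * iqr       # -201,625
high_fence = q3 + 1.5 * iqr      # 1,159,375
outliers = [p for p in s if p < low_fence or p > high_fence]
```

Only the 5,500,000 price falls outside the fences, matching the worked solution.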
Example 2.15
For the two data sets in the test scores example, find the following:
a. The interquartile range. Compare the two interquartile ranges.
b. Any outliers in either set.
Solution 2.15
The five number summary for the day and night classes is
Minimum | Q1 | Median | Q3 | Maximum | |
---|---|---|---|---|---|
Day | 32 | 56 | 74.5 | 82.5 | 99 |
Night | 25.5 | 78 | 81 | 89 | 98 |
Table 2.21
a. The IQR for the day group is Q3 – Q1 = 82.5 – 56 = 26.5
The IQR for the night group is Q3 – Q1 = 89 – 78 = 11
The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class's test scores.
b. Day class outliers are found using the IQR times 1.5 rule. So,
- Q1 - IQR(1.5) = 56 – 26.5(1.5) = 16.25
- Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25
Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.
Night class outliers are calculated as:
- Q1 – IQR (1.5) = 78 – 11(1.5) = 61.5
- Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5
For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.
Example 2.16
Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:
Amount of sleep per school night (hours) | Frequency | Relative frequency | Cumulative relative frequency |
---|---|---|---|
4 | 2 | 0.04 | 0.04 |
5 | 5 | 0.10 | 0.14 |
6 | 7 | 0.14 | 0.28 |
7 | 12 | 0.24 | 0.52 |
8 | 14 | 0.28 | 0.80 |
9 | 7 | 0.14 | 0.94 |
10 | 3 | 0.06 | 1.00 |
Table 2.22
Find the 28th percentile. Notice the 0.28 in the "cumulative relative frequency" column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five
5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.
Find the median. Look again at the "cumulative relative frequency" column and find
0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile
is between the 25th, or seven, and 26th, or seven, values. The median is seven.
Find the third quartile. The third quartile is the same as the 75th percentile. You can "eyeball"
this answer. If you look at the "cumulative relative frequency" column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight.
Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values
below the third quartile and 12 values above).
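The counting arguments in Example 2.16 can be checked by expanding the frequency table back into the 50 raw sleep values. This is a sketch using the counts from Table 2.22:

```python
# Expand Table 2.22 into the 50 individual sleep amounts.
freq = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}
data = sorted(h for h, count in freq.items() for _ in range(count))
n = len(data)                        # 50 values

# 28th percentile: 28% of 50 = 14 values lie below it, so it sits
# between the 14th and 15th ordered values (indexes 13 and 14).
p28 = (data[13] + data[14]) / 2      # (6 + 7) / 2 = 6.5

# Median: between the 25th and 26th ordered values, both of which are 7.
med = (data[24] + data[25]) / 2      # 7.0
```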
Try It 2.16
Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.
Amount of time spent on route (hours) | Frequency | Relative frequency | Cumulative relative frequency |
---|---|---|---|
2 | 12 | 0.30 | 0.30 |
3 | 14 | 0.35 | 0.65 |
4 | 10 | 0.25 | 0.90 |
5 | 4 | 0.10 | 1.00 |
Table 2.23
Example 2.17
Using Table 2.22:
a. Find the 80th percentile.
b. Find the 90th percentile.
c. Find the first quartile. What is another name for the first quartile?
Solution 2.17
Using the data from the frequency table, we have:
a. The 80th percentile is between the last eight and the first nine in the table (between the 40th and 41st values). Therefore, we need to take the mean of the 40th and 41st values. The 80th percentile \(=\frac{8+9}{2}=8.5\)
b. The 90th percentile will be the 45th data value (location is 0.90(50) = 45) and the 45th data value is nine.
c. Q1 is also the 25th percentile. The 25th percentile location calculation: P25 = 0.25(50) = 12.5, which rounds up to 13, the 13th data value. Thus, the 25th percentile is six.
A Formula for Finding the kth Percentile
If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.
\(k\) = the kth percentile. It may or may not be part of the data.
\(i\) = the index (ranking
or position of a data value)
\(n\) = the total number of data points, or observations
- Order the data from smallest to largest.
- Calculate \(i=\frac{k}{100} (n+1)\)
- If \(i\) is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.
- If \(i\) is not an integer, then round \(i\) up and round \(i\) down to the nearest integers. Average the two data values in these two positions in the ordered data set. This is easier to understand in an example.
Example 2.18
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
a. Find the 70th percentile.
b. Find the 83rd percentile.
Solution 2.18
a.
- \( k = 70\)
- \(i =\) the index
- \(n = 29\)
\(i = \frac{k}{100} (n + 1) = (\frac{70}{100})(29 + 1) = 21\). Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.
b.
- \(k = 83\)
- \(i =\) the index
- \(n = 29\)
\(i = \frac{k}{100} (n + 1) = (\frac{83}{100})(29 + 1) = 24.9\), which is NOT an integer. Round it down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. Average 71 and 72. The
83rd percentile is 71.5 years.
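The index formula \(i=\frac{k}{100}(n+1)\) and the two worked answers above can be sketched as a small Python function. As noted earlier, this is only one of several percentile conventions:

```python
def kth_percentile(data, k):
    """Percentile via i = (k/100)(n+1), averaging neighbors if i is fractional."""
    s = sorted(data)
    i = k * (len(s) + 1) / 100           # 1-based index into the ordered data
    if i == int(i):
        return s[int(i) - 1]
    lo, hi = int(i), int(i) + 1          # round i down and round i up
    return (s[lo - 1] + s[hi - 1]) / 2   # average the two neighboring values

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

p70 = kth_percentile(ages, 70)   # i = 21, the 21st value: 64
p83 = kth_percentile(ages, 83)   # i = 24.9, average of 71 and 72: 71.5
```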
Try It 2.18
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
Calculate the 20th percentile
and the 55th percentile.
A Formula for Finding the Percentile of a Value in a Data Set
- Order the data from smallest to largest.
- \(x =\) the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
- \(y =\) the number of data values equal to the data value for which you want to find the percentile.
- \(n =\) the total number of data.
- Calculate \(\frac{x+0.5y}{n}(100)\). Then round to the nearest integer.
Example 2.19
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
a. Find the percentile for 58.
b. Find the percentile for 25.
Solution 2.19
a. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58.
\(x = 18\) and \(y = 1\). \(\frac{x+0.5y}{n}(100) = \frac{18+0.5(1)}{29}(100) = 63.80\). Fifty-eight is the 64th percentile.
b. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25.
\(x = 3\) and \(y = 1\). \(\frac{x+0.5y}{n}(100) = \frac{3+0.5(1)}{29}(100) = 12.07\). Twenty-five is the 12th percentile.
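Conversely, the percentile of a given value, \(\frac{x+0.5y}{n}(100)\), can be sketched and checked against both answers above:

```python
def percentile_of(data, value):
    """Percentile of value via (x + 0.5y)/n * 100, rounded to an integer."""
    x = sum(1 for v in data if v < value)   # values strictly below
    y = sum(1 for v in data if v == value)  # values equal to it
    return round((x + 0.5 * y) / len(data) * 100)

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

assert percentile_of(ages, 58) == 64   # 58 is the 64th percentile
assert percentile_of(ages, 25) == 12   # 25 is the 12th percentile
```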
Interpreting Percentiles, Quartiles, and Median
A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. For a data value at the pth percentile, p percent of data values are less than or equal to it. For example, 15% of data values are less than or equal to the 15th percentile.
- Low percentiles always correspond to lower data values.
- High percentiles always correspond to higher data values.
A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to which the data applies. In some situations,
a low percentile would be considered "good;" in other contexts a high percentile might be considered "good". In many situations, there is no value judgment that applies.
Understanding how to interpret percentiles properly is important not only
when describing data, but also when calculating probabilities in later chapters of this text.
NOTE
When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.
- information about the context of the situation being considered
- the data value (value of the variable) that represents the percentile
- the percent of individuals or items with data values below the percentile
- the percent of individuals or items with data values above the percentile.
Example 2.20
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.
Solution 2.20
- Twenty-five percent of students finished the exam in 35 minutes or less.
- Seventy-five percent of students finished the exam in 35 minutes or more.
- A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish).
Example 2.21
On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.
Solution 2.21
- Seventy percent of students answered 16 or fewer questions correctly.
- Thirty percent of students answered 16 or more questions correctly.
- A higher percentile could be considered good, as answering more questions correctly is desirable.
Try It 2.21
On a 60 point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.
Example 2.22
At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.
Solution 2.22
- Thirty percent of students are enrolled in seven or fewer credit units.
- Seventy percent of students are enrolled in seven or more credit units.
- In this example, there is no "good" or "bad" value judgment associated with a higher or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.
Example 2.23
Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students
are shown.
0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes
10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes;
30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes
Determine the following five values.
- Min = 0
- Q1 = 20
- Med = 40
- Q3 = 60
- Max = 300
If you were the principal, would you be justified in purchasing new fitness equipment? Since 75% of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise
between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.
However, the principal needs to be careful. The value 300 appears to be a
potential outlier.
Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120.
The value 300 is greater than 120 so it is a potential outlier. If we delete it and calculate the five values, we get the following values:
- Min = 0
- Q1 = 20
- Med = 35
- Q3 = 60
- Max = 120
We still have 75% of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample and the principal should survey more students to be sure of his survey
results.
3.2.3 Measures of the Center of the Data
The "center" of a data set is also a way of describing location. The two most widely used measures of the "center" of the data are the mean (average) and the median. To calculate the mean weight of 50
people, add the 50 weights together and divide by 50. Technically this is the arithmetic mean. We will discuss the geometric mean later. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts, meaning an equal number of observations on each side. The weights of 25 people are below this weight, and the weights of 25 people are above it. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.
NOTE
The words "mean" and "average" are often used interchangeably. The substitution of one word for the other is common practice. The technical term is "arithmetic mean" and "average" is technically a center location. Formally, the arithmetic mean is called the first moment of the distribution by mathematicians. However, in practice among non-statisticians, "average" is commonly accepted for "arithmetic mean".
When each value in the data set is not unique, the mean can be calculated by multiplying each distinct value by its frequency and then dividing the sum by the total number of data values. The letter used to represent the sample mean is an x with a bar over it (pronounced "x bar"): \(\overline x\).
The Greek letter μ (pronounced "mew") represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.
To see that both ways of calculating the mean are the same, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
\(\overline x =\frac{1+1+1+2+2+3+4+4+4+4+4}{11}=2.7\)
\(\overline x =\frac{3(1)+2(2)+1(3)+5(4)}{11}=2.7\)
In the second calculation, the frequencies are 3, 2, 1, and 5.
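The equivalence of the two calculations can be verified directly with a minimal sketch:

```python
data = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Summing every value:
mean_direct = sum(data) / len(data)

# Weighting each distinct value by its frequency:
freq = {1: 3, 2: 2, 3: 1, 4: 5}
mean_weighted = sum(v * f for v, f in freq.items()) / sum(freq.values())

assert mean_direct == mean_weighted   # both are 30/11, about 2.7
```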
You can quickly find the location of the median by using the expression \(\frac{n+1}{2}\).
The letter \(n\) is the total number of data values in the sample. If \(n\) is an odd
number, the median is the middle value of the ordered data (ordered smallest to largest). If \(n\) is an even number, the median is equal to the two middle values added together and divided by two after the data has been ordered. For example, if the
total number of data values is 97, then \(\frac{n+1}{2}= \frac{97+1}{2} = 49\). The median is the 49th value in the ordered data. If the total number of data values is 100, then \(\frac{n+1}{2}= \frac{100+1}{2} = 50.5\). The median occurs
midway between the 50th and 51st values. The location of the median and the value of the median are not the same. The upper case letter M is often used to represent the median. The next example illustrates
the location of the median and the value of the median.
Example 2.24
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31;
32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;
Calculate the mean and the median.
Solution 2.24
The calculation for the mean is:
\(\overline x=\frac{[3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+...+35+37+40+(44)(2)+47]}{40}=23.6\)
To find the median, M, first use the formula for the location. The location is:
\(\frac{n+1}{2}=\frac{40+1}{2}=20.5\)
Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;
\(M=\frac{24+24}{2}=24\)
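As a check, the location formula and both statistics from Example 2.24 can be reproduced in a few lines of Python (illustrative, not part of the original text):

```python
from statistics import mean, median

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21,
          22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34,
          34, 35, 37, 40, 44, 44, 47]

n = len(months)           # 40 data values
location = (n + 1) / 2    # 20.5: the median lies between the 20th and 21st values
M = median(months)        # averages the two middle values when n is even

print(round(mean(months), 1), location, M)  # 23.6 20.5 24.0
```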
Example 2.25
Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the "center": the mean or the median?
Solution 2.25
\(\overline x =\frac{5,000,000+49(30,000)}{50}=129,400\)
\(M = 30,000\)
(There are 49 people who earn $30,000 and one person who earns $5,000,000.)
The median is a better measure of the "center" than the mean because 49 of the values
are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
Another measure of the center is the mode. The mode is the most frequent value. There can be more than
one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.
Example 2.26
Statistics exam scores for 20 students are as follows:
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Find the mode.
Solution 2.26
The most frequent score is 72, which occurs five times. Mode = 72.
Example 2.27
Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.
When is the mode the best measure of the "center"? Consider a weight loss program that advertises a mean weight loss
of six pounds the first week of the program. The mode might indicate that most people lose two pounds the first week, making the program less appealing.
NOTE
The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.
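A minimal Python sketch of the mode calculations from Examples 2.26 and 2.27 and the note above (`multimode` requires Python 3.8 or later):

```python
from statistics import mode, multimode

scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72,
          76, 78, 81, 83, 84, 84, 84, 90, 93]
print(mode(scores))  # 72 (occurs five times)

# multimode returns every value tied for the highest frequency,
# so it also handles bimodal and qualitative data.
print(multimode([430, 430, 480, 480, 495]))                  # [430, 480]
print(multimode(["red", "red", "red", "green", "green",
                 "yellow", "purple", "black", "blue"]))      # ['red']
```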
Calculating the Arithmetic Mean of Grouped Frequency Tables
When only grouped data is available, we do not know the individual data values (we only know intervals and interval frequencies); therefore, we cannot compute an exact mean for the data set. What we must do is estimate the actual mean by calculating the mean of a frequency table. A frequency table is a data representation in which grouped data is displayed along with the corresponding frequencies. To calculate the mean from a grouped frequency table we can apply the basic definition of mean:
\(\text{mean} = \frac{\text{data sum}}{\text{number of data values}}\)
We simply need to modify the definition to fit within the restrictions of a frequency table.
Since we do not know the individual data values, we can instead find the midpoint of each interval. The midpoint is \(\frac{\text{lower boundary} \: + \: \text{upper boundary}}{2}\). We can now modify the mean definition to be
\(\text{Mean of Frequency Table}=\frac{\sum fm}{\sum f}\)
where \(f =\) the frequency of the interval and \(m =\) the midpoint of the interval.
Example 2.28
A frequency table displaying Professor Blount's last statistics test is shown. Find the best estimate of the class mean.
Grade interval | Number of students |
---|---|
50–56.5 | 1 |
56.5–62.5 | 0 |
62.5–68.5 | 4 |
68.5–74.5 | 4 |
74.5–80.5 | 2 |
80.5–86.5 | 3 |
86.5–92.5 | 4 |
92.5–98.5 | 1 |
Solution 2.28
- Find the midpoints for all intervals
Grade interval | Midpoint |
---|---|
50–56.5 | 53.25 |
56.5–62.5 | 59.5 |
62.5–68.5 | 65.5 |
68.5–74.5 | 71.5 |
74.5–80.5 | 77.5 |
80.5–86.5 | 83.5 |
86.5–92.5 | 89.5 |
92.5–98.5 | 95.5 |
Table 2.25
- Calculate the sum of the product of each interval frequency and midpoint: \(\sum fm = 53.25(1)+59.5(0)+65.5(4)+71.5(4)+77.5(2)+83.5(3)+89.5(4)+95.5(1)=1460.25\)
- \(μ=\frac{∑fm}{∑f}=\frac{1460.25}{19}=76.86\)
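The grouped-mean arithmetic in Solution 2.28 can be sketched in Python as follows (illustrative, not part of the original text):

```python
# Midpoints and frequencies from the grade intervals above.
midpoints = [53.25, 59.5, 65.5, 71.5, 77.5, 83.5, 89.5, 95.5]
frequencies = [1, 0, 4, 4, 2, 3, 4, 1]

sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))  # 1460.25
sum_f = sum(frequencies)                                     # 19 students
print(round(sum_fm / sum_f, 2))                              # 76.86
```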
Try It 2.28
Maris conducted a study on the effect that playing video games has on memory recall. As part of her study, she compiled the following data:
Hours teenagers spend on video games | Number of teenagers |
---|---|
0–3.5 | 3 |
3.5–7.5 | 7 |
7.5–11.5 | 12 |
11.5–15.5 | 7 |
15.5–19.5 | 9 |
What is the best estimate for the mean number of hours spent playing video games?
3.2.4 Sigma Notation and Calculating the Arithmetic Mean
Formula for Population Mean
\(\mu=\frac{1}{N} \sum_{i=1}^{N} x_{i}\)
Formula for Sample Mean
\(\overline x =\frac{1}{n} \sum_{i=1}^{n} x_{i}\)
This unit is here to remind you of material that you once studied and said at the time "I am sure that I will never need this!"
Here are the formulas for a population mean and the sample mean. The Greek letter \(μ\) is the symbol for the population
mean and \(\overline x\) is the symbol for the sample mean. Both formulas have a mathematical symbol that tells us how to make the calculations. It is called Sigma notation because the symbol is the Greek capital letter sigma: \(Σ\). Like all mathematical
symbols it tells us what to do: just as the plus sign tells us to add, the multiplication sign \(×\) tells us to multiply. These are called mathematical operators. The \(Σ\) symbol tells us to add a specific list of numbers.
Let's say we have a sample of animals from the local animal shelter and we are interested in their average age. If we list each value, or observation, in a column, you can give each one an index number. The first number will be number 1 and the second number 2 and so on.
Animal | Age |
1 | 9 |
2 | 1 |
3 | 8.5 |
4 | 10.5 |
5 | 10 |
6 | 8.5 |
7 | 12 |
8 | 8 |
9 | 1 |
10 | 9.5 |
To calculate the mean we are told by the formula to add up all these numbers, ages in this case, and then divide the sum by 10, the total number of animals in the sample.
Animal number one, the cat Purr, is designated as \(X_1\); animal number 2, Toto, is designated as \(X_2\); and so on through Dundee, who is animal number 10 and is designated as \(X_{10}\).
The \(i\) in the formula tells us which of the observations to add together. In this case it is \(X_1\) through \(X_{10}\), which is all of them. We know which ones to add by the indexing notation: the \(i = 1\) below the \(Σ\) and the \(n\) (or capital \(N\) for a population) above it. For this example the indexing notation would be \(i = 1\) and, because it is a sample, we use a small \(n\) on top of the \(Σ\), which would be 10.
The standard deviation requires the same mathematical operator and so it would be helpful to recall this knowledge from your past.
The sum of the ages is found to be 78 and dividing by 10 gives us the sample mean age as 7.8 years.
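The summation that the sigma notation describes is simply a loop over the indexed observations; a minimal Python sketch using the shelter ages above:

```python
ages = [9, 1, 8.5, 10.5, 10, 8.5, 12, 8, 1, 9.5]  # x_1 through x_10

n = len(ages)
x_bar = sum(ages) / n    # (1/n) * Σ x_i, exactly what the formula prescribes
print(sum(ages), x_bar)  # 78.0 7.8
```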
3.2.5 Geometric Mean
The arithmetic mean, the median, and the mode are all measures of the "center" of the data, the "average". They are all, in their own way, trying to measure the "common" point within the data, that which is "normal". In the case of the arithmetic mean this is solved by finding the value from which all points are equal linear distances. We can imagine that all the data values are combined through addition and then distributed back to each data point in equal amounts. The sum of all the values is what is redistributed in equal amounts such that the total sum remains the same.
The geometric mean redistributes not the sum of the values but their product: all the individual values are multiplied together and then redistributed in equal portions such that the total product remains the same. This can be seen from the formula for the geometric mean, \(\tilde x\) (pronounced x-tilde):
\(\tilde{x}=\left(\prod_{i=1}^{n} x_{i}\right)^{\frac{1}{n}}=\sqrt[n]{x_{1} \cdot x_{2} \cdots x_{n}}=\left(x_{1} \cdot x_{2} \cdots x_{n}\right)^{\frac{1}{n}}\)
where \(Π\) (capital Greek pi) is another mathematical operator that tells us to multiply all the \(x_i\) numbers, in the same way the capital Greek sigma tells us to add all the \(x_i\) numbers. Remember that a fractional exponent calls for the nth root of the number; thus an exponent of 1/3 is the cube root of the number.
The geometric mean answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same product?" The geometric mean gets its name from the fact that when redistributed in this way the sides form a geometric shape for which all sides have the same length. To see this, take the example of the numbers 10, 51.2 and 8. The geometric mean is found by multiplying these three numbers together (giving 4,096) and taking the cube root, because there are three numbers among which this product is to be distributed. Thus the geometric mean of these three numbers is 16. This describes a cube of 16 × 16 × 16, which has a volume of 4,096 cubic units.
The geometric mean is relevant in economics and finance for dealing with growth: growth of markets, investments, populations, and any other variable whose growth is of interest. Imagine that our box of 4,096 units (perhaps dollars) is the value of an investment after three years and that the investment returns in percents were the three numbers in our example. The geometric mean will provide us with the answer to the question, what is the average rate of return: 16 percent. The arithmetic mean of these three numbers is 23.1 percent. The reason for this difference, 16 versus 23.1, is that the arithmetic mean is additive and thus does not account for the interest on the interest, the compound interest, embedded in the investment growth process. The same issue arises when asking for the average rate of growth of a population or sales or market penetration, etc., knowing the annual rates of growth. The formula for the geometric mean rate of return, or any other growth rate, is:
\(r_s=(x_1 \cdot x_2 \cdots x_n)^{\frac{1}{n}}−1\)
Manipulating the formula for the geometric mean can also provide a calculation of the average rate of growth between two periods knowing only the initial value \(a_0\) and the ending value \(a_n\) and the number of periods, \(n\). The following formula provides this information:
\((\frac{a_n}{a_0})^{\frac{1}{n}}= \tilde x\)
Finally, we note that the formula for the geometric mean requires that all numbers be positive, greater than zero. The reason of course is that the root of a negative number is undefined for use outside of mathematical theory. There are ways to avoid this problem however. In the case of rates of return and other simple growth problems we can convert the negative values to meaningful positive equivalent values. Imagine that the annual returns for the past three years are +12%, -8%, and +2%. Using the decimal multiplier equivalents of 1.12, 0.92, and 1.02, allows us to compute a geometric mean of 1.0167. Subtracting 1 from this value gives the geometric mean of +1.67% as a net rate of population growth (or financial return). From this example we can see that the geometric mean provides us with this formula for calculating the geometric (mean) rate of return for a series of annual rates of return:
\(r_s = \tilde x−1\)
where \(r_s\) is average rate of return and \( \tilde x\) is the geometric mean of the returns during some number of time periods. Note that the length of each time period must be the same.
As a general rule, one should convert percent values to their decimal equivalent multipliers. It is important to recognize that when dealing with percents, the geometric mean of percent values does not equal the geometric mean of the decimal multiplier equivalents, and it is the decimal multiplier equivalent geometric mean that is relevant.
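A short Python sketch of the geometric mean rate of return, using the +12%, −8%, +2% example above (`math.prod` requires Python 3.8 or later; the code is illustrative only):

```python
import math

# Annual returns of +12%, -8%, and +2% as decimal multipliers.
multipliers = [1.12, 0.92, 1.02]

product = math.prod(multipliers)
g_mean = product ** (1 / len(multipliers))  # nth root of the product
rate = g_mean - 1                           # subtract 1 to recover a growth rate
print(round(rate, 4))                       # 0.0167, i.e. +1.67% per year

# Equivalently, knowing only the start and end values over n periods:
a0, an, n_periods = 100.0, 100.0 * product, 3
print(round((an / a0) ** (1 / n_periods) - 1, 4))  # 0.0167 again
```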
3.2.6 Skewness and the Mean, Median, and Mode
Consider the following data set.
4; 5; 6; 6; 6; 7; 7; 7; 7; 7; 7; 8; 8; 8; 9; 10
This data set can be represented by the following histogram. Each interval has width one, and each value is located in the middle of an interval.
Figure 2.11
The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each seven for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.
The histogram for the data: 4; 5; 6; 6; 6; 7; 7; 7; 7; 8 shown in Figure 2.12 is not symmetrical. The right-hand side seems "chopped off" compared to the left side. A distribution of this type is called skewed to the left because it is pulled out to the left. We can formally measure the skewness of a distribution just as we can mathematically measure the center weight of the data or its general spread. The mathematical formula for skewness is: \(a_3=∑ \frac{(x_i−\overline x)^3}{ns^3}\). The greater the deviation from zero, the greater the degree of skewness. If the skewness is negative, then the distribution is skewed left, as in Figure 2.12. A positive measure of skewness indicates right skewness, as in Figure 2.13.
Figure 2.12
The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.
The histogram for the data: 6; 7; 7; 7; 7; 8; 8; 8; 9; 10 shown in Figure 2.13, is also not symmetrical. It is skewed to the right.
Figure 2.13
The mean is 7.7, the median is 7.5, and the mode is seven. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.
To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.
As with the mean, median and mode, and as we will see shortly, the variance, there are mathematical formulas that give us precise measures of these characteristics of the distribution of the data. Again looking at the formula for skewness we see that this is a relationship between the mean of the data and the individual observations cubed.
\(a_3=∑ \frac{(x_i − \overline x)^3}{ns^3}\)
where s is the sample standard deviation of the data, \(X_i\) , and \(\overline x\) is the arithmetic mean and \(n\) is the sample size.
Formally the arithmetic mean is known as the first moment of the distribution. The second moment, as we will see, is the variance, and skewness is the third moment. The variance measures the squared differences of the data from the mean and skewness measures the cubed differences of the data from the mean. While a variance can never be a negative number, the measure of skewness can, and this is how we determine if the data are skewed right or left. The skewness for a normal distribution is zero, and any symmetric data should have skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. The skewness characterizes the degree of asymmetry of a distribution around its mean. While the mean and standard deviation are dimensional quantities, that is, they have the same units as the measured quantities \(X_i\) (this is why we take the square root of the variance), the skewness is conventionally defined in such a way as to make it nondimensional. It is a pure number that characterizes only the shape of the distribution. A positive value of skewness signifies a distribution with an asymmetric tail extending out towards more positive \(X\) and a negative value signifies a distribution whose tail extends out towards more negative \(X\). A zero measure of skewness will indicate a symmetrical distribution.
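The skewness formula \(a_3\) can be computed directly. The sketch below (illustrative; the helper name `skewness` is ours) applies it to the two data sets from Figures 2.12 and 2.13 and recovers the expected signs:

```python
from statistics import mean, stdev

def skewness(data):
    """Third-moment skewness a_3 = sum((x_i - x_bar)**3) / (n * s**3)."""
    x_bar, s, n = mean(data), stdev(data), len(data)
    return sum((x - x_bar) ** 3 for x in data) / (n * s ** 3)

left_skewed = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8]    # mean 6.3 < median 6.5 < mode 7
right_skewed = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]  # mode 7 < median 7.5 < mean 7.7

print(skewness(left_skewed))   # negative: skewed left
print(skewness(right_skewed))  # positive: skewed right
```

Because the second data set is the mirror image of the first, the two skewness values are equal in size and opposite in sign.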
Skewness and symmetry become important when we discuss probability distributions in later chapters.
3.2.7 Measures of the Spread of the Data
An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.
The standard deviation
- provides a numerical measure of the overall amount of variation in a data set, and
- can be used to determine whether a particular data value is close to or far from the mean.
The standard deviation provides a measure of the overall variation in a data set
The standard deviation is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little
variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket A and supermarket B. The average wait time at both supermarkets is five minutes. At supermarket A, the standard deviation for the wait time is two minutes; at supermarket B, the standard deviation for the wait time is four minutes.
Because supermarket B has a higher standard deviation, we know that there is more variation in the wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the average; wait times
at supermarket A are more concentrated near the average.
Calculating the Standard Deviation
If \(x\) is a number, then the difference "\(x\) minus the mean" is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong
to a population, in symbols a deviation is \(x – μ\). For sample data, in symbols a deviation is \(x – \overline x\).
The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from
a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter s represents the sample standard deviation
and the Greek letter \(σ\) (sigma, lower case) represents the population standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate of \(σ\).
To calculate the standard deviation, we need
to calculate the variance first. The variance is the average of the squares of the deviations (the \(x – \overline x \) values for a sample, or the \(x – μ\) values for a population). The symbol \(σ^2\) represents
the population variance; the population standard deviation \(σ\) is the square root of the population variance. The symbol \(s^2\) represents the sample variance; the sample standard deviation \(s\) is the square root of the sample variance. You can
think of the standard deviation as a special average of the deviations. Formally, the variance is the second moment of the distribution or the first moment around the mean. Remember that the mean is the first moment of the distribution.
If
the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by \(N\), the number of items in the population. If the data are from
a sample rather than a population, when we calculate the average of the squared deviations, we divide by \(n – 1\), one less than the number of items in the sample.
Formulas for the Sample Standard Deviation
- \(s=\sqrt{\frac{\Sigma(x-\bar{x})^{2}}{n-1}} \text { or } s=\sqrt{\frac{\Sigma f(x-\bar{x})^{2}}{n-1}} \text { or } s=\sqrt{\frac{\left(\sum_{i=1}^{n} x_{i}^{2}\right)-n \bar{x}^{2}}{n-1}}\)
- For the sample standard deviation, the denominator is n - 1, that is the sample size minus 1.
Formulas for the Population Standard Deviation
- \(\sigma=\sqrt{\frac{\Sigma(x-\mu)^{2}}{N}} \text { or } \sigma=\sqrt{\frac{\Sigma f(x-\mu)^{2}}{N}} \text { or } \sigma=\sqrt{\frac{\sum_{i=1}^{N} x_{i}^{2}}{N}-\mu^{2}}\)
- For the population standard deviation, the denominator is \(N\), the number of items in the population.
In these formulas, \(f\) represents the frequency with which a value appears. For example, if a value appears once, \(f\) is one. If a value appears three times in the data set or population, \(f\) is three. Two important observations concerning the
variance and standard deviation: the deviations are measured from the mean and the deviations are squared. In principle, the deviations could be measured from any point, however, our interest is measurement from the center weight of the data, what
is the "normal" or most usual value of the observation. Later we will be trying to measure the "unusualness" of an observation or a sample mean and thus we need a measure from the mean. The second observation is that the deviations are squared. This
does two things, first it makes the deviations all positive and second it changes the units of measurement from that of the mean and the original observations. If the data are weights then the mean is measured in pounds, but the variance is measured
in pounds-squared. One reason to use the standard deviation is to return to the original units of measurement by taking the square root of the variance. Further, when the deviations are squared it explodes their value. For example, a deviation of
10 from the mean when squared is 100, but a deviation of 100 from the mean is 10,000. What this does is place great weight on outliers when calculating the variance.
Types of Variability in Samples
When trying to study a population, a sample is often used, either for convenience or because it is not possible to access the entire population. Variability is the term used to describe the differences that may occur in these outcomes. Common types of
variability include the following:
- Observational or measurement variability
- Natural variability
- Induced variability
- Sample variability
Here are some examples to describe each type of variability.
Example 1: Measurement variability
Measurement variability occurs when there are differences in the instruments used to measure or in the people using those instruments. If we are gathering data on how long it takes for a ball to drop from a height by having students measure the time of the drop with a stopwatch, we may experience measurement variability if the two stopwatches used were made by different manufacturers: For example, one stopwatch measures to the nearest second, whereas the other one measures to the nearest tenth of a second. We also may experience measurement variability because two different people are gathering the data. Their reaction times in pressing the button on the stopwatch may differ; thus, the outcomes will vary accordingly. The differences in outcomes may be affected by measurement variability.
Example 2: Natural variability
Natural variability arises from the differences that naturally occur because members of a population differ from each other. For example, if we have two identical corn plants and we expose both plants to the same amount of water and sunlight, they may still grow at different rates simply because they are two different corn plants. The difference in outcomes may be explained by natural variability.
Example 3: Induced variability
Induced variability is the counterpart to natural variability; this occurs because we have artificially induced an element of variation (that, by definition, was not present naturally): For example, we assign people to two different groups to study memory, and we induce a variable in one group by limiting the amount of sleep they get. The difference in outcomes may be affected by induced variability.
Example 4: Sample variability
Sample variability occurs when multiple random samples are taken from the same population. For example, if I conduct four surveys of 50 people randomly selected from a given population, the differences in outcomes may be affected by sample variability.
Example 2.29
In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half
year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5;
\(\overline x =\frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20}=10.525\)
The average age is 10.53 years, rounded to two places.
The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square root of the variance. We will explain the parts of the table after calculating s.
Data | Freq. | Deviations | Deviations² | (Freq.)(Deviations²) |
---|---|---|---|---|
\(x\) | \(f\) | \((x – \overline x)\) | \((x –\overline x)^2\) | \((f)(x –\overline x)^2\) |
9 | 1 | 9 – 10.525 = –1.525 | (–1.525)² = 2.325625 | 1 × 2.325625 = 2.325625 |
9.5 | 2 | 9.5 – 10.525 = –1.025 | (–1.025)² = 1.050625 | 2 × 1.050625 = 2.101250 |
10 | 4 | 10 – 10.525 = –0.525 | (–0.525)² = 0.275625 | 4 × 0.275625 = 1.102500 |
10.5 | 4 | 10.5 – 10.525 = –0.025 | (–0.025)² = 0.000625 | 4 × 0.000625 = 0.002500 |
11 | 6 | 11 – 10.525 = 0.475 | (0.475)² = 0.225625 | 6 × 0.225625 = 1.353750 |
11.5 | 3 | 11.5 – 10.525 = 0.975 | (0.975)² = 0.950625 | 3 × 0.950625 = 2.851875 |
 | | | | The total is 9.7375 |
The sample variance, \(s^2\), is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 – 1):
\(s^2=\frac{9.7375}{20−1}=0.5125\)
The sample standard deviation s is equal to the square root of the sample variance:
\(s=\sqrt{0.5125} = 0.715891\), which is rounded to two decimal places, \(s = 0.72\).
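The table's result can be confirmed with Python's standard library (`statistics.variance` and `statistics.stdev` divide by n − 1, matching the sample formulas used here; the code is illustrative only):

```python
from statistics import mean, stdev, variance

ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5,
        10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

print(round(mean(ages), 3))      # 10.525
print(round(variance(ages), 4))  # 0.5125  (divides by n - 1 = 19)
print(round(stdev(ages), 2))     # 0.72
```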
Explanation of the standard deviation calculation shown in the table
The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11, as indicated by the deviations 0.975 and 0.475. A positive deviation occurs when the data value is greater than
the mean, whereas a negative deviation occurs when the data value is less than the mean. The deviation is –1.525 for the data value nine. If you add the deviations, the sum is always zero. (For Example 2.29, there are n = 20 deviations.)
So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation. By squaring the deviations
we are placing an extreme penalty on observations that are far from the mean; these observations get greater weight in the calculations of the variance. We will see later on that the variance (standard deviation) plays the critical role in determining
our conclusions in inferential statistics. We can begin now by using the standard deviation as a measure of "unusualness." "How did you do on the test?" "Terrific! Two standard deviations above the mean." This, we will see, is an unusually good exam
grade.
The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.
Notice that instead of dividing by
\(n = 20\), the calculation divided by \(n – 1 = 20 – 1 = 19\) because the data is a sample. For the sample variance, we divide by the sample size minus one \((n – 1)\). Why not divide by \(n\)? The answer has to do with the population
variance. The sample variance is an estimate of the population variance. This estimate requires us to use an estimate of the population mean rather than the actual population mean. Based on the theoretical mathematics that lies behind
these calculations, dividing by \((n – 1)\) gives a better estimate of the population variance.
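The n versus n − 1 distinction is exactly the difference between `pvariance` and `variance` in Python's standard library; a quick sketch using the ages from Example 2.29 (illustrative only):

```python
from statistics import pvariance, variance

sample = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5,
          10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

# pvariance divides by N (treats the data as the whole population);
# variance divides by n - 1 (treats the data as a sample).
print(round(pvariance(sample), 6))  # 0.486875  ->  9.7375 / 20
print(round(variance(sample), 6))   # 0.5125    ->  9.7375 / 19
```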
The standard deviation, \(s\) or \(σ\), is either zero or larger than zero. Describing the data with reference to the spread is called "variability".
The variability in data depends upon the method by which the outcomes are obtained; for example, by measuring or by random sampling. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other.
The standard deviation is small when the data are all concentrated close to the mean, and is larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out
about the mean; outliers can make \(s\) or \(σ\) very large.
Example 2.30
Use the following data (first exam scores) from Susan Dean's spring pre-calculus class:
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100
a. Create a chart containing
the data, frequencies, relative frequencies, and cumulative relative frequencies to three decimal places.
b. Calculate the following to one decimal place:
- The sample mean
- The sample standard deviation
- The median
- The first quartile
- The third quartile
- IQR
Solution 2.30
a. See Table 2.29
b.
- The sample mean = 73.5
- The sample standard deviation = 17.9
- The median = 73
- The first quartile = 61
- The third quartile = 90
- IQR = 90 – 61 = 29
Data | Frequency | Relative frequency | Cumulative relative frequency |
---|---|---|---|
33 | 1 | 0.032 | 0.032 |
42 | 1 | 0.032 | 0.064 |
49 | 2 | 0.065 | 0.129 |
53 | 1 | 0.032 | 0.161 |
55 | 2 | 0.065 | 0.226 |
61 | 1 | 0.032 | 0.258 |
63 | 1 | 0.032 | 0.29 |
67 | 1 | 0.032 | 0.322 |
68 | 2 | 0.065 | 0.387 |
69 | 2 | 0.065 | 0.452 |
72 | 1 | 0.032 | 0.484 |
73 | 1 | 0.032 | 0.516 |
74 | 1 | 0.032 | 0.548 |
78 | 1 | 0.032 | 0.580 |
80 | 1 | 0.032 | 0.612 |
83 | 1 | 0.032 | 0.644 |
88 | 3 | 0.097 | 0.741 |
90 | 1 | 0.032 | 0.773 |
92 | 1 | 0.032 | 0.805 |
94 | 4 | 0.129 | 0.934 |
96 | 1 | 0.032 | 0.966 |
100 | 1 | 0.032 | 0.998 (Why isn't this value 1? ANSWER: Rounding) |
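A sketch verifying part b of Solution 2.30 (illustrative; quartiles are computed here as medians of the lower and upper halves, the convention this text uses — other software may apply slightly different quartile conventions):

```python
from statistics import mean, median, stdev

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

print(round(mean(scores), 1))   # 73.5
print(round(stdev(scores), 1))  # 17.9
print(median(scores))           # 73

# Quartiles as medians of the 15 values below and above the overall median.
lower, upper = scores[:15], scores[16:]
print(median(lower), median(upper), median(upper) - median(lower))  # 61 90 29
```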
Standard deviation of Grouped Frequency Tables
Just as the grouped mean used interval midpoints in place of the unknown data values, the grouped sample standard deviation can be estimated (consistent with the calculation in Example 2.31) as
\(s=\sqrt{\frac{\sum f(m-\overline{x})^{2}}{n-1}}\)
where \(f=\) interval frequencies and \(m = \) interval midpoints.
Just as we could not find the exact mean, neither can we find the exact standard deviation. Remember that standard deviation describes numerically the expected deviation a data value has from the mean. In simple English, the standard deviation allows us to compare how "unusual" individual data is compared to the mean.
Example 2.31
Class | Frequency,\(f\) | Midpoint, \(m\) | \(f \cdot m\) | \(f(m - \overline x)^2\) |
---|---|---|---|---|
0–2 | 1 | 1 | 1 ⋅ 1 = 1 | 1(1−6.88)² = 34.57 |
3–5 | 6 | 4 | 6 ⋅ 4 = 24 | 6(4−6.88)² = 49.77 |
6–8 | 10 | 7 | 10 ⋅ 7 = 70 | 10(7−6.88)² = 0.14 |
9–11 | 7 | 10 | 7 ⋅ 10 = 70 | 7(10−6.88)² = 68.14 |
12–14 | 0 | 13 | 0 ⋅ 13 = 0 | 0(13−6.88)² = 0 |
Total | \(n = 24\) | | \(\sum f \cdot m = 165\), so \(\overline x = \frac{165}{24}=6.88\) | \(\sum f(m-\overline x)^2 = 152.62\), so \(s^2=\frac{152.62}{24−1}=6.64\) |
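The grouped-data calculation in Example 2.31 can be sketched in Python as follows. (Using the exact mean 165/24 rather than the rounded 6.88 gives a sum of squared deviations of 152.625 instead of 152.62, but the variance rounds to the same 6.64.)

```python
# Grouped frequency table from Example 2.31
freqs = [1, 6, 10, 7, 0]    # interval frequencies f
mids = [1, 4, 7, 10, 13]    # interval midpoints m

n = sum(freqs)                                         # 24
mean = sum(f * m for f, m in zip(freqs, mids)) / n     # 165/24 = 6.875 -> 6.88

# sum of f * (m - mean)^2, then divide by n - 1 for the sample variance
ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids))
variance = ss / (n - 1)
stdev = variance ** 0.5

print(round(mean, 2), round(variance, 2))   # 6.88 6.64
```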
Comparing Values from Different Data Sets
- For each data value \(x\), calculate how many standard deviations away from its mean the value is.
- Use the formula: \(x\) = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.
- #ofSTDEVs=\(\frac{x – mean}{standard \: deviation}\)
- Compare the results of this calculation.
Sample | \(x = \overline x + zs\) | \(z =\frac{x−\overline x}{s}\) |
Population | \(x = μ+ zσ\) | \(z=\frac{x−μ}{σ}\) |
Example 2.32
Student | GPA | School mean GPA | School standard deviation |
---|---|---|---|
John | 2.85 | 3.0 | 0.7 |
Ali | 77 | 80 | 10 |
For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the average, for his school. Pay careful attention to signs when comparing and interpreting the answer.
\(z= \) #ofSTDEVs \(= \frac{value \: – \: mean}{standard \: deviation} = \frac{x−μ}{σ}\)
For John, \(z=\) #ofSTDEVs \(= \frac{2.85–3.0}{0.7}=–0.21\)
For Ali, \(z=\) #ofSTDEVs \(=\frac{77−80}{10}=−0.3\)
John has the better GPA when compared to his school because his GPA is 0.21 standard deviations below his school's mean while Ali's GPA is 0.3 standard deviations below his school's mean.
John's z-score of –0.21 is higher than Ali's z-score of –0.3. For GPA, higher values are better, so we conclude that John has the better GPA when compared to his school.
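The z-score comparison from Example 2.32 can be sketched in a few lines of Python:

```python
def z_score(x, mean, stdev):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / stdev

john = z_score(2.85, 3.0, 0.7)   # about -0.21
ali = z_score(77, 80, 10)        # -0.3

print(round(john, 2), round(ali, 2))   # -0.21 -0.3

# John's z-score is higher (less negative), so his GPA is better
# relative to his school.
assert john > ali
```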
Try It 2.32
Swimmer | Time (seconds) | Team mean time | Team standard deviation |
---|---|---|---|
Angie | 26.2 | 27.2 | 0.8 |
Beth | 27.3 | 30.1 | 1.4 |
- At least 75% of the data is within two standard deviations of the mean.
- At least 89% of the data is within three standard deviations of the mean.
- At least 95% of the data is within 4.5 standard deviations of the mean.
- This is known as Chebyshev's Rule.
- Approximately 68% of the data is within one standard deviation of the mean.
- Approximately 95% of the data is within two standard deviations of the mean.
- More than 99% of the data is within three standard deviations of the mean.
- This is known as the Empirical Rule.
- It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability distribution in later chapters.
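The Chebyshev bounds listed above all come from the same expression, 1 − 1/k², which holds for any distribution. A quick sketch:

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the
    mean, for ANY distribution (Chebyshev's Rule); requires k > 1."""
    return 1 - 1 / k ** 2

print(chebyshev_bound(2))              # 0.75 -> at least 75%
print(round(chebyshev_bound(3), 2))    # 0.89 -> at least 89%
print(round(chebyshev_bound(4.5), 2))  # 0.95 -> at least 95%
```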
Coefficient of Variation
Another useful way to compare distributions besides simple comparisons of means or standard deviations is to adjust for differences in the scale of the data being measured. Quite simply, a large variation in data with a large mean is different than the same variation in data with a small mean. To adjust for the scale of the underlying data, the Coefficient of Variation (CV) has been developed. Mathematically:

\(CV = \frac{s}{\overline x}\) for a sample (\(\frac{σ}{μ}\) for a population), often expressed as a percentage.

4. Spreadsheet Exercises: Measures of Central Tendency and Spread
Measures of Central Tendency
Mode, Median, Mean, and Midrange
This section explores the measures of central tendency: mode, median, mean and midrange. It shows you the commands for computing these metrics using a spreadsheet program and it gives you the correct format for entering those commands into the spreadsheet program.
Mode
The mode is the value that occurs most frequently in the data. Spreadsheet programs such as Microsoft Excel or OpenOffice.org Calc can determine the mode with the function MODE.
=MODE(data)
In the Fall of 2000 the statistics class gathered data on the number of siblings for each member of the class. One student was an only child and had no siblings. One student had 13 brothers and sisters. The complete data set is as follows:
1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13
The mode is 2 because 2 occurs more often than any other value. Where there is a tie there is no mode.
For the ages of students in that class
18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26
...there is no mode: there is a tie between 21 and 22, hence there is no single most frequent value. Spreadsheets will, however, usually report a mode of 21 in this case: spreadsheets often select the first mode in a multi-modal tie.
If all values appear only once, then there is no mode. Spreadsheets will display #N/A or #VALUE to indicate an error has occurred - there is no mode. No mode is NOT the same as a mode of zero. A mode of zero means that zero is the most frequent data value. Do not put the number 0 (zero) for "no mode." An example of a mode of zero might be the number of children for students in statistics class.
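The same behavior can be sketched with Python's `statistics` module, which mirrors the spreadsheet quirks described above: `mode()` reports the first of the tied values, while `multimode()` exposes the tie.

```python
import statistics

siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
ages = [18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22,
        23, 23, 24, 24, 25, 25, 26]

print(statistics.mode(siblings))    # 2

# Like a spreadsheet, mode() reports only the first tied value (21);
# multimode() reveals that the ages data is actually bimodal.
print(statistics.multimode(ages))   # [21, 22]
```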
Median
The median is the central (or middle) value in a data set. If a single number sits at the middle, then it is the median. If the middle falls between two numbers, then the median is halfway between those two middle numbers.
For the sibling data...
1, 2, 2, 2, 2, 2, 3, 3, 4, 4, |4|, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13
...the median is 4.
Note the data must be in order (sorted) before you can find the median. For the data 2, 4, 6, 8 the median is 5: (4+6)/2.
The median function in spreadsheets is MEDIAN.
=MEDIAN(data)
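A Python equivalent of =MEDIAN(data), as a sketch (the function sorts the data internally, so unsorted input is fine):

```python
import statistics

siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

print(statistics.median(siblings))      # 4 (middle of the 21 sorted values)
print(statistics.median([2, 4, 6, 8]))  # 5.0 (halfway between 4 and 6)
```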
Mean (average)
The mean, also called the arithmetic mean and also called the average, is calculated mathematically by adding the values and then dividing by the number of values (the sample size n).
If the mean is the mean of a population, then it is called the population mean μ. The letter μ is a Greek lower case "m" and is pronounced "mu."
If the mean is the mean of a sample, then it is the sample mean \(\overline x\). The symbol \(\overline x\) is pronounced "x bar."
population mean \(μ = \frac{sum \: of \: the \: population \: data}{population \: size \: N} = \frac{∑X}{N}\)
sample mean \(\overline x = \frac{sum \: of \: the \: sample\: data}{sample\: size \: n} = \frac{∑x}{n} \)
The sum of the data ∑x can be determined using the function =SUM(data). The sample size n can be determined using =COUNT(data). Thus =SUM(data)/COUNT(data) will calculate the mean. There is also a single function that calculates the mean directly: AVERAGE.
=AVERAGE(data)
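Both routes described above, =SUM(data)/COUNT(data) and =AVERAGE(data), have direct Python analogues; a sketch using the sibling data:

```python
import statistics

siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

# =SUM(data)/COUNT(data) equivalent:
mean = sum(siblings) / len(siblings)   # 115 / 21

# =AVERAGE(data) equivalent gives the same result:
assert statistics.mean(siblings) == mean

print(round(mean, 2))   # 5.48
```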
Resistant measure: one that is not influenced by extremely high or extremely low data values. The median tends to be more resistant than the mean.
Population mean and sample mean
If the mean is measured using the whole population then this would be the population mean. If the mean was calculated from a sample then the mean is the sample mean. Mathematically there is no difference in the way the population and sample mean are calculated.
Midrange
The midrange is the midway point between the minimum and the maximum in a set of data.
To calculate the minimum and maximum values, spreadsheets use the minimum value function MIN and maximum value function MAX.
=MIN(data)
=MAX(data)
The MIN and MAX function can take a list of comma separated numbers or a range of cells in a spreadsheet. If the data is in cells A2 to A42, then the minimum and maximum can be found from:
=MIN(A2:A42)
=MAX(A2:A42)
The midrange can then be calculated from:
midrange = (maximum + minimum)/2
In a spreadsheet use the following formula:
=(MAX(data)+MIN(data))/2
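The same midrange formula in Python, as a sketch with the sibling data:

```python
siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

# =(MAX(data)+MIN(data))/2 equivalent:
midrange = (max(siblings) + min(siblings)) / 2

print(midrange)   # 7.0  (= (13 + 1) / 2)
```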
Source: Dana Lee Ling, http://www.comfsm.fm/~dleeling/statistics/text5.html
5. Spreadsheet Exercises: Graphs of Histograms and Frequency Tables
5.1 Graphs and Charts
The table below includes FSM census 2000 data and student seat numbers for the national site of COM-FSM circa 2004.
State | Population (2000) | Fractional share of national population (relative frequency) | Number of student seats held by state at the national campus | Fractional share of the national campus student seats |
---|---|---|---|---|
Chuuk | 53595 | 0.5 | 679 | 0.2 |
Kosrae | 7686 | 0.07 | 316 | 0.09 |
Pohnpei | 34486 | 0.32 | 2122 | 0.62 |
Yap | 11241 | 0.11 | 287 | 0.08 |
Sum: | 107008 | 1 | 3404 | 1 |
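The relative frequency column in the table above is each state's population divided by the national total. A sketch reproducing it in Python:

```python
# FSM 2000 census populations from the table above
populations = {"Chuuk": 53595, "Kosrae": 7686, "Pohnpei": 34486, "Yap": 11241}
total = sum(populations.values())   # 107008

# relative frequency = state population / national total
rel_freq = {state: round(p / total, 2) for state, p in populations.items()}
print(rel_freq)
# {'Chuuk': 0.5, 'Kosrae': 0.07, 'Pohnpei': 0.32, 'Yap': 0.11}
```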
Circle or pie charts
In a circle chart the whole circle is 100%. Circle charts are used when the data adds to a whole, e.g. state populations add to yield the national population.
A pie chart of the state populations:
The following table includes data from the 2010 FSM census as an update to the above data.
State | Population (2010) | Relative frequency |
---|---|---|
Chuuk | 48651 | |
Kosrae | 6616 | |
Pohnpei | 35981 | |
Yap | 11376 | |
Sum: | 102624 | |
Column charts
Column charts are also called bar graphs. A column chart of the student seats held by each state at the national site:
Pareto chart
If a column chart is sorted so that the columns are in descending order, then it is called a Pareto chart. Descending order means the largest value is on the left and the values decrease as one moves to the right. Pareto charts are useful ways to convey rank order as well as numerical data.
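Putting data into Pareto order is just a descending sort by value; a sketch using the student-seat counts from the table above:

```python
# Student seats by state (from the table above)
seats = {"Chuuk": 679, "Kosrae": 316, "Pohnpei": 2122, "Yap": 287}

# Pareto order: largest value first, decreasing to the right
pareto = sorted(seats.items(), key=lambda kv: kv[1], reverse=True)
print(pareto)
# [('Pohnpei', 2122), ('Chuuk', 679), ('Kosrae', 316), ('Yap', 287)]
```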
Line graph
A line graph is a chart which plots data as a line. The horizontal axis is usually set up with equal intervals. Line graphs are not used in this course and should not be confused with xy scattergraphs.
XY Scatter graph
When you have two sets of continuous data (value versus value, no categories), use an xy graph. These will be covered in more detail in the chapter on linear regressions.
Source: Dana Lee Ling; http://www.comfsm.fm/~dleeling/statistics/text5.html
5.2 Shapes of Distributions
The shapes of distributions have names by which they are known.
One of the aspects of a sample that is often similar to the population is the shape of the distribution. If a good random sample of sufficient size has a symmetric distribution, then the population is likely to have a symmetric distribution. The process of projecting results from a sample to a population is called generalizing. Thus we can say that the shape of a sample distribution generalizes to a population.
uniform | peaked symmetric | skewed |
---|---|---|
1 | 1 | 1 |
2 | 5 | 5 |
3 | 7 | 8 |
4 | 9 | 9 |
5 | 10 | 11 |
6 | 11 | 12 |
7 | 12 | 13 |
8 | 12 | 14 |
9 | 13 | 15 |
10 | 13 | 16 |
11 | 14 | 17 |
12 | 14 | 18 |
13 | 14 | 19 |
14 | 14 | 20 |
15 | 15 | 20 |
16 | 15 | 21 |
17 | 15 | 22 |
18 | 15 | 23 |
19 | 16 | 24 |
20 | 16 | 23 |
21 | 17 | 24 |
22 | 17 | 25 |
23 | 18 | 26 |
24 | 19 | 27 |
25 | 20 | 25 |
26 | 22 | 26 |
27 | 24 | 27 |
28 | 28 | 28 |
Both box plots and frequency histograms show the distribution of the data. Box plots and frequency histograms are two different views of the distribution of the data. There is a relationship between the frequency histogram and the associated box plot. The following charts show the frequency histograms and box plots for three distributions: a uniform distribution, a peaked symmetric heap distribution, and a left skewed distribution.
The uniform data is evenly distributed across the range. The whiskers run from the maximum to minimum value and the InterQuartile Range is the largest of the three distributions.
The peaked symmetric data has the smallest InterQuartile Range, the bulk of the data is close to the middle of the distribution. In the box plot this can be seen in the small InterQuartile range centered on the median. The peaked symmetric data has two potential outliers at the minimum and maximum values. For the peaked symmetric distribution data is usually found near the middle of the distribution.
The skewed data has the bulk of the data near the maximum. In the box plot this can be seen by the InterQuartile Range - the box - being "pushed" up towards the maximum value. The whiskers are also of an unequal length, another sign of a skewed distribution.
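The IQR comparison described above can be checked numerically on the three data columns from the table. This sketch uses Python's `statistics.quantiles` (exclusive method); spreadsheet quartile conventions may give slightly different values, but the ordering is the same.

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

uniform = list(range(1, 29))
peaked = [1, 5, 7, 9, 10, 11, 12, 12, 13, 13, 14, 14, 14, 14,
          15, 15, 15, 15, 16, 16, 17, 17, 18, 19, 20, 22, 24, 28]
skewed = [1, 5, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
          20, 21, 22, 23, 24, 23, 24, 25, 26, 27, 25, 26, 27, 28]

print(iqr(uniform), iqr(peaked), iqr(skewed))
# uniform has the widest IQR, peaked symmetric the smallest
assert iqr(uniform) > iqr(skewed) > iqr(peaked)
```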
Creating histograms with spreadsheets
Making histograms with LibreOffice.org Calc
Select both the column with the class and the column with the frequencies.
Click on the chart wizard button. If not selected by default, choose a column chart. Click on Next.
At the second step, the data range step, select "First column as label" as seen in the next image.
Click on Next. At step three there is usually nothing that needs to be done if one has correctly selected their columns prior to starting the chart wizard.
Click on Next. On the next screen fill in the appropriate titles. The legend can be "unchecked" as seen in the next image.
When done, click on Finish. Double click any column in the chart to open up the data series dialog box.
Click on the options tab and set the Spacing to zero percent as seen in the previous image. In the data series dialog box one can alter the background color, add column borders, or make other customizations. Click on OK.
Gnumeric histogram notes
Select both the class upper limits and the frequencies. Choose the chart wizard. At the first step of the chart wizard select the Column chart option. Click on "Use first series as shared abscissa". The first series is the first column, the class upper limits. The abscissa is another word for x-axis.
At step two of two, select PlotBarCol1 and set the Gap to zero.
In step two a title can be added to Chart1 by clicking on Chart1 and then clicking on the Add button. The drop down menu includes the item "Title to Chart1" in alphabetic order on the list. To add a label to Y-Axis1, click on Add and then choose "Label to Y-Axis1". When one has made all desired modifications, click on Insert and then drag to size the chart.
As an anecdote, dragging to choose the size of the chart is the way Microsoft Excel 95 operated. While this may seem retro, this is an instructor's blessing. No two students are going to execute the exact same drag, hence no two homework assignments should have exactly the same size chart in exactly the same location.
Making histograms with Microsoft Excel 97/2000/XP
Select ONLY the column with the frequencies. Click on the chart wizard.
Click on next.
In step 2 of 4, click on the series tab
Click in the Category (X) axis labels text box
Select the class upper limits by dragging with the mouse. Click on next when done.
Fill in the appropriate titles and then click on finish.
Double click any column to open up the Format Data series dialog box.
Click on the options tab and set the gap width to zero.
Click on OK.
Making histograms with Microsoft Excel 2007
Excel 2007 and 2010 are vastly different from earlier versions of Excel. The differences are beyond cosmetic and involve a fundamental shift in the philosophy, the gestalt if you will, of the interface. Excel 2010 made cosmetic improvements to ribbon background colors in an attempt to improve usability.
Note these examples use different data than the examples above. The original data derives from speed of sound measurements made by the physical science class.
Fundamentally the program violates the old precept of reducing the number of modalities in a user interface. A modality is where the user interface shows and hides menus according to a mode setting. Office 2007 turns this on its head and is all about modes. The program opens in the "Home" mode, a basic editing mode. The main menus are replaced by a structure called "the ribbon", seen in the image below.
Home
In the home mode the chart wizard is hidden from view. Click on the Insert tab on the ribbon.
Insert
The charts section of the ribbon is horizontally compressed in the image above. The chart section usually appears as follows.
Charts
Select the data to be charted in the histogram, and then click on the column button.
Select data and then column button.
Select the chart subtype.
Chart subtype selection
The chart appears.
Right click on the chart to pop-up the chart context menu. Choose "Select Data"
Context menu
Remove the class upper limits (CUL) item from the Legend Series column.
Click on "Edit" in the Horizontal (Category) Axis Labels column.
After clicking "Edit" the screen highlights the existing frequency column.
Select the class upper limits (classes). Click OK.
Click OK again. To set the gap width (spacing) to zero, right-mouse click on the series and choose Format Data Series.
Set the gap width to zero.
Gap width setting
The result is a tad cartoonish - borderless columns - but that is a default style for Excel 2007.
Borderless columns
One can delete the legend, but x and y axis labels are usually necessary. Adding these is possibly the most non-obvious step for an OpenOffice.org or Excel 97/2000 user.
Note at the top of the Excel screen that there is a tab marked "Design". The two words to the right are also tabs, camouflaged to not look like tabs. Click on the camouflaged Layout tab.
Now select Axis Titles: Primary Horizontal Axis Title: Title Below Axis sub-sub-menu. This adds an x-axis label which one can then edit.
To obtain a y-axis label, select Axis Titles: Primary Vertical Axis Title: Rotated Title. This will add a y-axis title. Edit that title.
Source: Dana Lee Ling; http://www.comfsm.fm/~dleeling/statistics/text5.html