Descriptive and Inferential Statistics
Site: | Saylor Academy |
Course: | MA121: Introduction to Statistics |
Book: | Descriptive and Inferential Statistics |
Description
Read these sections and complete the questions at the end of each section. Here, we introduce descriptive statistics using examples and discuss the difference between descriptive and inferential statistics. We also talk about samples and populations, explain how you can identify biased samples, and define inferential statistics.
Descriptive Statistics
Learning Objectives
- Define "descriptive statistics"
- Distinguish between descriptive statistics and inferential statistics
Descriptive statistics are numbers that are used to summarize and describe data. The word "data" refers to the information that has been collected from an experiment, a survey, a historical record, etc. (By the way, "data" is plural. One piece of information is called a "datum"). If we are analyzing birth certificates, for example, a descriptive statistic might be the percentage of certificates issued in New York State, or the average age of the mother. Any other number we choose to compute also counts as a descriptive statistic for the data from which the statistic is computed. Several descriptive statistics are often used at one time to give a full picture of the data.
Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand. Generalizing from our data to another set of cases is the business of inferential statistics, which you'll be studying in another section. Here we focus on (mere) descriptive statistics.
Some descriptive statistics are shown in Table 1. The table shows the average salaries for various occupations in the United States in 1999.
Table 1. Average salaries for various occupations in 1999.
Average salary | Occupation |
---|---|
$112,760 | pediatricians |
$106,130 | dentists |
$100,090 | podiatrists |
$76,140 | physicists |
$53,410 | architects |
$49,720 | school, clinical, and counseling psychologists |
$47,910 | flight attendants |
$39,560 | elementary school teachers |
$38,710 | police officers |
$18,980 | floral designers |
Descriptive statistics like these offer insight into American society. It is interesting to note, for example, that we pay the people who educate our children and who protect our citizens a great deal less than we pay people who take care of our feet or our teeth.
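As a rough illustration, the ten figures in Table 1 can themselves be summarized with a few descriptive statistics. The sketch below uses Python's standard `statistics` module; the variable names are chosen here for illustration. Keep in mind that the mean of ten occupational averages is not the average salary of everyone working in those occupations, since the occupations differ in size.

```python
from statistics import mean, median

# Average 1999 salaries copied from Table 1 (US dollars).
salaries = {
    "pediatricians": 112760,
    "dentists": 106130,
    "podiatrists": 100090,
    "physicists": 76140,
    "architects": 53410,
    "school, clinical, and counseling psychologists": 49720,
    "flight attendants": 47910,
    "elementary school teachers": 39560,
    "police officers": 38710,
    "floral designers": 18980,
}

values = list(salaries.values())

# Each of these numbers describes only the ten values at hand.
summary = {
    "mean": mean(values),
    "median": median(values),
    "range": max(values) - min(values),
}

for name, value in summary.items():
    print(f"{name}: ${value:,.0f}")
```

None of these numbers generalizes beyond the ten occupations listed; they only describe the data in the table.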
For more descriptive statistics, consider Table 2, which shows the number of unmarried men per 100 unmarried women in U.S. Metro Areas in 1990. From this table we see that men outnumber women most in Jacksonville, NC, and women outnumber men most in Sarasota, FL. You can see that descriptive statistics can be useful if we are looking for an opposite-sex partner! (These data come from the Information Please Almanac.)
Table 2. Number of unmarried men per 100 unmarried women in U.S. Metro Areas in 1990.
Cities with mostly men | Men per 100 Women | Cities with mostly women | Men per 100 Women |
---|---|---|---|
1. Jacksonville, NC | 224 | 1. Sarasota, FL | 66 |
2. Killeen-Temple, TX | 123 | 2. Bradenton, FL | 68 |
3. Fayetteville, NC | 118 | 3. Altoona, PA | 69 |
4. Brazoria, TX | 117 | 4. Springfield, IL | 70 |
5. Lawton, OK | 116 | 5. Jacksonville, TN | 70 |
6. State College, PA | 113 | 6. Gadsden, AL | 70 |
7. Clarksville-Hopkinsville, TN-KY | 113 | 7. Wheeling, WV | 70 |
8. Anchorage, Alaska | 112 | 8. Charleston, WV | 71 |
9. Salinas-Seaside-Monterey, CA | 112 | 9. St. Joseph, MO | 71 |
10. Bryan-College Station, TX | 111 | 10. Lynchburg, VA | 71 |
NOTE: Unmarried includes never-married, widowed, and divorced persons, 15 years or older.
These descriptive statistics may make us ponder why the numbers are so disparate in these cities. One potential explanation, for instance, as to why there are more women in Florida than men may involve the fact that elderly individuals tend to move down to the Sarasota region and that women tend to outlive men. Thus, more women might live in Sarasota than men. However, in the absence of proper data, this is only speculation.
You probably know that descriptive statistics are central to the world of sports. Every sporting event produces numerous statistics such as the shooting percentage of players on a basketball team. For the Olympic marathon (a foot race of 26.2 miles), we possess data that cover more than a century of competition. (The first modern Olympics took place in 1896). The following table shows the winning times for both men and women (the latter have only been allowed to compete since 1984).
Table 3. Winning Olympic marathon times.
Women
Year | Winner | Country | Time |
---|---|---|---|
1984 | Joan Benoit | USA | 2:24:52 |
1988 | Rosa Mota | POR | 2:25:40 |
1992 | Valentina Yegorova | UT | 2:32:41 |
1996 | Fatuma Roba | ETH | 2:26:05 |
2000 | Naoko Takahashi | JPN | 2:23:14 |
2004 | Mizuki Noguchi | JPN | 2:26:20 |
Men
Year | Winner | Country | Time |
---|---|---|---|
1896 | Spiridon Louis | GRE | 2:58:50 |
1900 | Michel Theato | FRA | 2:59:45 |
1904 | Thomas Hicks | USA | 3:28:53 |
1906 | Billy Sherring | CAN | 2:51:23 |
1908 | Johnny Hayes | USA | 2:55:18 |
1912 | Kenneth McArthur | S. Afr. | 2:36:54 |
1920 | Hannes Kolehmainen | FIN | 2:32:35 |
1924 | Albin Stenroos | FIN | 2:41:22 |
1928 | Boughera El Ouafi | FRA | 2:32:57 |
1932 | Juan Carlos Zabala | ARG | 2:31:36 |
1936 | Sohn Kee-Chung | JPN | 2:29:19 |
1948 | Delfo Cabrera | ARG | 2:34:51 |
1952 | Emil Zátopek | CZE | 2:23:03 |
1956 | Alain Mimoun | FRA | 2:25:00 |
1960 | Abebe Bikila | ETH | 2:15:16 |
1964 | Abebe Bikila | ETH | 2:12:11 |
1968 | Mamo Wolde | ETH | 2:20:26 |
1972 | Frank Shorter | USA | 2:12:19 |
1976 | Waldemar Cierpinski | E.Ger | 2:09:55 |
1980 | Waldemar Cierpinski | E.Ger | 2:11:03 |
1984 | Carlos Lopes | POR | 2:09:21 |
1988 | Gelindo Bordin | ITA | 2:10:32 |
1992 | Hwang Young-Cho | S. Kor | 2:13:23 |
1996 | Josia Thugwane | S. Afr. | 2:12:36 |
2000 | Gezahegne Abera | ETH | 2:10:10 |
2004 | Stefano Baldini | ITA | 2:10:55 |
There are many descriptive statistics that we can compute from the data in the table. To gain insight into the improvement in speed over the years, let us divide the men's times into two pieces, namely, the first 13 races (up to 1952) and the second 13 (starting from 1956). The mean winning time for the first 13 races is 2 hours, 44 minutes, and 22 seconds (written 2:44:22). The mean winning time for the second 13 races is 2:13:19. This is quite a difference (over half an hour). Does this prove that the fastest men are running faster? Or is the difference just due to chance, no more than what often emerges from chance differences in performance from year to year? We can't answer this question with descriptive statistics alone. All we can affirm is that the two means are "suggestive".
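The two means can be recomputed directly from the men's winning times. The following is a minimal sketch; the helper names `to_seconds` and `to_hms` are introduced here for illustration, and the 2000 entry is read as 2:10:10 (an assumption about the intended punctuation of that time).

```python
def to_seconds(time_string):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in time_string.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def to_hms(total_seconds):
    """Format a number of seconds as 'h:mm:ss', rounded to the nearest second."""
    total_seconds = round(total_seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Men's winning times in chronological order, 1896 through 2004.
mens_times = [
    "2:58:50", "2:59:45", "3:28:53", "2:51:23", "2:55:18", "2:36:54",
    "2:32:35", "2:41:22", "2:32:57", "2:31:36", "2:29:19", "2:34:51",
    "2:23:03",  # 1952: last race of the first half
    "2:25:00", "2:15:16", "2:12:11", "2:20:26", "2:12:19", "2:09:55",
    "2:11:03", "2:09:21", "2:10:32", "2:13:23", "2:12:36",
    "2:10:10",  # 2000: assumed reading of the table entry
    "2:10:55",
]

first_half = [to_seconds(t) for t in mens_times[:13]]
second_half = [to_seconds(t) for t in mens_times[13:]]

mean_first = sum(first_half) / len(first_half)
mean_second = sum(second_half) / len(second_half)

print("Mean, first 13 races: ", to_hms(mean_first))
print("Mean, second 13 races:", to_hms(mean_second))
print("Difference:           ", to_hms(mean_first - mean_second))
```

With the times as listed, the results agree with the figures quoted above to within a second, and the difference between the two means is a little over 31 minutes. The code only describes the 26 races; it says nothing about whether the difference could be due to chance.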
Examining Table 3 leads to many other questions. We note that Takahashi (the lead female runner in 2000) would have beaten the male runner in 1956 and all male runners in the first 12 marathons. This fact leads us to ask whether the gender gap will close or remain constant. When we look at the times within each gender, we also wonder how much they will decrease (if at all) in the next century of the Olympics. Might we one day witness a sub-2 hour marathon? The study of statistics can help you make reasonable guesses about the answers to these questions.
Source: Mikki Hebl, https://onlinestatbook.com/2/introduction/descriptive.html This work is in the Public Domain.
Questions
Question 1 out of 3.
Two basic divisions of statistics are
- inferential and descriptive.
- population and sample.
- sampling and scaling.
- mean and median.
Question 2 out of 3.
Check all that apply. Descriptive statistics
- allow random assignment to experimental conditions.
- use data from a sample to answer questions about a population.
- summarize and describe data.
- allow you to generalize beyond the data at hand.
Question 3 out of 3.
Which of the following are descriptive statistics?
- The mean age of people in Detroit.
- The number of people who watched the Super Bowl in the year 2002.
- A prediction of next month's unemployment rate.
- The median price of new homes sold in Miami.
- The height of the tallest woman in the world.
Answers
- The two divisions of statistics are inferential and descriptive. (See the text for more details.)
- Descriptive statistics summarize and describe data. Inferential statistics use data from a sample to answer questions about a population. Inferential statistics involves generalizing beyond the data at hand.
- Descriptive statistics are numbers that are used to summarize and describe data. Predicting next month's unemployment rate involves predicting future data, not describing the data at hand.
Inferential Statistics
Learning Objectives
- Distinguish between a sample and a population
- Define inferential statistics
- Identify biased samples
- Distinguish between simple random sampling and stratified sampling
- Distinguish between random sampling and random assignment
Populations and samples
In statistics, we often rely on a sample (that is, a small subset of a larger set of data) to draw inferences about the larger set. The larger set is known as the population from which the sample is drawn.
Example #1: You have been hired by the National Election Commission to examine how the American people feel about the fairness of the voting procedures in the U.S. Whom will you ask?
It is not practical to ask every single American how he or she feels about the fairness of the voting procedures. Instead, we query a relatively small number of Americans, and draw inferences about the entire country from their responses. The Americans actually queried constitute our sample of the larger population of all Americans. The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under the rubric of inferential statistics.
A sample is typically a small subset of the population. In the case of voting attitudes, we would sample a few thousand Americans drawn from the hundreds of millions that make up the country. In choosing a sample, it is therefore crucial that it not over-represent one kind of citizen at the expense of others. For example, something would be wrong with our sample if it happened to be made up entirely of Florida residents. If the sample held only Floridians, it could not be used to infer the attitudes of other Americans. The same problem would arise if the sample were composed only of Republicans. Inferential statistics are based on the assumption that sampling is random. We trust a random sample to represent different segments of society in close to the appropriate proportions (provided the sample is large enough; see below).
Example #2: We are interested in examining how many math classes have been taken on average by current graduating seniors at American colleges and universities during their four years in school. Whereas our population in the last example included all US citizens, now it involves just the graduating seniors throughout the country. This is still a large set since there are thousands of colleges and universities, each enrolling many students. (New York University, for example, enrolls 48,000 students). It would be prohibitively costly to examine the transcript of every college senior. We therefore take a sample of college seniors and then make inferences to the entire population based on what we find. To make the sample, we might first choose some public and private colleges and universities across the United States. Then we might sample 50 students from each of these institutions. Suppose that the average number of math classes taken by the people in our sample were 3.2. Then we might speculate that 3.2 approximates the number we would find if we had the resources to examine every senior in the entire population. But we must be careful about the possibility that our sample is non-representative of the population. Perhaps we chose an overabundance of math majors, or chose too many technical institutions that have heavy math requirements. Such bad sampling makes our sample unrepresentative of the population of all seniors.
To solidify your understanding of sampling bias, consider the following example. Try to identify the population and the sample, and then reflect on whether the sample is likely to yield the information desired.
Example #3: A substitute teacher wants to know how students in the class did on their last test. The teacher asks the 10 students sitting in the front row to state their latest test score. He concludes from their report that the class did extremely well. What is the sample? What is the population? Can you identify any problems with choosing the sample in the way that the teacher did?
In Example #3, the population consists of all students in the class. The sample is made up of just the 10 students sitting in the front row. The sample is not likely to be representative of the population. Those who sit in the front row tend to be more interested in the class and tend to score higher on tests. Hence, the sample may perform at a higher level than the population.
Example #4: A coach is interested in how many cartwheels the average college freshman at his university can do. Eight volunteers from the freshman class step forward. After observing their performance, the coach concludes that college freshmen can do an average of 16 cartwheels in a row without stopping.
In Example #4, the population is the class of all freshmen at the coach's university. The sample is composed of the 8 volunteers. The sample is poorly chosen because volunteers are more likely to be able to do cartwheels than the average freshman; people who can't do cartwheels probably did not volunteer! In the example, we are also not told the gender of the volunteers. Were they all women, for example? That might affect the outcome, contributing to the non-representative nature of the sample (if the school is co-ed).
Simple Random Sampling
Researchers adopt a variety of sampling strategies. The most straightforward is simple random sampling. Such sampling requires every member of the population to have an equal chance of being selected into the sample. In addition, the selection of one member must be independent of the selection of every other member. That is, picking one member from the population must not increase or decrease the probability of picking any other member (relative to the others). In this sense, we can say that simple random sampling chooses a sample by pure chance. To check your understanding of simple random sampling, consider the following example. What is the population? What is the sample? Was the sample picked by simple random sampling? Is it biased?
Example #5: A research scientist is interested in studying the experiences of twins raised together versus those raised apart. She obtains a list of twins from the National Twin Registry, and selects two subsets of individuals for her study. First, she chooses all those in the registry whose last name begins with Z. Then she turns to all those whose last name begins with B. Because there are so many names that start with B, however, our researcher decides to incorporate only every other name into her sample. Finally, she mails out a survey and compares characteristics of twins raised apart versus together.
In Example #5, the population consists of all twins recorded in the National Twin Registry. It is important that the researcher only make statistical generalizations to the twins on this list, not to all twins in the nation or world. That is, the National Twin Registry may not be representative of all twins. Even if inferences are limited to the Registry, a number of problems affect the sampling procedure we described. For instance, choosing only twins whose last names begin with Z does not give every individual an equal chance of being selected into the sample. Moreover, such a procedure risks over-representing ethnic groups with many surnames that begin with Z. There are other reasons why choosing just the Z's may bias the sample. Perhaps such people are more patient than average because they often find themselves at the end of the line! The same problem occurs with choosing twins whose last name begins with B. An additional problem for the B's is that the "every-other-one" procedure prevents two adjacent names on the B part of the list from both being selected. This defect alone means the sample was not formed through simple random sampling.
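The contrast between simple random sampling and the "every-other-one" procedure can be sketched in a few lines of Python. The registry below is hypothetical: it is just a list of 1,000 positions standing in for names on a list, and the sample size of 50 and the seed are arbitrary choices made for illustration.

```python
import random

# Hypothetical registry: positions 0..999 stand in for names on an ordered list.
registry = list(range(1000))

# Simple random sampling: random.sample draws without replacement, so every
# subset of 50 positions is equally likely to be chosen.
rng = random.Random(42)  # fixed seed only so the sketch is reproducible
simple_random = rng.sample(registry, k=50)

# The "every-other-one" procedure: take every second position on the list.
every_other = registry[::2]


def has_adjacent(positions):
    """Return True if any two selected positions are neighbors on the list."""
    ordered = sorted(positions)
    return any(b - a == 1 for a, b in zip(ordered, ordered[1:]))


print("Simple random sample contains neighbors:", has_adjacent(simple_random))
print("Every-other sample contains neighbors:  ", has_adjacent(every_other))
```

Whatever seed is used, the every-other sample can never contain two neighboring names, so some samples are impossible under that procedure. Under simple random sampling, no sample of the chosen size is ruled out.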
Sample size matters
Recall that the definition of a random sample is a sample in which every member of the population has an equal chance of being selected. This means that the sampling procedure rather than the results of the procedure define what it means for a sample to be random. Random samples, especially if the sample size is small, are not necessarily representative of the entire population. For example, if a random sample of 20 subjects were taken from a population with an equal number of males and females, there would be a nontrivial probability (0.06) that 70% or more of the sample would be female. Such a sample would not be representative, although it would be drawn randomly. Only a large sample size makes it likely that our sample is close to representative of the population. For this reason, inferential statistics take into account the sample size when generalizing results from samples to populations. In later chapters, you'll see what kinds of mathematical techniques ensure this sensitivity to sample size.
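The probability quoted above can be checked with a short calculation. The sketch below assumes the population is large enough that each of the 20 draws can be treated as an independent 50/50 event (a binomial model); the function name `prob_at_least` is introduced here for illustration. Seventy percent of 20 is 14, so we need the probability of 14 or more females.

```python
from math import comb


def prob_at_least(k_min, n, p=0.5):
    """Binomial probability of k_min or more successes in n independent trials."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# Sample of 20: probability that 14 or more (70% or more) are female.
p_small_sample = prob_at_least(14, 20)
print(f"n = 20:  {p_small_sample:.4f}")

# Sample of 200: probability that 140 or more (70% or more) are female.
p_large_sample = prob_at_least(140, 200)
print(f"n = 200: {p_large_sample:.2e}")
```

The first value rounds to 0.06, while the second is vanishingly small, which illustrates why a larger random sample is much more likely to be close to representative.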
More complex sampling
Sometimes it is not feasible to build a sample using simple random sampling. To see the problem, consider the fact that both Dallas and Houston are competing to be hosts of the 2012 Olympics. Imagine that you are hired to assess whether most Texans prefer Houston to Dallas as the host, or the reverse. Given the impracticality of obtaining the opinion of every single Texan, you must construct a sample of the Texas population. But now notice how difficult it would be to proceed by simple random sampling. For example, how will you contact those individuals who don't vote and don't have a phone? Even among people you find in the telephone book, how can you identify those who have just relocated to California (and had no reason to inform you of their move)? What do you do about the fact that since the beginning of the study, an additional 4,212 people took up residence in the state of Texas? As you can see, it is sometimes very difficult to develop a truly random procedure. For this reason, other kinds of sampling techniques have been devised. We now discuss two of them.
Random Assignment
In experimental research, populations are often hypothetical. For example, in an experiment comparing the effectiveness of a new anti-depressant drug with a placebo, there is no actual population of individuals taking the drug. In this case, a specified population of people with some degree of depression is defined and a random sample is taken from this population. The sample is then randomly divided into two groups; one group is assigned to the treatment condition (drug) and the other group is assigned to the control condition (placebo). This random division of the sample into two groups is called random assignment. Random assignment is critical for the validity of an experiment. For example, consider the bias that could be introduced if the first 20 subjects to show up at the experiment were assigned to the experimental group and the second 20 subjects were assigned to the control group. It is possible that subjects who show up late tend to be more depressed than those who show up early, thus making the experimental group less depressed than the control group even before the treatment was administered.
In experimental research of this kind, failure to assign subjects randomly to groups is generally more serious than having a non-random sample. Failure to randomize (the former error) invalidates the experimental findings. A non-random sample (the latter error) simply restricts the generalizability of the results.
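Random assignment is straightforward to sketch in code. The example below assumes 40 hypothetical subjects, labeled in the order they arrived, and splits them into two groups of 20 by shuffling; the function name `randomly_assign` and the subject labels are made up for illustration.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy, so the arrival order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subjects, listed in the order they showed up.
subjects = [f"subject_{i:02d}" for i in range(1, 41)]

treatment, control = randomly_assign(subjects, seed=0)

print("Treatment group (drug):", sorted(treatment))
print("Control group (placebo):", sorted(control))
```

Because group membership depends only on the shuffle, arrival time (or anything related to it) cannot systematically differ between the two groups, unlike the "first 20 versus second 20" scheme described above.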
Stratified Sampling
Since simple random sampling often does not ensure a representative sample, a sampling method called stratified random sampling is sometimes used to make the sample more representative of the population. This method can be used if the population has a number of distinct "strata" or groups. In stratified sampling, you first identify the members of the population who belong to each group. Then you randomly sample from each of those subgroups in such a way that the sizes of the subgroups in the sample are proportional to their sizes in the population.
Let's take an example: Suppose you were interested in views of capital punishment at an urban university. You have the time and resources to interview 200 students. The student body is diverse with respect to age; many older people work during the day and enroll in night courses (average age is 39), while younger students generally enroll in day classes (average age of 19). It is possible that night students have different views about capital punishment than day students. If 70% of the students were day students, it makes sense to ensure that 70% of the sample consisted of day students. Thus, your sample of 200 students would consist of 140 day students and 60 night students. The proportion of day students in the sample and in the population (the entire university) would be the same. Inferences to the entire population of students at the university would therefore be more secure.
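The proportional allocation in this example can be sketched as follows. The enrollment figures (7,000 day students and 3,000 night students) are hypothetical, chosen only so that 70% of the student body are day students; the function name `stratified_sample` is likewise an illustration, not a standard library routine.

```python
import random


def stratified_sample(strata, total_n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion
    to that stratum's share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; rounding can make the total differ
        # slightly from total_n when the shares are not whole numbers.
        n_from_stratum = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, n_from_stratum)
    return sample


# Hypothetical student body: 70% day students, 30% night students.
strata = {
    "day": [f"day_{i}" for i in range(7000)],
    "night": [f"night_{i}" for i in range(3000)],
}

sample = stratified_sample(strata, total_n=200, seed=1)

for name, members in sample.items():
    print(f"{name}: {len(members)} students")
```

With these assumed enrollments, the sample contains 140 day students and 60 night students, matching the proportions in the population.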
Questions
Question 1 out of 8.
Our data come from _______, but we really care most about ______.
- theories; mathematical models
- samples; populations
- populations; samples
- subjective methods; objective methods
Question 2 out of 8.
A random sample
- is more likely to be representative of the population than any other kind of sample.
- is always representative of the population.
- allows you to directly calculate the parameters of the population.
- all of the above are true.
- all of the above are false.
Question 3 out of 8.
When participants who arrive for a research study are put into treatment groups on the basis of chance,
- random sampling has occurred.
- random assignment has occurred.
- the statistical conclusions will also be absolutely correct.
- the research findings will be compromised because you should never randomly assign to groups.
Question 4 out of 8.
Uncertainty regarding conclusions about a population can be eliminated if you
a. use a large random sample.
b. obtain data from all members of the population.
c. depend upon the t-distribution.
d. both a and b.
Question 5 out of 8.
Which of the following is (are) true? Using a random sample
- is to accept some uncertainty about the conclusions.
- enables you to calculate statistics.
- is to risk drawing the wrong conclusions about the population.
- biases your results.
Question 6 out of 8.
A random sample is one
- that is haphazard.
- that is unplanned.
- in which every sample of a particular size has an equal probability of being selected.
- that ensures that there will be no uncertainty in the conclusions.
Question 7 out of 8.
Which of the following is a random sample of a college student body?
- Every fifth person coming out of the Campus Center between 8:30am and 10:00am.
- Lisa Meyer, Todd Jones, and Maria Rivera, whose ID numbers were picked from a table of random numbers.
- Every 20th person in the student directory.
- All are examples of random samples.
Question 8 out of 8.
A biased sample is one that
- is too small.
- will always lead to a wrong conclusion.
- will likely have certain groups from the population over-represented or under-represented due only to chance factors.
- will likely have groups from the population over-represented or under-represented due to systematic sampling factors.
- is always a good and useful sample.
Answers
- Samples; populations. We study a sample to allow us to draw inferences about the population.
- All of the above are false. Stratified sampling is more likely to be representative of the population than random sampling.
- Random assignment has occurred. Random assignment has occurred because the decision as to which subject goes into which group is random.
- The only way to eliminate uncertainty is to obtain data from the whole population. You can reduce uncertainty with a large sample.
- All of the above except "biases your results". Random sampling does not produce bias, which means systematic rather than random error.
- A random sample is defined as one in which every sample of a particular size has an equal probability of being selected.
- The correct choice is: Lisa Meyer, Todd Jones, and Maria Rivera, whose ID numbers were picked from a table of random numbers.
- Will likely have groups from the population over-represented or under-represented due to systematic sampling factors. Only when the sampling is systematically favoring one group or another is the sample biased. Random samples, although they can be different from the population, are not biased. Bias is defined by the procedure for drawing the sample, not by the result.