Setting Up Hypotheses

Site: Saylor Academy
Course: MA121: Introduction to Statistics
Book: Setting Up Hypotheses
Date: Sunday, August 14, 2022, 9:28 PM

Description

This section discusses the logic behind hypothesis testing using concrete examples and explains how to set up null and alternative hypotheses. It explains what Type I and Type II errors are and how they can occur. Finally, it introduces one-tailed and two-tailed tests and explains which one you should use for testing purposes.

Introduction

Learning Objectives

  1. Describe the logic by which it can be concluded that someone can distinguish between two things
  2. State whether random assignment ensures that all uncontrolled sources of variation will be equal
  3. Define precisely what the probability is that is computed to reach the conclusion that a difference is not due to chance
  4. Distinguish between the probability of an event and the probability of a state of the world
  5. Define "null hypothesis"
  6. Be able to determine the null hypothesis from a description of an experiment
  7. Define "alternative hypothesis"

The statistician R. A. Fisher explained the concept of hypothesis testing with a story of a lady tasting tea. Here we will present an example based on James Bond, who insisted that martinis should be shaken rather than stirred. Let's consider a hypothetical experiment to determine whether Mr. Bond can tell the difference between a shaken and a stirred martini. Suppose we gave Mr. Bond a series of 16 taste tests. In each test, we flipped a fair coin to determine whether to stir or shake the martini. Then we presented the martini to Mr. Bond and asked him to decide whether it was shaken or stirred. Let's say Mr. Bond was correct on 13 of the 16 taste tests. Does this prove that Mr. Bond has at least some ability to tell whether the martini was shaken or stirred?

This result does not prove that he does; it could be he was just lucky and guessed right 13 out of 16 times. But how plausible is the explanation that he was just lucky? To assess its plausibility, we determine the probability that someone who was just guessing would be correct 13/16 times or more. This probability can be computed from the binomial distribution, and the binomial distribution calculator shows it to be 0.0106. This is a pretty low probability, and therefore someone would have to be very lucky to be correct 13 or more times out of 16 if they were just guessing. So either Mr. Bond was very lucky, or he can tell whether the drink was shaken or stirred. The hypothesis that he was guessing is not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred.
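The 0.0106 figure can be reproduced without a calculator. The following is a minimal sketch using only the Python standard library; the function name `binomial_upper_tail` is an illustrative choice, not something from the original text.

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Mr. Bond: 13 or more correct out of 16 if he were only guessing (p = 0.5)
p_value = binomial_upper_tail(13, 16)
print(round(p_value, 4))  # 0.0106
```

The sum runs over 13, 14, 15, and 16 correct answers because the question is about doing this well or better, not about getting exactly 13 right.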


Let's consider another example. The case study Physicians' Reactions sought to determine whether physicians spend less time with obese patients. Physicians were sampled randomly and each was shown a chart of a patient complaining of a migraine headache. They were then asked to estimate how long they would spend with the patient. The charts were identical except that for half the charts, the patient was obese and for the other half, the patient was of average weight. The chart a particular physician viewed was determined randomly. Thirty-three physicians viewed charts of average-weight patients and 38 physicians viewed charts of obese patients.

The mean time physicians reported that they would spend with obese patients was 24.7 minutes as compared to a mean of 31.4 minutes for average-weight patients. How might this difference between means have occurred? One possibility is that physicians were influenced by the weight of the patients. On the other hand, perhaps by chance, the physicians who viewed charts of the obese patients tend to see patients for less time than the other physicians. Random assignment of charts does not ensure that the groups will be equal in all respects other than the chart they viewed. In fact, it is certain the two groups differed in many ways by chance. The two groups could not have exactly the same mean age (if measured precisely enough such as in days). Perhaps a physician's age affects how long physicians see patients. There are innumerable differences between the groups that could affect how long they view patients. With this in mind, is it plausible that these chance differences are responsible for the difference in times?

To assess the plausibility of the hypothesis that the difference in mean times is due to chance, we compute the probability of getting a difference as large as or larger than the observed difference (31.4 - 24.7 = 6.7 minutes) if the difference were, in fact, due solely to chance. Using methods presented in another section, this probability can be computed to be 0.0057. Since this is such a low probability, we have confidence that the difference in times is due to the patient's weight and is not due to chance.
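One way to see what "due solely to chance" means is to re-run the random assignment many times and count how often chance alone produces a difference at least as large as the one observed. The sketch below does this with a permutation test, which is a swapped-in technique and not the method that produced the 0.0057 value. The function name and the numbers in the two lists are made up for illustration; they are not the study's data.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """One-tailed p-value: share of random re-assignments whose mean
    difference (a minus b) is at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # a fresh random assignment to the two groups
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / n_shuffles

# Hypothetical reported times in minutes (NOT the data from the case study)
average_weight = [35, 30, 40, 25, 30, 35, 30, 28]
obese = [25, 20, 30, 25, 15, 30, 25, 28]
print(permutation_p_value(average_weight, obese))
```

A small result says that random assignment alone rarely produces a gap this large, which is the same logic the text applies to the 6.7-minute difference.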



Source: David M. Lane, https://onlinestatbook.com/2/logic_of_hypothesis_testing/intro.html
This work is in the Public Domain.

The Probability Value

It is very important to understand precisely what the probability values mean. In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing.

It is easy to mistake this probability of 0.0106 as the probability he cannot tell the difference. This is not at all what it means.


The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing). It is not the probability that a state of the world is true. Although this might seem like a distinction without a difference, consider the following example. An animal trainer claims that a trained bird can determine whether or not numbers are evenly divisible by 7. In an experiment assessing this claim, the bird is given a series of 16 test trials. On each trial, a number is displayed on a screen and the bird pecks at one of two keys to indicate its choice. The numbers are chosen in such a way that the probability of any number being evenly divisible by 7 is 0.50. The bird is correct on 9/16 choices. Using the binomial calculator, we can compute that the probability of being correct nine or more times out of 16 if one is only guessing is 0.40. Since a bird who is only guessing would do this well 40% of the time, these data do not provide convincing evidence that the bird can tell the difference between the two types of numbers. As a scientist, you would be very skeptical that the bird had this ability. Would you conclude that there is a 0.40 probability that the bird can tell the difference? Certainly not! You would think the probability is much lower than 0.0001.

To reiterate, the probability value is the probability of an outcome (9/16 or better) and not the probability of a particular state of the world (the bird was only guessing). In statistics, it is conventional to refer to possible states of the world as hypotheses since they are hypothesized states of the world. Using this terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.

This is not to say that we ignore the probability of the hypothesis. If the probability of the outcome given the hypothesis is sufficiently low, we have evidence that the hypothesis is false. However, we do not compute the probability that the hypothesis is false. In the James Bond example, the hypothesis is that he cannot tell the difference between shaken and stirred martinis. The probability value is low (0.0106), thus providing evidence that he can tell the difference. However, we have not computed the probability that he can tell the difference. A branch of statistics called Bayesian statistics provides methods for computing the probabilities of hypotheses. These computations require that one specify the probability of the hypothesis before the data are considered and, therefore, are difficult to apply in some contexts.
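The gap between the probability of an outcome given a hypothesis and the probability of a hypothesis given an outcome can be made concrete with Bayes' theorem. The sketch below is only an illustration under loudly stated assumptions: the prior probability that the bird has the ability (0.0001) and the accuracy it would have if it did (0.9) are invented numbers, and the likelihood uses the exact outcome of 9 correct out of 16, not the tail probability of 0.40.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_ability(k, n, prior_ability, accuracy_if_able):
    """P(has the ability | k of n correct), by Bayes' theorem.
    Both prior_ability and accuracy_if_able are assumptions the analyst supplies."""
    like_guess = binomial_pmf(k, n, 0.5)              # outcome if only guessing
    like_able = binomial_pmf(k, n, accuracy_if_able)  # outcome if truly able
    numerator = prior_ability * like_able
    return numerator / (numerator + (1 - prior_ability) * like_guess)

# The bird: 9 of 16 correct, with an assumed, very skeptical prior
print(posterior_ability(9, 16, prior_ability=0.0001, accuracy_if_able=0.9))
```

Under these assumptions the posterior probability comes out far below 0.40, which matches the text's point that the probability value is not the probability that the bird can tell the difference. Different priors would give different posteriors, which is why such computations are hard to apply when the prior is unknown.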

The Null Hypothesis

The hypothesis that an apparent effect is due to chance is called the null hypothesis. In the Physicians' Reactions example, the null hypothesis is that in the population of physicians, the mean time expected to be spent with obese patients is equal to the mean time expected to be spent with average-weight patients. This null hypothesis can be written as:

\mu_{\text {obese }}=\mu_{\text {average }}

or as

\mu_{\text {obese }}-\mu_{\text {average }}=0

The null hypothesis in a correlational study of the relationship between high school grades and college grades would typically be that the population correlation is 0. This can be written as

\rho=0

where \rho is the population correlation (not to be confused with r, the correlation in the sample).

Although the null hypothesis is usually that the value of a parameter is 0, there are occasions in which the null hypothesis is a value other than 0. For example, if one were testing whether a subject differed from chance in their ability to determine whether a flipped coin would come up heads or tails, the null hypothesis would be that \pi=0.5.

Keep in mind that the null hypothesis is typically the opposite of the researcher's hypothesis. In the Physicians' Reactions study, the researchers hypothesized that physicians would expect to spend less time with obese patients. The null hypothesis that the two types of patients are treated identically is put forward with the hope that it can be discredited and therefore rejected. If the null hypothesis were true, a difference as large or larger than the sample difference of 6.7 minutes would be very unlikely to occur. Therefore, the researchers rejected the null hypothesis of no difference and concluded that in the population, physicians intend to spend less time with obese patients.

If the null hypothesis is rejected, then the alternative to the null hypothesis (called the alternative hypothesis) is accepted. The alternative hypothesis is simply the reverse of the null hypothesis. If the null hypothesis

\mu_{\text {obese }}=\mu_{\text {average }}

is rejected, then there are two alternatives:

\mu_{\text {obese }} < \mu_{\text {average }}

\mu_{\text {obese }} > \mu_{\text {average }}

Naturally, the direction of the sample means determines which alternative is adopted. Some textbooks have incorrectly argued that rejecting the null hypothesis that two population means are equal does not justify a conclusion about which population mean is larger. Kaiser showed how it is justified to draw a conclusion about the direction of the difference.


Questions

Question 1 out of 3.

Tommy claims that he blindly guessed on a 20-question true/false test, but then he got 16 of the questions correct. Using the binomial calculator, you find out that the probability of getting 16 or more correct out of 20 when \pi = 0.5 is 0.0059. This probability of 0.0059 is the probability that...

  • he would get 80% correct if he took the test again.
  • he would get this score or better if he were just guessing.
  • he was guessing blindly on the test.


Question 2 out of 3.

Random assignment ensures groups will be equal on everything except the variable manipulated.

  • True
  • False


Question 3 out of 3.

The researchers hypothesized that there would be a correlation between how much people studied and their GPAs. The null hypothesis is that the population correlation is equal to

__________

Answers

  1. He would get this score or better if he were just guessing.
    If Tommy were guessing blindly, the probability that he would have gotten 16 or more of the 20 questions right is 0.0059. This is NOT the probability that he was guessing blindly. Remember, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.

  2. False
    Chance differences will still exist.

  3. The null hypothesis says that any apparent effect is due to chance, so in this case, the null hypothesis would be that the population correlation was 0.

Type I and Type II Errors

Learning Objectives

  1. Define Type I and Type II errors
  2. Interpret significant and non-significant differences
  3. Explain why the null hypothesis should not be accepted when the effect is not significant

In the Physicians' Reactions case study, the probability value associated with the significance test is 0.0057. Therefore, the null hypothesis was rejected, and it was concluded that physicians intend to spend less time with obese patients. Despite the low probability value, it is possible that the null hypothesis of no true difference between obese and average-weight patients is true and that the large difference between sample means occurred by chance. If this is the case, then the conclusion that physicians intend to spend less time with obese patients is in error. This type of error is called a Type I error. More generally, a Type I error occurs when a significance test results in the rejection of a true null hypothesis.

By one common convention, if the probability value is below 0.05, then the null hypothesis is rejected. Another convention, although slightly less common, is to reject the null hypothesis if the probability value is below 0.01. The threshold for rejecting the null hypothesis is called the \alpha (alpha) level or simply \alpha. It is also called the significance level. As discussed in the section on significance testing, it is better to interpret the probability value as an indication of the weight of evidence against the null hypothesis than as part of a decision rule for making a reject or do-not-reject decision. Therefore, keep in mind that rejecting the null hypothesis is not an all-or-nothing decision.

The Type I error rate is affected by the α level: the lower the α level, the lower the Type I error rate. It might seem that α is the probability of a Type I error. However, this is not correct. Instead, α is the probability of a Type I error given that the null hypothesis is true. If the null hypothesis is false, then it is impossible to make a Type I error.
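The link between the α level and the Type I error rate can be checked directly for a binomial test. The sketch below assumes the 16-trial, one-tailed setup of the James Bond example and the rule "reject when the probability value is below α"; the function names are illustrative. It finds the smallest number of correct answers that leads to rejection and then computes how often a pure guesser would reach it.

```python
from math import comb

def upper_tail(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, alpha, p_null=0.5):
    """Smallest k whose one-tailed probability under the null is below alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p_null) < alpha:
            return k
    return None  # no count is extreme enough to reject

for alpha in (0.05, 0.01):
    k = critical_count(16, alpha)
    # Type I error rate given that the null hypothesis (guessing) is true
    print(alpha, k, round(upper_tail(k, 16, 0.5), 4))
# 0.05 12 0.0384
# 0.01 14 0.0021
```

Lowering α from 0.05 to 0.01 raises the bar from 12 to 14 correct and lowers the Type I error rate. Because the number of correct answers is a whole number, the rate in this setup falls somewhat below α instead of equaling it exactly.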

The second type of error that can be made in significance testing is failing to reject a false null hypothesis. This kind of error is called a Type II error. Unlike a Type I error, a Type II error is not really an error. When a statistical test is not significant, it means that the data do not provide strong evidence that the null hypothesis is false. Lack of significance does not support the conclusion that the null hypothesis is true. Therefore, a researcher should not make the mistake of incorrectly concluding that the null hypothesis is true when a statistical test was not significant. Instead, the researcher should consider the test inconclusive. Contrast this with a Type I error in which the researcher erroneously concludes that the null hypothesis is false when, in fact, it is true.

A Type II error can only occur if the null hypothesis is false. If the null hypothesis is false, then the probability of a Type II error is called \beta (beta). The probability of correctly rejecting a false null hypothesis equals 1 - \beta and is called power. Power is covered in detail in another section.
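To compute \beta and power, one has to say how false the null hypothesis is. The sketch below is an illustration under assumptions that are not in the original text: a 16-trial taste test, a rule of rejecting the null hypothesis when 12 or more answers are correct (a one-tailed probability of about 0.038 under guessing), and a taster whose true probability of a correct answer is 0.75.

```python
from math import comb

def upper_tail(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_TRIALS = 16
CRITICAL_COUNT = 12  # assumed decision rule: reject when 12 or more are correct
TRUE_PI = 0.75       # assumed true ability, so the null (pi = 0.5) is false

power = upper_tail(CRITICAL_COUNT, N_TRIALS, TRUE_PI)  # correctly rejecting
beta = 1 - power                                        # Type II error
print(round(power, 3), round(beta, 3))  # 0.63 0.37
```

Even with a real ability of this size, the test would fail to reject the null hypothesis about 37% of the time, which is one reason a non-significant result should be treated as inconclusive.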


Questions

Question 1 out of 5.

It has been shown many times that on a certain memory test, recognition is substantially better than recall. However, the probability value for the data from your sample was 0.12, so you were unable to reject the null hypothesis that recall and recognition produce the same results. What type of error did you make?

  • Type I
  • Type II


Question 2 out of 5.

In the population, there is no difference between men and women on a certain test. However, you found a difference in your sample. The probability value for the data was 0.03, so you rejected the null hypothesis. What type of error did you make?

  • Type I
  • Type II


Question 3 out of 5.

As the alpha level gets lower, which error rate also gets lower?

  • Type I
  • Type II


Question 4 out of 5.

Beta is the probability of which kind of error?

  • Type I
  • Type II


Question 5 out of 5.

If the null hypothesis is false, you cannot make which kind of error?

  • Type I
  • Type II

Answers


  1. In this example, there is really a difference in the population between recognition and recall, but you did not find a significant difference in your sample. Failing to reject a false null hypothesis is a Type II error.

  2. There is no difference in the population, but you found a difference in your sample. A Type I error occurs when a significance test results in the rejection of a true null hypothesis.

  3. The Type I error rate is affected by the alpha level; the lower the alpha level is, the lower the Type I error rate gets. Alpha is the probability of a Type I error given that the null hypothesis is true.

  4. The probability of a Type II error is called beta. The probability of correctly rejecting a false null hypothesis equals 1 - beta and is called power.

  5. A Type I error occurs when a significance test results in the rejection of a TRUE null hypothesis.

One- and Two-Tailed Tests

Learning Objectives

  1. Define one-tailed and two-tailed probabilities
  2. State the null and alternative hypotheses for one-tailed and two-tailed tests
  3. Explain when a one-tailed test is appropriate and why two-tailed tests are more common

In the James Bond case study, Mr. Bond was given 16 trials on which he judged whether a martini had been shaken or stirred. He was correct on 13 of the trials. From the binomial distribution, we know that the probability of being correct 13 or more times out of 16 if one is only guessing is 0.0106. Figure 1 shows a graph of the binomial distribution. The red bars show the values greater than or equal to 13. As you can see in the figure, the probabilities are calculated for the upper tail of the distribution. A probability calculated in only one tail of the distribution is called a "one-tailed probability".

Figure 1. The binomial distribution. The upper (right-hand) tail is red.

A slightly different question can be asked of the data: "What is the probability of getting a result as extreme or more extreme than the one observed?" Since the chance expectation is 8/16, a result of 3/16 is just as extreme as 13/16. Thus, to calculate this probability, we would consider both tails of the distribution. Since the binomial distribution is symmetric when \pi=0.5, this probability is exactly double the probability of 0.0106 computed previously. Therefore, p=0.0212. A probability calculated in both tails of a distribution is called a "two-tailed probability" (see Figure 2).

Figure 2. The binomial distribution. Both tails are red.
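The two-tailed probability can also be computed by adding up every outcome that is at least as far from the chance expectation as the observed one. This is a minimal sketch with illustrative function names; it assumes \pi = 0.5, so that distance from n/2 is a sensible measure of extremeness.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_p(k, n):
    """Probability of a result at least as far from n/2 as k, when pi = 0.5."""
    distance = abs(k - n / 2)
    return sum(binomial_pmf(i, n) for i in range(n + 1)
               if abs(i - n / 2) >= distance)

print(round(two_tailed_p(13, 16), 4))  # 0.0213
print(two_tailed_p(13, 16) == two_tailed_p(3, 16))  # True
```

The unrounded result is about 0.0213; the 0.0212 in the text comes from doubling the already rounded 0.0106. The second line confirms that 3 correct and 13 correct are equally extreme.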

Should the one-tailed or the two-tailed probability be used to assess Mr. Bond's performance? That depends on the way the question is posed. If we are asking whether Mr. Bond can tell the difference between shaken or stirred martinis, then we would conclude he could if he performed either much better than chance or much worse than chance. If he performed much worse than chance, we would conclude that he can tell the difference, but he does not know which is which. Therefore, since we are going to reject the null hypothesis if Mr. Bond does either very well or very poorly, we will use a two-tailed probability.

On the other hand, if our question is whether Mr. Bond is better than chance at determining whether a martini is shaken or stirred, we would use a one-tailed probability. What would the one-tailed probability be if Mr. Bond were correct on only 3 of the 16 trials? Since the one-tailed probability is the probability of the right-hand tail, it would be the probability of getting 3 or more correct out of 16. This is a very high probability and the null hypothesis would not be rejected.
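How high is "very high"? A short sketch, again using only the standard library and an illustrative function name, computes the right-hand tail for 3 correct out of 16.

```python
from math import comb

def upper_tail(k, n, p=0.5):
    """Probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One-tailed probability if Mr. Bond were correct on only 3 of 16 trials
print(round(upper_tail(3, 16), 4))  # 0.9979
```

Nearly every guesser would get 3 or more right, so a one-tailed test gives no reason to reject the null hypothesis here, even though the same result is quite extreme under a two-tailed test.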

The null hypothesis for the two-tailed test is \pi = 0.5. By contrast, the null hypothesis for the one-tailed test is \pi \leq 0.5. Accordingly, we reject the two-tailed hypothesis if the sample proportion deviates greatly from 0.5 in either direction. The one-tailed hypothesis is rejected only if the sample proportion is much greater than 0.5. The alternative hypothesis in the two-tailed test is \pi \neq 0.5. In the one-tailed test it is \pi > 0.5.


You should always decide whether you are going to use a one-tailed or a two-tailed probability before looking at the data. Statistical tests that compute one-tailed probabilities are called one-tailed tests; those that compute two-tailed probabilities are called two-tailed tests. Two-tailed tests are much more common than one-tailed tests in scientific research because an outcome signifying that something other than chance is operating is usually worth noting. One-tailed tests are appropriate when it is not important to distinguish between no effect and an effect in the unexpected direction. For example, consider an experiment designed to test the efficacy of a treatment for the common cold. The researcher would only be interested in whether the treatment was better than a placebo control. It would not be worth distinguishing between the case in which the treatment was worse than a placebo and the case in which it was the same because in both cases the drug would be worthless.

Some have argued that a one-tailed test is justified whenever the researcher predicts the direction of an effect. The problem with this argument is that if the effect comes out strongly in the non-predicted direction, the researcher is not justified in concluding that the effect is not zero. Since this is unrealistic, one-tailed tests are usually viewed skeptically if justified on this basis alone.



Questions

Question 1 out of 4.

Select all that apply. Which is/are true of two-tailed tests?

  • They are appropriate when it is not important to distinguish between no effect and an effect in either direction.
  • They are more common than one-tailed tests.
  • They compute two-tailed probabilities.
  • They are more controversial than one-tailed tests.


Question 2 out of 4.

You are testing the difference between college freshmen and seniors on a math test. You think that the seniors will perform better, but you are still interested in knowing if the freshmen perform better. What is the null hypothesis?

  • The mean of the seniors is less than or equal to the mean of the freshmen
  • The mean of the seniors is greater than or equal to the mean of the freshmen
  • The mean of the seniors is equal to the mean of the freshmen


Question 3 out of 4.

You think a coin is biased and will come up heads more often than it will come up tails. What is the probability that out of 22 flips, it will come up heads 16 or more times? (Write your answer out to at least three decimal places).

____________


Question 4 out of 4.

You think a coin is biased, and you are interested in finding out if it is. What is the probability that out of 30 flips, it will come up one side 8 or fewer times? (Write your answer out to at least three decimal places).

____________

Answers


  1. Two-tailed tests look for an effect in either direction, so they compute two-tailed probabilities. They are much more common than one-tailed tests in scientific research because an outcome signifying that something other than chance is operating is usually worth noting. Some people disagree with the use of one-tailed tests except in very specific situations.

  2. Because you are interested in the effect in either direction, you will use a two-tailed test. Thus, the null hypothesis is that the mean of the seniors is equal to the mean of the freshmen.

  3. This question is asking you to compute a one-tailed probability. Using the binomial calculator with the values of N = 22, \pi = 0.5, and greater than or equal to 16, you get p = 0.0262.

  4. This question is asking you to compute a two-tailed probability. The probability that it will come up heads 8 or fewer times is 0.0081. Multiply that by 2 and you get p = 0.0162.
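The answers to Questions 3 and 4 can be checked without the binomial calculator. The sketch below uses only the standard library; the function names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p=0.5):
    """Probability of k or more successes."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

def lower_tail(k, n, p=0.5):
    """Probability of k or fewer successes."""
    return sum(binomial_pmf(i, n, p) for i in range(0, k + 1))

# Question 3: one-tailed, 16 or more heads in 22 flips
print(round(upper_tail(16, 22), 4))      # 0.0262

# Question 4: two-tailed, 8 or fewer of one side in 30 flips
one_tail = lower_tail(8, 30)
print(round(one_tail, 4))                # 0.0081
print(round(2 * one_tail, 4))            # 0.0161
```

Doubling the unrounded tail gives about 0.0161; the 0.0162 in the answer comes from doubling the rounded value 0.0081.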