Chi-Square Distributions and Goodness of Fit

Site: Saylor Academy
Course: MA121: Introduction to Statistics
Book: Chi-Square Distributions and Goodness of Fit

Description

Read these sections, which discuss chi-square distributions and how to test the goodness of fit. While these sections are optional, studying them may help you if you wish to take the Saylor Direct Credit exam for this course.

Chi Square Distribution

Learning Objectives

  1. Define the Chi Square distribution in terms of squared normal deviates
  2. Describe how the shape of the Chi Square distribution changes as its degrees of freedom increase

A standard normal deviate is a random sample from the standard normal distribution. The Chi Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution are equal to the number of standard normal deviates being summed. Therefore, Chi Square with one degree of freedom, written as \chi^{2}(1), is simply the distribution of a single normal deviate squared. The area of a \chi^{2}(1) distribution below 4 is the same as the area of a standard normal distribution between -2 and 2, since a squared deviate is below 4=2^{2} exactly when the deviate itself lies between -2 and 2.
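The link between one squared deviate and the standard normal distribution can be checked numerically. The following is a minimal sketch using only the Python standard library; the seed, the sample size, and the variable names are arbitrary choices made for illustration. It compares the exact probability that a standard normal deviate lies between -2 and 2 with the simulated proportion of squared deviates that fall below 4.

```python
import math
import random

# Exact P(-2 < Z < 2) for a standard normal Z, via the error function.
exact = math.erf(2 / math.sqrt(2))

# Monte Carlo check: square single standard normal deviates and count
# how often the square falls below 4. Seed and n are arbitrary choices.
rng = random.Random(0)
n = 100_000
below_4 = sum(rng.gauss(0, 1) ** 2 < 4 for _ in range(n))
simulated = below_4 / n

print(f"exact P(Z^2 < 4)     = {exact:.4f}")
print(f"simulated P(Z^2 < 4) = {simulated:.4f}")
```

The two numbers should agree to about two decimal places; the small gap is sampling error from the simulation.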

Consider the following problem: you sample two scores from a standard normal distribution, square each score, and sum the squares. What is the probability that the sum of these two squares will be six or higher? Since two scores are sampled, the answer can be found using the Chi Square distribution with two degrees of freedom. A Chi Square calculator can be used to find that the probability of a Chi Square (with 2 df) being six or higher is 0.050.
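For two degrees of freedom, no calculator is strictly needed: the upper-tail probability has the closed form e^{-x/2}, a standard result that the text does not state. A minimal sketch, with a hypothetical helper name `chi2_sf_2df`:

```python
import math

def chi2_sf_2df(x):
    """P(X >= x) for a Chi Square variable with 2 df, using exp(-x/2)."""
    return math.exp(-x / 2) if x > 0 else 1.0

p = chi2_sf_2df(6)
print(f"{p:.3f}")  # prints 0.050
```

This closed form applies only to 2 df; other degrees of freedom need a different formula or a calculator.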

The mean of a Chi Square distribution is its degrees of freedom. Chi Square distributions are positively skewed, with the degree of skew decreasing as the degrees of freedom increase. As the degrees of freedom increase, the Chi Square distribution approaches a normal distribution. Figure 1 shows density functions for three Chi Square distributions. Notice how the skew decreases as the degrees of freedom increase.
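Both claims, that the mean equals the degrees of freedom and that the skew shrinks as the degrees of freedom grow, can be illustrated by simulation. The sketch below uses only the standard library; the function names, the seed, and the sample size of 50,000 are assumptions made for the example. For reference, the theoretical skewness of a Chi Square distribution is \sqrt{8/df}, a standard result not given in the text.

```python
import random

def simulate_chi_square(df, n, rng):
    """Draw n sums of df squared standard normal deviates."""
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n)]

def mean_and_skew(values):
    """Return the sample mean and the moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return mean, m3 / m2 ** 1.5

rng = random.Random(1)  # arbitrary seed for reproducibility
results = {}
for df in (2, 4, 6):
    results[df] = mean_and_skew(simulate_chi_square(df, 50_000, rng))
    mean, skew = results[df]
    print(f"df={df}: mean ~ {mean:.2f}, skewness ~ {skew:.2f}")
```

The simulated means should sit close to 2, 4, and 6, and the skewness values should fall steadily across the three distributions.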


Figure 1. Chi Square distributions with 2, 4, and 6 degrees of freedom.


The Chi Square distribution is very important because many test statistics are approximately distributed as Chi Square. Two of the more common tests using the Chi Square distribution are tests of deviations between theoretically expected and observed frequencies (one-way tables) and tests of the relationship between categorical variables (contingency tables). Numerous other tests beyond the scope of this work are based on the Chi Square distribution.



Source: David M. Lane, https://onlinestatbook.com/2/chi_square/distribution.html
This work is in the Public Domain.

Questions

Question 1 out of 4.

Imagine that you sample 12 scores from a standard normal distribution, square each score, and sum the squares. How many degrees of freedom does the Chi Square distribution that corresponds to this sum have?

______________


Question 2 out of 4.

What is the mean of a Chi Square distribution with 8 degrees of freedom?

______________


Question 3 out of 4.

Which Chi Square distribution looks the most like a normal distribution?

  • A Chi Square distribution with 0 df
  • A Chi Square distribution with 1 df
  • A Chi Square distribution with 2 df
  • A Chi Square distribution with 10 df


Question 4 out of 4.

Imagine that you sample 3 scores from a standard normal distribution, square each score, and sum the squares. What is the probability that the sum of these 3 squares will be 9 or higher?

______________

Answers

1 out of 4

The degrees of freedom of the Chi Square distribution are equal to the number of standard normal deviates being summed (which is 12 in this case).


2 out of 4

The mean of a Chi Square distribution is its degrees of freedom, so the mean here is 8.


3 out of 4

As the degrees of freedom of a Chi Square distribution increase, the Chi Square distribution begins to look more and more like a normal distribution. Thus, out of these choices, a Chi Square distribution with 10 df would look the most similar to a normal distribution.


4 out of 4

Because three scores are sampled, the answer can be found using the Chi Square distribution with three degrees of freedom. A Chi Square calculator can be used to find that the probability of a Chi Square (with 3 df) being 9 or higher is .0293.
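As a cross-check on the calculator value, the upper-tail probability for three degrees of freedom can be written with the complementary error function. This closed form is a standard result rather than something stated in the text, and the helper name `chi2_sf_3df` is hypothetical.

```python
import math

def chi2_sf_3df(x):
    """P(X >= x) for a Chi Square variable with 3 df (closed form)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return tail + correction

p = chi2_sf_3df(9)
print(f"{p:.4f}")  # prints 0.0293
```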

One-Way Tables (Testing Goodness of Fit)

Learning Objectives

  1. Describe what it means for there to be theoretically-expected frequencies
  2. Compute expected frequencies
  3. Compute Chi Square
  4. Determine the degrees of freedom

The Chi Square distribution can be used to test whether observed data differ significantly from theoretical expectations. For example, for a fair six-sided die, the probability of any given outcome on a single roll would be 1/6. The data in Table 1 were obtained by rolling a six-sided die 36 times. However, as can be seen in Table 1, some outcomes occurred more frequently than others. For example, a "3" came up nine times, whereas a "4" came up only two times. Are these data consistent with the hypothesis that the die is a fair die? Naturally, we do not expect the sample frequencies of the six possible outcomes to be the same since chance differences will occur. So, the finding that the frequencies differ does not mean that the die is not fair. One way to test whether the die is fair is to conduct a significance test. The null hypothesis is that the die is fair. This hypothesis is tested by computing the probability of obtaining frequencies as discrepant or more discrepant from a uniform distribution of frequencies as obtained in the sample. If this probability is sufficiently low, then the null hypothesis that the die is fair can be rejected.

Table 1. Outcome Frequencies from a Six-Sided Die.

Outcome   Frequency
1         8
2         5
3         9
4         2
5         7
6         5


The first step in conducting the significance test is to compute the expected frequency for each outcome given that the null hypothesis is true. For example, the expected frequency of a "1" is 6 since the probability of a "1" coming up is 1/6 and there were a total of 36 rolls of the die.

\text{Expected frequency}=(1/6)(36)=6

Note that the expected frequencies are expected only in a theoretical sense. We do not really "expect" the observed frequencies to match the "expected frequencies" exactly.

The calculation continues as follows. Letting E be the expected frequency of an outcome and O be the observed frequency of that outcome, compute

\frac{(E-O)^{2}}{E}

for each outcome. Table 2 shows these calculations.


Table 2. Expected (E) and Observed (O) Frequencies for the Six-Sided Die.

Outcome   E   O   \frac{(E-O)^{2}}{E}
1         6   8   0.667
2         6   5   0.167
3         6   9   1.500
4         6   2   2.667
5         6   7   0.167
6         6   5   0.167


Next we add up all the values in Column 4 of Table 2.

\sum \frac{(E-O)^{2}}{E}=5.333
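The arithmetic for the die example can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative, and the observed counts are copied from Table 1.

```python
# Observed counts for each face of the die (Table 1).
observed = {1: 8, 2: 5, 3: 9, 4: 2, 5: 7, 6: 5}

total_rolls = sum(observed.values())     # 36 rolls in total
expected = total_rolls / len(observed)   # (1/6)(36) = 6 under a fair die

# One (E - O)^2 / E term per outcome.
contributions = {
    face: (expected - count) ** 2 / expected
    for face, count in observed.items()
}
chi_square = sum(contributions.values())

for face, value in contributions.items():
    print(f"outcome {face}: {value:.3f}")
print(f"sum = {chi_square:.3f}")  # sum = 5.333
```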

The sampling distribution of

\sum \frac{(E-O)^{2}}{E}

is approximately distributed as Chi Square with k-1 degrees of freedom, where k is the number of categories. A die has six possible outcomes, so k-1=5, and for this problem the test statistic is

\chi_{5}^{2}=5.333

which means the value of Chi Square with 5 degrees of freedom is 5.333.

From a Chi Square calculator it can be determined that the probability of a Chi Square of 5.333 or larger is 0.377. Therefore, the null hypothesis that the die is fair cannot be rejected.
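If no Chi Square calculator is at hand, the upper-tail probability for a whole-number degrees of freedom can be built up from a standard recurrence for the incomplete gamma function. The sketch below is one possible implementation, not the method used by the calculator mentioned above; the function name `chi2_sf` and the 0.05 significance level are assumptions made for illustration.

```python
import math

def chi2_sf(x, df):
    """P(X >= x) for a Chi Square variable with positive integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # starting value for 2 df
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # starting value for 1 df
    # Each step adds two degrees of freedom.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

chi_square = 5.333
df = 6 - 1                      # six outcomes, so 5 degrees of freedom
p_value = chi2_sf(chi_square, df)
alpha = 0.05                    # assumed conventional significance level

print(f"p = {p_value:.3f}")
print("reject fair die" if p_value < alpha else "cannot reject fair die")
```

The printed probability should come out close to the 0.377 quoted above, which is well above 0.05.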

This Chi Square test can also be used to test other deviations between expected and observed frequencies. The following example shows a test of whether the variable "University GPA" in the SAT and College GPA case study is normally distributed.

The first column in Table 3 shows the normal distribution divided into four ranges, expressed in standard deviation units. The second column shows the proportions of a normal distribution falling in the ranges specified in the first column. The expected frequencies (E) are calculated by multiplying the number of scores (105) by the proportion. The final column shows the observed number of scores in each range. It is clear that the observed frequencies vary greatly from the expected frequencies. Note that if the distribution were normal, then there would have been only about 35 scores between 0 and 1, whereas 60 were observed.

Table 3. Expected and Observed Scores for 105 University GPA Scores.

Range      Proportion   E        O
Above 1    0.159        16.695    9
0 to 1     0.341        35.805   60
-1 to 0    0.341        35.805   17
Below -1   0.159        16.695   19


The test of whether the observed scores deviate significantly from the expected scores is computed using the familiar calculation.

\chi_{3}^{2}=\sum \frac{(E-O)^{2}}{E}=30.09

The subscript "3" means there are three degrees of freedom. As before, the degrees of freedom equal the number of categories minus 1, which is 4-1=3 in this example. The Chi Square distribution calculator shows that p < 0.001 for this Chi Square. Therefore, the null hypothesis that the scores are normally distributed can be rejected.
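The whole GPA calculation can be retraced in code. This sketch assumes the ranges in Table 3 are in standard deviation units, which is consistent with the tabled proportions, and it rounds the proportions to three decimals as the table does. The helper `normal_cdf` and the variable names are illustrative, and the p-value uses a closed form that holds for 3 degrees of freedom only.

```python
import math

def normal_cdf(z):
    """Cumulative probability of a standard normal deviate."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Proportion of a normal distribution in each range (assumed z units).
proportions = {
    "Above 1": 1 - normal_cdf(1),
    "0 to 1": normal_cdf(1) - normal_cdf(0),
    "-1 to 0": normal_cdf(0) - normal_cdf(-1),
    "Below -1": normal_cdf(-1),
}
observed = {"Above 1": 9, "0 to 1": 60, "-1 to 0": 17, "Below -1": 19}
n = sum(observed.values())  # 105 scores

# Round to three decimals, as in Table 3, before computing E.
expected = {k: n * round(p, 3) for k, p in proportions.items()}
chi_square = sum(
    (expected[k] - observed[k]) ** 2 / expected[k] for k in observed
)

# Upper-tail probability for 3 df (closed form, 3 df only).
x = chi_square
p_value = math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(f"chi square = {chi_square:.2f}")  # chi square = 30.09
print(f"p < 0.001: {p_value < 0.001}")   # p < 0.001: True
```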

Questions

Question 1 out of 2.

You buy a bag of 40 lollipops. This bag has 4 different colors of lollipops in it. You are curious if all 4 colors were equally likely to be put in the bag or whether certain colors were more likely. If all four colors were equally likely to be put in the bag, what would be the expected number of lollipops of each color?

_________


Question 2 out of 2.

Suppose now that you open the lollipops to find out that you have 8 red, 5 green, 12 orange, and 15 blue. Test the null hypothesis that the colors of the lollipops occur with equal frequency. What is the Chi Square value you get?

_________

Answers


  1. If all four colors were equally likely to be put in the bag, then the expected frequency for a given color would be one-fourth of the lollipops. So, the expected frequency would be (1/4)(40)=10. (Of course, this is the theoretical expected frequency, not what we actually expect the bag to look like.)

  2. Sum \frac{(E-O)^{2}}{E} over the four colors, with E=10 for each color: \frac{(10-8)^{2}}{10}+\frac{(10-5)^{2}}{10}+\frac{(10-12)^{2}}{10}+\frac{(10-15)^{2}}{10}=5.8
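Both answers can be verified with a short script. This is a minimal sketch; the function name `chi_square_statistic` is hypothetical, and the color order simply follows the question (red, green, orange, blue).

```python
def chi_square_statistic(observed, expected):
    """Sum of (E - O)^2 / E over all categories."""
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 5, 12, 15]                 # red, green, orange, blue
total = sum(observed)                     # 40 lollipops
expected = [total / len(observed)] * len(observed)  # 10 per color

stat = chi_square_statistic(observed, expected)
df = len(observed) - 1                    # four colors, so 3 df

print(f"expected per color = {expected[0]:.0f}")  # expected per color = 10
print(f"chi square = {stat:.1f} with {df} df")    # chi square = 5.8 with 3 df
```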