Unit 3: Sampling Distributions

3a. Apply the central limit theorem to approximate sampling distributions

  • Define the sampling distribution of a mean.

Think again about the difference between sample and population statistics. When you take a sample from a population and measure its sample mean, then take a second sample and measure its mean, and keep going, the set of sample means will have its own distribution.

The central limit theorem states that if the original population has a normal distribution, OR the sample size is sufficiently large (n ≥ 30 as a rule of thumb, though it can be somewhat lower if the original population is close to normal), then the sample means themselves will be normally distributed, with mean \mu _{\overline{x}}\approx \mu _{x} and standard deviation \sigma _{\overline{x}}\approx \frac{\sigma_{x} }{\sqrt{n}}. We often call this standard deviation the standard error.
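A quick simulation can make this concrete. The sketch below (assuming Python with only the standard library; the exponential population and sample size n = 36 are illustrative choices, not from the text) draws repeated samples from a clearly non-normal population with mean 1 and standard deviation 1, and checks that the sample means cluster around \mu with spread close to \sigma/\sqrt{n} = 1/6:

```python
import random
import statistics

random.seed(42)

# Population: exponential with rate 1 (mean 1, sd 1) -- clearly not normal.
# Draw 10,000 samples of size n = 36 and record each sample's mean.
n = 36
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(10_000)
]

mu_of_means = statistics.mean(sample_means)   # close to the population mean, 1
se_of_means = statistics.stdev(sample_means)  # close to 1 / sqrt(36) ~ 0.167
print(mu_of_means, se_of_means)
```

Even though each individual observation comes from a skewed distribution, a histogram of `sample_means` would look bell-shaped, which is exactly what the theorem promises.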

Review this material in: 


3b. Describe the role of sampling distributions in inferential statistics

  • How does the process of finding probabilities differ for a normal random variable versus the sampling distribution?

There is a subtle difference between the probabilities we are finding now and those we were finding at the end of Unit 2. When we first learned about probabilities from normal distributions, we were solving problems such as: If the population has mean and standard deviation of 10 and 2 respectively, find the probability that a random variable will have a value between 11 and 14.

The difference here is subtle, but important: If the population has mean and standard deviation of 10 and 2 respectively, find the probability that the mean of a sample of 10 taken from this population will have a value between 11 and 14.

Another way to think of this is that earlier you were working with sample size n = 1, so the standard error formula above reduces to the population standard deviation, since you are dividing by \sqrt{1} = 1. You still find the Z score, but rather than divide by \sigma, you must divide by \sigma/{\sqrt{n}}. The rest of the process is the same: take those Z scores and use the tables or technology to find the appropriate areas under the curve.
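The two versions of the problem above can be worked side by side. This sketch (assuming Python 3.8+ for `statistics.NormalDist`) uses the numbers from the example: population mean 10, standard deviation 2, and a sample of size 10.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 10, 2, 10          # values from the example above
se = sigma / sqrt(n)              # standard error of the sample mean

# Unit 3 question: P(11 < x-bar < 14) for the mean of a sample of 10.
z_low = (11 - mu) / se
z_high = (14 - mu) / se
p_mean = NormalDist().cdf(z_high) - NormalDist().cdf(z_low)

# Unit 2 question: P(11 < X < 14) for a single observation.
p_single = NormalDist(mu, sigma).cdf(14) - NormalDist(mu, sigma).cdf(11)

print(round(p_mean, 4), round(p_single, 4))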

Review this material in: 


3c. Interpret and create graphs of a probability distribution for the mean of a discrete variable

  • What is the mean of a discrete variable and how is it similar/different from the mean of a set of numbers?

The mean of a discrete random variable (also known as the expected value) is equal to the mean of a set of numbers that perfectly represents that discrete distribution. Of course, an actual random sample will not represent the distribution perfectly, so the two numbers will usually differ slightly.

Take a very simple discrete distribution: P(X = 0) = 0.5, P(X = 1) = 0.2, P(X = 5) = 0.3. By the rules for the mean of a discrete distribution (see resources below), the mean is 0.5(0) + 0.2(1) + 0.3(5) = 1.7.

Now, let's take 10 numbers that perfectly represent this distribution: 0, 0, 0, 0, 0, 1, 1, 5, 5, 5. The sum is 17 and 17/10 = 1.7. So we get the same number from taking the mean of the distribution vs taking the mean of numbers from that distribution. Of course if we truly sample from the above distribution, because of randomness we're not going to get those 10 exact numbers, so taking the mean of 10 numbers drawn randomly from that distribution won't be exactly 1.7.
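The same calculation can be written in a few lines of Python. This is a minimal sketch using the distribution and the ten perfectly representative numbers from the text:

```python
# Distribution from the text: P(X=0)=0.5, P(X=1)=0.2, P(X=5)=0.3
dist = {0: 0.5, 1: 0.2, 5: 0.3}

# Mean (expected value) of the discrete distribution: sum of x * P(x)
mean_of_dist = sum(x * p for x, p in dist.items())
print(mean_of_dist)

# Ten numbers that perfectly represent the distribution
values = [0, 0, 0, 0, 0, 1, 1, 5, 5, 5]
print(sum(values) / len(values))
```

Both print statements give 1.7, confirming that the expected value of the distribution and the mean of a perfectly representative set of numbers agree.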

Review this material in Introduction to Sampling Distributions.


3d. Describe a sampling distribution in terms of repeated sampling

  • What does it mean to take repeated samples from a population and what implication does this have?

Taking the means from repeated samples creates a sampling distribution. You can find the probability of \overline{x} in roughly the same way you found the probability of x in Unit 2.

The formula for finding a Z score is different because you divide by the standard error rather than the standard deviation: Z = \frac{\overline{x}-\mu}{\sigma/\sqrt{n}}.

This is important because statistics is ultimately about making predictions about populations based on a sample (inferential statistics). A major step toward being able to calculate margins of error is not only observing the properties of distributions, but seeing how the resulting samples behave.


3e. Define and compute the mean and standard deviation of the sampling distribution of population proportion p

  • What is the sampling distribution of a population proportion and how does it differ from the sampling distribution of a mean?
  • How is the process for finding probabilities different? Can we still use the Z distribution?

Similar to the sampling distribution of the mean, you can take repeated samples from a population where p is the proportion having a certain characteristic, and the sample proportions will be normally distributed if the number sampled each time is sufficiently large.

A good rule of thumb is that given the parameters n = sample size and p = population proportion, we can use the normal distribution if np and n(1 – p) are both greater than 10 and p is not too close to 0 or 1. With a very small or large value for p, the distribution becomes very right or left skewed, and the process for finding areas based on the Z score is unreliable.

The mean is \mu _{\widehat{p}}\approx p and the standard deviation is \sigma _{\widehat{p}}\approx \frac{\sqrt{p(1-p)}}{\sqrt{n}}.

Note that the formulas use the population proportion p rather than the sample proportion  \widehat{p} . This becomes an important difference later when you have to make approximations of the population based on samples, and you of course do not have access to the population parameters.
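As a small sketch of these formulas (the population proportion p = 0.4 and sample size n = 100 here are hypothetical values, not from the text), you can check the rule of thumb and compute the mean and standard error of \widehat{p}:

```python
from math import sqrt

p, n = 0.4, 100                        # hypothetical population proportion and sample size

# Rule of thumb: the normal approximation is reasonable
# when np and n(1 - p) both exceed 10.
assert n * p > 10 and n * (1 - p) > 10

mean_phat = p                          # mean of the sampling distribution of p-hat
se_phat = sqrt(p * (1 - p)) / sqrt(n)  # standard error of p-hat
print(mean_phat, round(se_phat, 4))
```

Here the standard error works out to about 0.049, so sample proportions from samples of 100 will typically fall within a few hundredths of 0.4.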

Review this material in Sampling Distribution of a Proportion and The Sample Proportion.


3f. Identify or approximate a sampling distribution based on the properties of the population

  • How do the properties of the population distribution affect the sampling distribution?

If you have a very large sample size, the short answer is that they do not: the sampling distribution will be bell-shaped, with a predictable mean and standard deviation.

This does not hold for smaller sample sizes, where the properties of the sampling distribution are far less predictable. In some cases we can derive sampling distributions and their properties through small-sample methods, but those are beyond the scope of this course.


3g. Compare and evaluate the sampling distributions of different sample sizes

  • What effect does changing the sample size have on the sampling distribution of a mean?

Keep two things in mind here:

  1. Unless the underlying population distribution is normal, the sample size must be sufficiently large for the sampling distribution to be normal.
  2. As we stated above, the standard error (standard deviation of the sample means) becomes smaller with a larger sample. More specifically, it decreases by a factor of the square root of the sample size. For example, if you multiply the sample size by 4, you must divide the standard error by 2 (the square root of 4).
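The square-root relationship in point 2 is easy to verify numerically. A minimal sketch (the population standard deviation of 12 is an assumed value for illustration):

```python
from math import sqrt

sigma = 12.0                  # hypothetical population standard deviation
for n in (25, 100):           # quadrupling the sample size...
    print(n, sigma / sqrt(n)) # ...halves the standard error: 2.4 -> 1.2
```

Multiplying n by 4 (25 to 100) divides the standard error by sqrt(4) = 2, taking it from 2.4 down to 1.2.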

Review this material in Introduction to Sampling Distributions.


3h. Compare and evaluate the performance of different estimators based on their sampling distributions

  • What is bias and how does it differ from accuracy? Can an estimator be biased yet accurate? Can it be inaccurate and unbiased?
  • Why is the sample mean considered an unbiased estimator for the population mean?

No estimator is perfect. We use an estimator such as the sample mean to give us the best possible estimate of a population mean, but because of sampling error the sample mean will vary from sample to sample.

What sampling error refers to is that because we do not have access to the entire population, our sample statistics will differ from the population parameter, and from each other. If you have a population of size 1,000 with a population mean of 50, you can take 5 samples of size 20, and because each time you have a different sample, you will get a different sample mean each time, such as {49, 51, 50, 48, 55}.

An unbiased estimator will sometimes estimate too high, sometimes too low, but you will get a good estimate in the long run if you average them. In other words, the sample mean is just as likely to overestimate the population mean by 2 as it is to underestimate by 2.

A biased estimator might be more likely to overestimate than underestimate, or vice versa. Accuracy differs from bias in that it refers to how far the individual statistics tend to fall from the parameter, regardless of direction. It is important, but not as important as bias. Accuracy is generally better when we have larger samples and when there is less variance (or standard deviation) in the population.
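A simulation can illustrate unbiasedness. The sketch below (assuming Python's standard library; the normal population with mean 50 and standard deviation 8 is a hypothetical choice echoing the example above) draws many samples of size 20: each sample mean misses the population mean, but averaging the estimates recovers it.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: normal with mean 50 and sd 8
mu, sigma, n = 50.0, 8.0, 20

# Each sample mean is off by a little, sometimes high and sometimes low,
# but the long-run average of the estimates recovers the population mean.
estimates = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(5_000)
]
print(round(statistics.mean(estimates), 2))   # close to 50
```

Individual entries of `estimates` spread out with a standard error of about 8/sqrt(20) ≈ 1.8, yet their average lands very close to 50, which is what "unbiased" means in practice.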

Review this material in Characteristics of Estimators.


Unit 3 Vocabulary

  • Bias
  • Central Limit Theorem
  • Estimator
  • Random variable
  • Sampling distribution
  • Standard error
  • Sampling error
  • Unbiased estimator
  • Z distribution