MA121 Study Guide

Site: Saylor Academy
Course: MA121: Introduction to Statistics
Book: MA121 Study Guide

Description


Navigating the Study Guide


Study Guide Structure

In this study guide, the sections in each unit (1a., 1b., etc.) are the learning outcomes of that unit. 

Beneath each learning outcome are:

  • questions for you to answer independently;
  • a brief summary of the learning outcome topic;
  • and resources related to the learning outcome. 

At the end of each unit, there is also a list of suggested vocabulary words.

 

How to Use the Study Guide

  1. Review the entire course by reading the learning outcome summaries and suggested resources.
  2. Test your understanding of the course information by answering questions related to each unit learning outcome and defining and memorizing the vocabulary words at the end of each unit.

By clicking on the gear button on the top right of the screen, you can print the study guide. Then you can make notes, highlight, and underline as you work.

Through reviewing and completing the study guide, you should gain a deeper understanding of each learning outcome in the course and be better prepared for the final exam!

Unit 1: Statistics and Data

1a. Describe various types of sampling methods for data collection, and apply these methods

  • Explain the difference between descriptive and inferential statistics.
  • What is the difference between quantitative and qualitative data?
  • Define and describe the difference between the two types of quantitative data, discrete and continuous.
  • Define bias and inferential data.
  • How can researchers control for possible bias in samples?
  • Define sample and sample size.
  • Define three types of sampling: stratified, cluster and systematic.

Descriptive statistics provide facts about a set of data, which are often depicted in a graph. Where are the data centered? What is the mean? What is the median? Are the data symmetric and bell-shaped? Are the data skewed to the right or left? In other words, do most of the values pile up at one end of the graph? These descriptions tell us facts about the data in front of us, whether that data set is an entire population or a sample we have chosen.

Inferential statistics refers to drawing conclusions about a population of data when we can only examine a sample of it. Inferential statistics is the more common branch of statistics, since we rarely have the time or resources to survey or measure every item, member, or person in a given group or population. In statistics, the differences among various types of data are important, since they determine which types of tests or graphs are most useful for displaying data and answering certain questions.

Quantitative (or numeric) data describes data that are numeric or mathematical. We use quantitative data to calculate sums, averages, means, other types of statistics, and other mathematical operations.

Qualitative (or categorical) data describes data that are non-numeric. Qualitative data includes text, letters, and words. Qualitative data can also include numerical digits, but mathematical operations do not make sense or may be impossible in this usage.

For example, think of a phone number, zip code, or postal code. Although these designations consist of numerical digits (numbers) and may include dashes and parentheses, we do not use them to conduct calculations. We do not add up or calculate an average for the numbers in our telephone contact list or postal code. Phone numbers, zip codes, and postal codes are examples of qualitative data.

We can categorize quantitative data further as discrete or continuous data.

A discrete data set contains a fixed, or small, number of possible values. The classic example of a discrete data set is a six-sided die. You can only roll a one, two, three, four, five or six. You cannot roll a 2.8.

A continuous data set has a large or infinite number of possibilities. Consider the weight of a group of college students. When we discard extreme outliers, our range may be from 120 to 200 pounds. This dataset includes 81 possible values, assuming we measure to the whole pound. When a 285-pound football player enters our dataset, we must add another 85 possibilities, for a total of 166 possible values, and measuring to the fraction of a pound would multiply the possibilities further. Although this number is not infinite, it is impractical to treat the data as discrete.

Bias is a concern whenever we use inferential statistics. For example, because we cannot practically consider every person in the United States as part of our dataset, we have to make sure our sample is representative of the entire population.

A common example of how things can go wrong is the 1936 U.S. presidential election when political polling was brand new. The Literary Digest magazine mailed thousands of survey cards to its readers and people it found in the phone book and car registration directory to poll whether Republican challenger Alf Landon would defeat the Democrat presidential incumbent Franklin D. Roosevelt. The responses predicted Landon would win in a landslide. What went wrong?

Their sample size was large enough and their mathematical calculations were probably correct. However, in 1936 phones, cars, and even magazine subscriptions, were considered luxury items. The magazine's sample disproportionately represented wealthier Americans who were more likely to vote Republican.

While it is difficult to eliminate sampling bias entirely, we can use different sampling methods to reduce it.

  1. Stratified sampling divides the population into sub-populations and takes a random sample of each group. For example, if you think Democrats, Republicans, and Independents are going to poll differently on an issue, and these groups represent 30, 25, and 45 percent of the entire U.S. population, respectively, your sample should reflect the same proportions. You should choose 30 Democrats, 25 Republicans, and 45 Independents at random and send the survey form to all 100 of these individuals.

  2. Researchers use cluster sampling when their population is already divided into representative groups. For example, a researcher might study buildings (or clusters) in an apartment complex, choose a certain number of buildings at random, and sample everyone in each building.

  3. Researchers use systematic sampling when they have a rough idea of population size, but lack a representative cluster to sample. For example, a researcher might poll every tenth person who walks through the door in a shopping mall or an inspector might examine every 20th item on an assembly line for quality.
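These sampling methods are easy to prototype in code. The sketch below draws a stratified and a systematic sample from a hypothetical population of 1,000 voters; the party labels and the 30/25/45 proportions mirror the example above, and everything else is illustrative.

```python
import random

# Hypothetical population of 1,000 voters; the 30/25/45 split mirrors
# the proportions in the stratified-sampling example above.
random.seed(0)  # reproducible draws
population = ["Democrat"] * 300 + ["Republican"] * 250 + ["Independent"] * 450

# Stratified sample: draw from each sub-population in proportion.
strata = {"Democrat": 30, "Republican": 25, "Independent": 45}
stratified = []
for party, k in strata.items():
    members = [p for p in population if p == party]
    stratified.extend(random.sample(members, k))

# Systematic sample: take every 10th member of the population.
systematic = population[::10]

print(len(stratified), len(systematic))  # 100 100
```

Cluster sampling would instead pick whole groups at random (for example, a few apartment buildings) and survey everyone inside each chosen group.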

Review this material in Basic Definitions and Concepts, Descriptive Statistics, Inferential Statistics, and Variables.


1b. Create and interpret frequency tables

  • What is a frequency table?
  • Define class, bin, and class intervals.
  • Define outlier.
  • Why is it important to include every possible value between the lowest and highest in the left side of a frequency table, even when that value is not represented or has no frequency? In other words, if the data you obtain from an experiment ranges from 1 to 5 and there are no 4s, why do you need to display a row for 4?
  • What do you do when there are too many values in a data set to give each variable its own row? Let's say the variable is yards rushing during a football game and the possible values are 51 to 218. Your space does not allow you to display more than 160 rows in your table. What do you do?

A frequency table lists every possible value of the random variable in a data distribution. It must include every interior value, even when that data point has no frequency, because one purpose of a frequency table is to display how varied the data are.

For example, if our values are 1, 2, 3, and 25, you should position the 25 so it sits 22 rows away from the 3, to illustrate how much of an outlier it is. This means you should include rows for 4 through 24, each with a frequency of zero.

If the distribution has too many possible values, you can group the values into class intervals (sometimes called classes or bins) and mark the frequency for each.

Let's return to our football example. You can display the number of players who have rushed 50 to 59 yards, 60 to 69 yards, and so on. If you treat this data as discrete, it will be a long, tedious table with mostly ones and zeros, since you would need to show the number of players who rushed 50 yards, the number of players who rushed 51 yards, the number of players who rushed 52 yards, and so on. Grouping or organizing the figures into classes of ten yards each is more concise and gives the reader a more comprehensive picture of the total data.

If you group your data into classes make sure:

  1. Each class is the same width, 
  2. Each class does not overlap, and 
  3. Each class accounts for every possible value.

For example, we could use 50 to 59 yards, 60 to 69 yards, and so on. Each class width would equal ten rushing yards. We should not use 50 to 55 yards, 56 to 70 yards, or 71 to 73 yards.

Note that we are assuming the measurements are in whole yards. If we had a measurement of 59.5 in the above grouping, we would violate our third rule (that each class accounts for every possible value), since 59.5 falls between the 50-to-59 class and the 60-to-69 class. For data like this, you may have to group your data as 50 to 60, 60 to 70, and so on, with the convention that a boundary value such as 60 belongs to the upper class; then 55.5, and even 59.99, would go in the first class or bin.
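Grouping data into classes like this is straightforward to automate. A minimal sketch, using left-inclusive classes of width ten and made-up rushing-yard values:

```python
from collections import Counter

# Hypothetical rushing-yard measurements (illustrative values only).
yards = [51, 55, 59.9, 63, 68, 71, 95, 102, 110, 111, 150, 218]

# Left-inclusive classes of width 10: [50, 60), [60, 70), ...
# so 59.9 lands in the 50s class and a value of exactly 60 would not.
counts = Counter(10 * int(y // 10) for y in yards)

# Display every class between the lowest and highest, even empty ones.
low, high = min(counts), max(counts)
for start in range(low, high + 10, 10):
    print(f"{start} to {start + 10}: {counts.get(start, 0)}")
```

Printing the empty classes, rather than skipping them, is what lets a reader see the gaps and outliers in the data.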

Your decision on how wide to make your class intervals, and how many intervals to include, is really a matter of personal preference. The course textbook refers to formulas like Sturges' Rule (the number of intervals is about one plus the base-2 logarithm of the number of observations) and the Rice Rule (twice the cube root of the number of observations), which are easy to compute. But you may need to make some adjustments, since your result will probably not be a whole number (integer).

Do not worry too much about memorizing these rules, since statisticians do not even agree on which rule is best. Researchers generally recommend using 5 to 20 classes, but this can vary depending on whether your data is homogeneous or heterogeneous. It might be a good strategy to put your data in groups of tens so you can classify each data point by its first digit.

Basically, if you have too few classes, your readers may not appreciate the diversity of your data. If you have too many classes, you will probably display a lot of ones and zeros which can become tedious for the reader. Experiment with a few options until you see one that makes sense to you and your audience.
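As a rough sketch of those rules of thumb: Sturges' Rule is usually stated as 1 + log₂(n) classes and the Rice Rule as twice the cube root of n, each rounded up to a whole number.

```python
import math

def sturges(n):
    # Sturges' Rule: about 1 + log2(n) classes, rounded up.
    return math.ceil(1 + math.log2(n))

def rice(n):
    # Rice Rule: about twice the cube root of n, rounded up.
    return math.ceil(2 * n ** (1 / 3))

# The two rules disagree for the same data set -- a reminder that
# the number of classes is partly a matter of judgment.
print(sturges(200))  # 9
print(rice(200))     # 12
```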

Review this material in Histograms and Frequency Distribution.


1c. Display data graphically and interpret the following types of graphs: stem plots, histograms, and boxplots

  • Define a histogram. How does it differ from a bar graph? What are some rules for drawing histograms that do not apply to bar graphs?
  • How is a stem plot similar to and different from a histogram?
  • What is a dot plot?
  • What does a box plot represent graphically?

A histogram is a special type of bar graph researchers use to display quantitative distributions. A histogram differs from a bar graph because the horizontal axis is numeric: the horizontal axis of a bar graph represents qualitative data.

A histogram should follow the same three rules for frequency tables listed above. The numbers on the horizontal axis need to be in order, whether or not they are grouped into classes. This makes it easy for readers to recognize any data point that is an outlier, due to the distance of the outlier's bar from the other bars in the graph. You must include any data point or interval that has zero values, shown with no bar. Unlike the bars of a bar graph, the bars of a histogram touch, since the classes cover the entire range of values without gaps. For consistency, organize the data in the same way on the histogram whether it is homogeneous or heterogeneous. Remember that a histogram's main purpose is to represent a frequency table graphically.

A stem plot (also called a stem-and-leaf plot) is similar to a histogram, except it lists the last digit of each actual data value (the leaves) next to the stem of the first digit or digits. Researchers use a stem plot to display the actual data on their graph. Stem plots quickly convey the minimum, maximum, and median of the data points to their readers.

A stem plot is similar to a dot plot, which uses dots or similar markings to represent data points. For example, if the bar height (frequency) of the histogram is seven, and the values are 50, 51, 52, 54, 55, 56, 59, the histogram will display a bar that is seven units high. A dot plot will have seven dots going up from the horizontal axis, and a stem plot will have "5" in the stem, with the digits 0, 1, 2, 4, 5, 6, 9 written out to the right.

A box plot graphically displays the five-number-summary of a data set. The five numbers are: minimum, first quartile, median, third quartile, and maximum. We will review the five-number summary in later units, but think of the first and third quartiles as the median of the lower and upper halves of the distribution, respectively. In other words, the five numbers partition the data set into four quartiles, each with (roughly) the same number of data points.

Review this material in Histograms, Stem and Leaf Displays, Dot Plots, and Box Plots.


1d. Identify, describe, and calculate the following measures of the location of data: quartiles and percentiles

  • What is a percentile, and how is it related to or different from a quartile?
  • What is the relationship between quartiles and the median?
  • What is the five-number summary?
  • My calculator and Excel give me two different numbers for the 1st and 3rd quartile. Why is that?

The Pth percentile of a data set means that P% of the data falls below that number and (100−P)% fall above that number. For example, the 80th percentile is the number for which 80% of the data is below and 20% is above. The median (half the data below, half above) is also the 50th percentile. These are approximate, especially for small data sets. This is because, for example, if you have 20 data points, finding the 37th percentile is not going to be an exact number. Since there are only 20 numbers (assuming no repeats), you'll have a 35th percentile and the next number up will be the 40th.

The quartiles of a data set divide that data set into four roughly-equally-sized (number of data points) parts. Again, we say "roughly" because if the number of data points is somewhat small and is not a multiple of 4 (17 data points, for example), you are not going to get four quartiles of the same size. This is another reason why, as we state above, finding quartiles is approximate.

When you have very large data sets (100 or more), you can find theoretical percentiles by using the Normal Distribution (see Unit 2). The median is the same thing as the 2nd quartile. The 1st quartile divides the lower half of the data in half, and the 3rd quartile splits the upper half of the data. The minimum, 1st quartile, median/2nd quartile, 3rd quartile, and maximum make up the boundaries of the four quarters of data and are referred to as the five-number summary.

Finding quartiles for small data sets can be tricky, and even a bit subjective, if the number of data points is not a multiple of four. In the set: {0, 3, 4, 6, 8, 11, 11, 13, 15, 17, 20, 25}, the median is 11, the first quartile (separating the 3rd and 4th point) is 5, the third quartile (separating the 9th and 10th point) is 16. This is relatively simple because we have 12 data points and we can evenly divide them into groups of 3: {0, 3, 4 || 6, 8, 11 || 11, 13, 15 || 17, 20, 25}.

However, let's say we have 15 data points: {1, 3, 4, 5, 7, 8, 10, 10, 11, 13, 15, 16, 19, 20, 25}. The Q2/median is the 8th data point (10). What about the 1st quartile? If we include the median, the lower half is {1, 3, 4, 5, 7, 8, 10, 10}, in which case the 1st quartile (the median of this set) is 6. If we exclude the median, the lower half is {1, 3, 4, 5, 7, 8, 10}, and the 1st quartile is 5. That is why different technologies may give you different answers. Which is correct? Well, both. Even Microsoft Excel has two different functions for quartiles, one inclusive and one exclusive. This discrepancy disappears as the data set gets very large.
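Python's statistics module implements both conventions, so you can see the discrepancy directly: `method="inclusive"` keeps the median in each half, while `method="exclusive"` (the default) leaves it out.

```python
from statistics import quantiles

data = [1, 3, 4, 5, 7, 8, 10, 10, 11, 13, 15, 16, 19, 20, 25]

inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")

print(inclusive[0])  # 6.0 -> 1st quartile with the median included
print(exclusive[0])  # 5.0 -> 1st quartile with the median excluded
```

Excel's QUARTILE.INC and QUARTILE.EXC functions make the same distinction.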

Review this material in Percentiles and Relative Position of Data.


1e. Identify, describe, and calculate the measures of the center of mean, median, and mode

  • Explain the differences between mean and median.
  • What does the center of distribution tell us?
  • When is the median a better measure of center than the mean?
  • When is mode preferable to the mean or median?

We calculate the mean by adding all of the data points and dividing the sum by the number of data points. The mean provides a rough idea of the center of the distribution. The median does this too, but disregards all data points except for the one or two in the middle.

You can think of the median as more resistant to outliers. In other words, when a researcher adds a significantly higher or lower value to a data set, or if the data is right or left skewed, the mean will adjust accordingly, with a significant upward or downward effect. The median, on the other hand, disregards the high and low values, so adding extreme values will have a much smaller, if any, effect on the median.
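A quick sketch with made-up numbers shows this resistance in action:

```python
from statistics import mean, median

# Illustrative data set: mean and median agree before the outlier arrives.
data = [20, 30, 40, 50, 60]
print(mean(data), median(data))  # 40 40

# One extreme value drags the mean sharply upward but barely moves the median.
data.append(600)
print(mean(data), median(data))  # mean jumps to ~133.3, median only to 45
```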

The mode conveys the most common data type. It is different from the mean and median because it is the only measure of center we can use with qualitative data, since it does not require a calculation or computation.

Review this material in Mean, Median, and Mode; Variance; Measures of Central Tendency; Median and Mean; and Mean and Median Demonstration.


1f. Identify, describe, and calculate the following measures of the spread of data: variance, standard deviation, and range

  • Why is range generally not a reliable measure of spread?
  • Why are measures of spread necessary? What critical information does the measure of center fail to provide?
  • What is the difference between variance and standard deviation?

Measures of spread are just as important as measures of center. The mean and median, for quantitative data, give us an idea about where the center of the distribution is.

The variance and standard deviation tell us about the spread of the data, or how varied or heterogeneous your data are. A variance of zero happens if all data points are equal. For example, the data sets {49, 50, 51} and {0, 50, 100} have the same mean and median (50), but the second set is much more varied. We quantify this variability with the measure of spread.

The variance equals the mean of the squared differences between each data point and the mean.

The standard deviation equals the square root of the variance. One of the reasons we compute this is to get the units back to the original data set. If the data points are in minutes, the unit for the variance would be "square minutes" which does not make sense.

The simplest measure to use is the range (maximum minus minimum), but the range is generally not a good measure of spread since it only takes the highest and lowest data points into account.

Consider Data Set A = {47, 48, 49, 50, 51} and Data Set B = {47, 48, 49, 50, 51, 100}.

The range jumps from 4 for Set A to 53 for Set B. Set B is certainly more varied, but its range is driven entirely by the single outlier of 100, while the standard deviation takes every data point into account. The standard deviation is about 1.6 for Set A and 20.9 for Set B.
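You can verify these figures with Python's statistics module, which computes the sample standard deviation:

```python
from statistics import stdev

set_a = [47, 48, 49, 50, 51]
set_b = [47, 48, 49, 50, 51, 100]

# The range depends only on the two extremes...
range_a = max(set_a) - min(set_a)  # 4
range_b = max(set_b) - min(set_b)  # 53

# ...while the standard deviation weighs every data point.
print(round(stdev(set_a), 1))  # 1.6
print(round(stdev(set_b), 1))  # 20.9
```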

Review this material in Measures of Variability and Mean, Median, and Mode; Variance.


Unit 1 Vocabulary

  • Bar graph
  • Bell-shaped
  • Bias
  • Bins
  • Box plot
  • Centered
  • Classes
  • Class intervals
  • Cluster sampling
  • Continuous data
  • Descriptive statistics
  • Discrete data
  • Dot plot
  • First quartile
  • Five-number summary
  • Frequency table
  • Histogram
  • Horizontal axis
  • Inferential statistics
  • Leaf
  • Mean
  • Measure of center
  • Median
  • Mode
  • Outlier
  • Qualitative data
  • Quantitative data
  • Quartiles
  • Range
  • Representative
  • Resistant
  • Sample
  • Sampling method
  • Skewed left
  • Skewed right
  • Spread
  • Standard deviation
  • Stem
  • Stem plot
  • Stem and leaf plot
  • Stratified sampling
  • Systematic sampling
  • Third quartile
  • Variance
  • Variation

Unit 2: Elements of Probability and Random Variables

2a. Apply simple principles of probability, and use common terminology of probability

  • Define probability. Explain how to compute probability.
  • Explain the concept of equally likely outcomes.
  • Define a random variable. How is it different from variables you may have encountered in Algebra? Name the types of random variables.

Think about probability as the chance that an outcome or event will occur. The probability of something occurring is a number ranging from zero (zero percent, or no chance) to one (100 percent, or certainty).

We express probability as a decimal for use in calculations. In other words, we write 55 percent as p = 0.55. If an experiment has equally likely outcomes, then the total number of possible outcomes is the denominator and the number of "successful" outcomes is the numerator.

For example, the probability of rolling greater than four on a six-sided die is 2/6, because rolling a five and six are the successes out of six outcomes. We cannot extend this to the sum of two dice. Totals of 2 through 12 make eleven possible events, but not all of them are equally likely. There is only one way {1,1} to roll a 2, but six ways {1,6}, {2,5}, {3,4}, {4,3}, {5,2}, {6,1} to roll a 7.

There are many variations of the probability formula for different situations and distributions, but we generally calculate probability as the number of "favorable" outcomes (that is, outcomes you are looking for) divided by the total number of possible outcomes.
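The two-dice example can be checked by listing all 36 equally likely ordered pairs and counting favorable outcomes, a brute-force sketch of the favorable/total formula:

```python
from itertools import product

# The 36 equally likely ordered outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# p = favorable outcomes / total outcomes -- valid here because each of
# the 36 ordered pairs is equally likely; the 11 possible totals are not.
p_seven = sum(1 for a, b in outcomes if a + b == 7) / len(outcomes)
p_two = sum(1 for a, b in outcomes if a + b == 2) / len(outcomes)

print(p_seven)  # 6/36, about 0.167
print(p_two)    # 1/36, about 0.028
```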

A random variable differs from the variables you have seen in algebra because, in those courses, the value of x is fixed and we solve for its value by following certain steps.

For example, for the equation 2x = 4, x always equals two; it is just a matter of finding it. The study of probability introduces the concept of random variables: their value results from a probability experiment.

If x equals the number of times a coin lands on heads when you flip a coin five times, x will have a value from 0 to 5. We do not know the specific value until we have tossed the coin five times.

A random variable can be discrete (a specific set of possible values) or continuous (many or an infinite number of possible values). In our coin toss example above, x is a discrete random variable, since x can only have six possible values (0 to 5).

An example of a (practically) continuous random variable is a team's score during a basketball game. Strictly speaking the score is a whole number, but just as we discussed above, there are too many possible values to give each one its own row in a frequency table.

Review this material in Remarks on the Concept of "Probability" and Random Variables.


2b. Calculate conditional probability, and determine whether two events are mutually exclusive and whether two events are independent

  • Define an outcome and an event. How are the two related?
  • What does it mean for two events to be dependent or independent?
  • What does it mean for two events to be mutually exclusive?
  • Define conditional probability.
  • Define a compound event.

An outcome is the result of a single trial of an experiment, such as flipping a coin, measuring someone's height, or asking someone to name their favorite baseball team. An event is a collection of one or more outcomes.

Two events are independent when the occurrence of one event or action does not depend on the occurrence of another.

  • An example of independent events is: A = roll a five on a die, and B = flip heads on a coin. The two events do not have a causal relationship.
  • An example of dependent events is: A = the temperature is below freezing, and B = it snows. In this case, event B is more likely to happen if A occurs than if it does not.

Compound events are combinations of events; we can calculate the probabilities for them as well. Common conjunctions are "AND", "OR", and "GIVEN" (conditional probability). In our above example, "Below freezing AND snowfall" is an example of the compound event (A and B).

Mutually-exclusive events (also called disjoint events) are events that cannot occur at the same time.

  • An example of mutually-exclusive events is: A = roll greater than four on a six-sided die, and B = roll less than three on a six-sided die. These events cannot occur at the same time.

Symbolically, we describe this as  P(A \cap B) = 0

We pronounce the notation  P(A | B)  as "A given B", which means we are looking for the probability that A occurs given that B occurs, or has occurred. We call this situation conditional probability.

Review this material in Remarks on the Concept of "Probability" and Basic Concepts.


2c. Calculate probabilities using the addition rules and multiplication rules

  • Define the general addition rule of probability. How does it relate to whether events are mutually exclusive, or not?
  • Define the multiplication rule of probability. How does it relate to whether events are independent, or not?
  • Define the special addition rule, general multiplication rule, and special multiplication rule.

We generally associate the general addition rule of probability with "or" compound events. We associate the multiplication rule with "and" compound events.

The special addition rule p(A or B) = p(A) + p(B) holds if A and B are mutually exclusive (cannot both occur).

If A and B are not mutually exclusive, the general addition rule holds: p(A or B) = p(A) + p(B) − p(A & B).

The special multiplication rule is p(A & B) = p(A) × p(B), which holds if A and B are independent.

If they are dependent events, that is when conditional probability comes in, and we use the general multiplication rule: p(A & B) = p(A) × p(B | A).
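You can sanity-check these rules by brute-force enumeration. The sketch below uses two dice events chosen purely for illustration:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

def p(event):
    # Probability of an event given as a predicate on one (a, b) outcome.
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

A = lambda o: o[0] + o[1] == 7  # total is 7
B = lambda o: o[0] == 1         # first die shows 1

# Addition rule: p(A or B) = p(A) + p(B) - p(A and B)
lhs = p(lambda o: A(o) or B(o))
rhs = p(A) + p(B) - p(lambda o: A(o) and B(o))
print(lhs == rhs)  # True

# Multiplication rule with conditional probability:
# p(A and B) = p(B) * p(A | B)
p_a_given_b = p(lambda o: A(o) and B(o)) / p(B)
print(p(lambda o: A(o) and B(o)) == p(B) * p_a_given_b)  # True
```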

Review this material in:


2d. Construct and interpret Venn diagrams

  • Define a union or an intersection of events.
  • Explain how to use Venn diagrams to illustrate outcomes and events.

The probability of a union of events,  p(A\cup B) , is the probability that either A or B (or both) occurs.

The probability of an intersection of events,  p(A\cap B) , is the probability that A and B both occur. We can represent these on a Venn diagram as events (circles) whose points are outcomes. A union is pictured with both circles shaded in, whereas an intersection is represented by shading only the area the circles share.

Review this material in Probability with Playing Cards and Venn Diagrams and Addition Rule for Probability.


2e. Apply useful counting rules in the context of combinatorial probability

  • Define and explain the difference between a combination and a permutation.

Combinations and permutations both count the number of possible ways that x out of a possible n outcomes can occur. The difference is that order does not matter in a combination, but order does matter in a permutation.

For example, there are ten ways to choose x = 2 out of the first n = 5 letters of the alphabet: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE. If order does not matter, ten combinations are possible. You could reverse the order of any of the letters and it would not matter.

There would be 20 permutations if AB and BA were considered different. Without listing them all here, you can intuitively see this because you have ten combinations and each combination can be in two different orders (AB or BA), so 10 × 2 = 20 permutations.
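Python's math module (3.8+) computes both counts directly:

```python
import math

# Choosing x = 2 of the first n = 5 letters, as in the example above.
print(math.comb(5, 2))  # 10 combinations (order ignored)
print(math.perm(5, 2))  # 20 permutations (order matters)

# The relationship noted above: each combination of 2 items can be
# arranged in 2! orders, so permutations = combinations * 2!.
assert math.perm(5, 2) == math.comb(5, 2) * math.factorial(2)
```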

Review this material in Permutations and Combinations.


2f. Identify and use common discrete probability distribution functions

  • Define a probability distribution. How does a probability distribution relate to the frequency tables we reviewed in Unit 1?
  • What is the difference between a discrete and continuous probability distribution?

A probability distribution consists of each possible value (or interval of values) of a random variable, and the probability that the variable will take on that value.

Probability distributions have many implications for decision making. When you interpret data, you will need to know which probability distribution the value you are trying to estimate follows. Probability distributions are related to the frequency tables of Unit 1 in that the probabilities are equivalent to what we call the relative frequency distribution.

If you have a list of data, you can get the probabilities on the right side of the table by dividing the frequency by the total number of data points.

  • For example, if the frequency of x = 3 is 7 in 20 die rolls, then the probability of rolling a three is 7/20 = 0.35, and so 0.35 would go across from value 3 in the table.

The difference between discrete and continuous probability distributions is analogous to the difference between discrete and continuous variables.

A discrete distribution (like a die roll) has a countable set of possible values for the random variable x, while a continuous distribution has infinitely many possible values for x. For a continuous distribution, the values of x must be grouped into intervals, just as in a frequency distribution. The same rules apply as those for relative frequency histograms: all intervals must be equal width, non-overlapping, and all-inclusive.

Review this material in:


2g. Calculate and interpret expected values

  • What is the expected value of a distribution and how is it related to the mean of a set of data?

The expected value of a distribution is another name for the mean of the distribution.

This means that if you take a large set of numbers which follows the original distribution, the arithmetic mean (sum divided by n) of those numbers should roughly equal the expected value of the distribution.

We calculate the expected value of a distribution by multiplying each value of x by its probability (using the midpoint of each interval if the distribution is continuous) and then summing those products.
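For example, a minimal computation of the expected value of a fair six-sided die:

```python
from fractions import Fraction

# Each face of a fair die has probability 1/6; E(x) = sum of x * p(x).
faces = [1, 2, 3, 4, 5, 6]
expected = sum(x * Fraction(1, 6) for x in faces)
print(expected)  # 7/2, i.e. 3.5
```

No single roll can equal 3.5; the expected value is what the arithmetic mean of many rolls should approach.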

Review this material in Probability Distributions for Discrete Random Variables.


2h. Identify the binomial probability distribution, and apply it appropriately

  • What is a binomial experiment?
  • What is a binomial distribution? What characteristics do a set of events have to have to follow a binomial distribution?

A binomial experiment is a random experiment that has exactly two possible values (quantitative or qualitative) for x:

Flipping a coin: x = {heads, tails}

Answer to a true-false question: x = {true, false}

Free throw in basketball: x = {made, not made}

A binomial distribution is the distribution of the discrete random variable x, where x represents the number of successes out of n possible events. Note that we mean success in a generic sense: success may not be something positive. "Success" means the characteristic you are looking for or researching, regardless of whether you consider the outcome to be good or bad.

In our basketball free-throw example, if a player takes 10 free-throw shots, they will be successful zero to 10 times. Possible values for x are {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.

These values and their associated probabilities comprise a binomial probability distribution, which must satisfy three criteria:

  1. Each trial is binomial (it has exactly two possible outcomes).
  2. The trials succeed or fail independently of each other (that is, all free throws are independent events).
  3. Every trial has the same probability of success.

A binomial distribution is a family of distributions where each specific distribution is specified by two parameters: n = the number of trials, and p = the probability of success per trial.

For our basketball example, if the player hits their free throws 75% of the time, we would define this distribution as binomial with n = 10 and p = 0.75. There are an infinite number of possible binomial distributions, with each combination of n & p making a unique distribution. 

TIP: Any specific distribution within a family of distributions is defined by its parameter(s). For example, binomial distributions have parameters n & p.

We can calculate the expected value of this distribution by multiplying n and p, or n × p.
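The pmf and expected value above can be checked with a short standard-library Python sketch (the helper name binomial_pmf is our own, not from any library):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Free-throw example from the text: n = 10 shots, p = 0.75 per shot.
n, p = 10, 0.75

# The probabilities over all possible values of x (0 through 10) sum to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# Expected value of the distribution: n * p = 7.5.
expected = n * p
```

Note that the expected value of 7.5 made free throws is not a whole number, even though x itself is discrete.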

Review this material in The Binomial Distribution and Binomial Distribution.


2i. Identify the Poisson probability distribution, and apply it appropriately

  • Define a Poisson distribution. When can you use it to estimate a binomial distribution? 
  • Describe a second general use for the Poisson distribution in addition to approximating a binomial distribution.
  • A binomial distribution with n = 20 and p = 0.4 gives the same lambda (or mean), and therefore the same approximating Poisson distribution, as a binomial distribution with n = 2000 and p = 0.004. How can the same Poisson distribution be "equivalent" to multiple binomial distributions?

The Poisson probability family of distributions (pronounced pwah-SOHN) has two main applications:

  1. The Poisson distribution is related to the binomial distribution because you can use it to approximate the binomial distribution when n is a large value and p is a small value. The Poisson distribution is easier to calculate because it has only one parameter, represented by the Greek letter lambda (λ), which is also the expected value. As with the binomial distribution, the expected value is n × p.

  2. Statisticians refer to the Poisson distribution as the distribution of rare events. Suppose 1,000 cars drive on a section of road every day, and 1.2 of them get into an accident on average. Since the mean is so small compared to n, we can model this using a Poisson distribution with  \lambda =1.2 . We can use the formula to find the probability of x = 1 accident, x = 2 accidents, and so on.

    A minor difference between the Poisson distribution and the binomial distribution is that while both are discrete, all possible values of x in a binomial distribution lie between zero and n. The Poisson distribution has no such upper bound: in theory, x can have ANY whole number value from zero to infinity.

    For our car example, any probability for x greater than, let's say five, is so remote that it is effectively zero. Theoretically we could calculate the probability p(x = 150).

    How would you calculate lambda? λ = the expected number of occurrences during a fixed time period. So if a toll booth averages 45 cars per hour, and your fixed time period is 10 minutes, then λ is the expected number of cars in 10 minutes: 45 × (10/60) = 7.5. Note that although the binomial and Poisson distributions are discrete, their expected values do not have to be whole numbers.

Finally, for each of the two distributions in the third question above, lambda  \lambda =8 . Finding the probability p(x = 7) from the Poisson distribution therefore gives the same answer for two different binomial distributions, even though their exact binomial probabilities differ. The Poisson distribution is a more accurate approximation for the second distribution than for the first: the larger n gets (and the smaller p gets), the better the Poisson distribution approximates the binomial.
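A quick way to see this numerically is to compare the exact binomial probabilities with the single Poisson approximation; this is a minimal sketch using only the Python standard library (the pmf helpers are our own):

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k occurrences with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

# Both binomial distributions share lambda = n * p = 8.
exact_small = binomial_pmf(7, 20, 0.4)      # n = 20,   p = 0.4
exact_large = binomial_pmf(7, 2000, 0.004)  # n = 2000, p = 0.004
approx = poisson_pmf(7, 8)                  # the single Poisson value

# The Poisson approximation is much closer to the large-n binomial.
```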

Review Poisson distribution in Poisson Distribution and Poisson Process 1 | Probability and Statistics | Khan Academy.


2j. Identify and use continuous probability density functions

  • What is the difference, in general, between computing probabilities from discrete vs. continuous distributions? 
  • Explain why one rule of continuous probability distributions is that there is zero probability that the random variable x equals any particular value.

In a discrete distribution, we can calculate the probability p(x = a) that x equals a particular value a from a formula, depending on the distribution. We can also find the probability of a range of values, such as p(1 ≤ x ≤ 5), by computing p(x = 1) through p(x = 5) and adding the numbers. 

The difference with a continuous distribution is that the probability p(x = a) that x equals any particular value is zero. Instead, we compute p(a < x < b) by finding the area under the density curve between x = a and x = b. The probability depends on that area, not on the height of the graph.

This is why statisticians often call discrete distributions probability distribution functions and continuous distributions probability density functions. We use the term "density" because probabilities are based on the area between two numbers, not the height of the graph. A probability density function, by definition, has a total area of 1 under the entire curve (the total probability).

We can explain this rationale in two ways. The most obvious reason is that since we calculate the probability that x is in a range of values by computing the area under the curve, p(x = a) corresponds to a single vertical line, which has zero width and therefore zero area. The second reason is more conceptual. A continuous distribution has infinitely many possible values of x: if we are talking about a uniform distribution between x = 0 and x = 1, an infinite number of values exist between those two endpoints. Based on the definition of probability, since there are infinitely many possible outcomes, 1/∞ tends to 0.

Review this material in Continuous Random Variables.


2k. Identify the normal probability distribution, and apply it appropriately

  • Describe the characteristics of a normal distribution. What sets it apart from any other bell-shaped distribution?
  • What does it mean to be symmetric? Describe a uniform distribution.
  • In general, how do we calculate probabilities based on a normal distribution?
  • Define a standard normal distribution and what sets it apart from other normal distributions.

As we have discussed, probability distributions can be discrete or continuous. A continuous distribution is symmetric if its density curve is symmetric around the median, which for a symmetric distribution is also the mean.

A uniform distribution has a flat density curve: every value is equally likely. By contrast, a classic example of a symmetric but non-uniform distribution is the discrete distribution where x = the sum of two six-sided dice. There is only one way each to roll a 2 or a 12, but the median x = 7 has six possible combinations: {1, 6}, {2, 5}, {3, 4}, {4, 3}, {5, 2}, and {6, 1}. This distribution has higher probabilities toward the mean/median and lower probabilities toward the edges.

A bell-shaped distribution has a bell-shaped density curve, like the dice distribution above except continuous, represented graphically by a smooth curve.

Further along in the hierarchy, we have the normal distributions, which have all the characteristics above plus a few additional tell-tale characteristics, which we refer to as the empirical rule:

  1. The probability of x having a value between 1 standard deviation under to 1 standard deviation over the mean is about 68 percent.
  2. The probability of x being between −2 and +2 standard deviations is about 95 percent.
  3. The probability of x being between −3 and +3 standard deviations is about 99.7 percent.

There is an important reason why we PLURALIZE "normal distributions" above. As we said earlier, we define distributions by their parameters, such as n & p for binomial distributions. A given combination of mean  \mu and standard deviation  \sigma makes a particular normal distribution.

Finally, the standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1. We will need to convert to the standard normal distribution (often referred to as the Z distribution) to calculate probabilities involving any normal distribution.

To find the probability of x being between a and b in a normal distribution, we must take the following steps:

  1. Convert the endpoint(s) into Z scores using the formula  Z=\frac{x-\mu}{\sigma}
  2. Use technology or a Z distribution table to look up the area to the left of b and the area to the left of a and subtract the two values.
  3. For p(x < a) convert a into a Z score and find the area left of that value.
  4. For p(x > b) convert b into a Z score, find the area left of that value and then subtract that number from 1.
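The steps above can be sketched in Python; statistics.NormalDist (Python 3.8+) plays the role of the Z table:

```python
from statistics import NormalDist

Z = NormalDist()  # the standard normal (Z) distribution: mean 0, sd 1

def prob_between(a, b, mu, sigma):
    """p(a < x < b): convert each endpoint to a Z score, then
    subtract the two left-tail areas."""
    za = (a - mu) / sigma
    zb = (b - mu) / sigma
    return Z.cdf(zb) - Z.cdf(za)

# Empirical-rule check: about 68% of values fall within one
# standard deviation of the mean, for any normal distribution.
p_one_sd = prob_between(-1, 1, 0, 1)   # about 0.6827
p_two_sd = prob_between(-2, 2, 0, 1)   # about 0.9545
```

The same function works for any mean and standard deviation, since the Z-score conversion standardizes the endpoints first.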

In summary, the hierarchy is:

  1. Probability distribution
  2. Continuous probability distribution
  3. Symmetric distribution (though discrete distributions can also be symmetric)
  4. Normal distribution
  5. Standard normal distribution

Review this material in:


Unit 2 Vocabulary

  • Addition (general and special) rules of probability
  • Binomial distribution
  • Binomial experiment
  • Combination
  • Compound event
  • Conditional probability
  • Continuous probability distribution and continuous random variable
  • Dependent and independent events
  • Discrete distribution and discrete random variable
  • Empirical rule
  • Event
  • Expected value of a distribution
  • Intersection
  • Multiplication (general and special) rules of probability
  • Mutually exclusive events
  • Normal distribution
  • Outcome
  • Parameters
  • Permutation
  • Poisson distribution
  • Probability density function
  • Probability distribution
  • Probability distribution function
  • Relative frequency distribution
  • Standard normal distribution
  • Symmetric distribution
  • Uniform distribution
  • Union
  • Venn diagram

Unit 3: Sampling Distributions

3a. Apply the central limit theorem to approximate sampling distributions

  • Define the sampling distribution of a mean.

Think again about the difference between sample and population statistics. When you take a sample from a population and measure its sample mean, then take a second sample and measure its mean, and keep going, the set of sample means will have its own distribution.

The central limit theorem states that if the original population had a normal distribution, OR the sample size is sufficiently large (n = 30 as a rule of thumb, but it can be a bit lower if the original population is CLOSE to normal), then the sample means themselves will be normally distributed, with a mean \mu _{\overline{x}}\approx \mu _{x} and standard deviation of \sigma _{\overline{x}}\approx \frac{\sigma_{x} }{\sqrt{n}}. We often call this standard deviation the standard error.
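You can watch the central limit theorem at work with a small simulation; the population parameters and sample size below are arbitrary illustration choices:

```python
import random
from statistics import mean, stdev

random.seed(1)              # for a reproducible run
mu, sigma, n = 50, 10, 36   # hypothetical population parameters, sample size

# Take many samples of size n and record each sample mean.
sample_means = [mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(5000)]

observed_se = stdev(sample_means)   # spread of the sample means
predicted_se = sigma / n**0.5       # sigma / sqrt(n) = 10/6, about 1.667
```

The observed standard deviation of the sample means lands very close to the predicted standard error.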

Review this material in: 


3b. Describe the role of sampling distributions in inferential statistics

  • How does the process of finding probabilities differ for a normal random variable versus the sampling distribution?

There is a subtle difference between the probabilities we are finding now and those we were finding at the end of Unit 2. When we first learned about probabilities from normal distributions, we were solving problems such as: If the population has mean and standard deviation of 10 and 2 respectively, find the probability that a random variable will have a value between 11 and 14.

The difference here is subtle, but important: If the population has mean and standard deviation of 10 and 2 respectively, find the probability that the mean of a sample of 10 taken from this population will have a value between 11 and 14.

Another way to think of this is that earlier you were working with sample size n = 1, so the above equation for standard error is the same as the population standard deviation, since you are dividing by one. So you still find the Z score, but rather than divide by σ, you must divide by \sigma/{\sqrt{n}}. The rest of the process is the same, taking those Z scores and using the tables or technology to find the appropriate areas under the curve.
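Here is the worked version of that problem (mean 10, standard deviation 2, sample size 10) as a Python sketch, side by side with the single-observation version:

```python
from statistics import NormalDist

Z = NormalDist()
mu, sigma, n = 10, 2, 10
se = sigma / n**0.5          # standard error, about 0.632

# Probability that the MEAN of a sample of 10 lies between 11 and 14:
p_mean = Z.cdf((14 - mu) / se) - Z.cdf((11 - mu) / se)

# Probability that a SINGLE value lies between 11 and 14 (divide by sigma):
p_single = Z.cdf((14 - mu) / sigma) - Z.cdf((11 - mu) / sigma)

# Sample means vary less than single observations, so p_mean < p_single.
```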

Review this material in: 


3c. Interpret and create graphs of a probability distribution for the mean of a discrete variable

  • What is the mean of a discrete variable and how is it similar/different from the mean of a set of numbers?

The mean of a discrete random variable (also known as the expected value) equals the mean of a set of numbers that perfectly represents that discrete distribution. A real random sample, of course, will not represent the distribution perfectly, so the two numbers might be slightly different.

Take a very simple discrete distribution: p(x = 0) = 0.5, p(x = 1) = 0.2, p(x = 5) = 0.3. By the rules for the mean of a discrete distribution (see resources below), the mean would be 0.5(0) + 0.2(1) + 0.3(5) = 1.7.

Now, let's take 10 numbers that perfectly represent this distribution: 0, 0, 0, 0, 0, 1, 1, 5, 5, 5. The sum is 17 and 17/10 = 1.7. So we get the same number from taking the mean of the distribution vs taking the mean of numbers from that distribution. Of course if we truly sample from the above distribution, because of randomness we're not going to get those 10 exact numbers, so taking the mean of 10 numbers drawn randomly from that distribution won't be exactly 1.7.
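The arithmetic in this example is easy to verify in Python (the dictionary dist is simply our encoding of the distribution above):

```python
# The simple discrete distribution from the text.
dist = {0: 0.5, 1: 0.2, 5: 0.3}

# Expected value: the sum of each value times its probability.
mu = sum(x * p for x, p in dist.items())   # 0.5(0) + 0.2(1) + 0.3(5) = 1.7

# Ten numbers that perfectly represent the distribution.
data = [0, 0, 0, 0, 0, 1, 1, 5, 5, 5]
sample_mean = sum(data) / len(data)        # 17 / 10 = 1.7
```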

Review this material in Introduction to Sampling Distributions.


3d. Describe a sampling distribution in terms of repeated sampling

  • What does it mean to take repeated samples from a population and what implication does this have?

Taking the means from repeated samples creates a sampling distribution. You can find the probability of \overline{x} in roughly the same way you found the probability of x in Unit 2.

The formula for finding a Z score is different because you divide by standard error, rather than the standard deviation.

The reason this is important is because statistics is ultimately about making predictions about populations based on a sample (inferential statistics). A major step for being able to calculate margins of error is not only to observe properties of distributions, but to see how the resulting samples behave.


3e. Define and compute the mean and standard deviation of the sampling distribution of population proportion p

  • What is the sampling distribution of a population proportion and how does it differ from the sampling distribution of a mean?
  • How is the process for finding probabilities different? Can we still use the Z distribution?

Similar to the sampling distribution of the mean, you can take repeated samples from a population where p is the proportion having a certain characteristic, and the sample proportions will be normally distributed if the number sampled each time is sufficiently large.

A good rule of thumb is that given the parameters n = sample size and p = population proportion, we can use the normal distribution if np and n(1 – p) are both greater than 10 and p is not too close to 0 or 1. With a very small or large value for p, the distribution becomes very right or left skewed, and the process for finding areas based on the Z score is unreliable.

Mean  \mu _{\widehat{p}}\approx p and standard deviation  \sigma _{\widehat{p}}\approx \frac{\sqrt{p(1-p)}}{\sqrt{n}}

Note that the formulas use the population proportion p rather than the sample proportion  \widehat{p} . This becomes an important difference later when you have to make approximations of the population based on samples, and you of course do not have access to the population parameters.
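A short sketch of those formulas, using a hypothetical population proportion and sample size:

```python
# Hypothetical example: 30% of a population has some characteristic,
# and we repeatedly take samples of n = 200 people.
n, p = 200, 0.30

# Rule-of-thumb check before trusting the normal approximation:
rule_ok = n * p > 10 and n * (1 - p) > 10   # 60 and 140, both > 10

mean_phat = p                               # mean of the sampling distribution
se_phat = (p * (1 - p) / n) ** 0.5          # its standard deviation (standard error)
```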

Review this material in Sampling Distribution of a Proportion and The Sample Proportion.


3f. Identify or approximate a sampling distribution based on the properties of the population

  • How do the properties of the population distribution affect the sampling distribution?

If you have a very large sample size, the short answer is that they do not. The sampling distribution will be bell-shaped and have a predictable mean and standard deviation.

This does not hold for smaller sample sizes, where the properties of the sampling distribution are far less predictable. In some cases we can derive sampling distributions and their properties through small-sample methods, but those examples are beyond the scope of this course.


3g. Compare and evaluate the sampling distributions of different sample sizes

  • What effect does changing the sample size have on the sampling distribution of a mean?

Keep two things in mind here:

  1. Unless the underlying population distribution is normal, the sample size must be sufficiently large for the sampling distribution to be normal.
  2. As we stated above, the standard error (standard deviation of the sample means) becomes smaller with a larger sample. More specifically, it decreases by a factor of the square root of the sample size. For example, if you multiply the sample size by 4, you must divide the standard error by 2 (the square root of 4).
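Point 2 is easy to see numerically; the standard deviation and sample sizes below are arbitrary illustration choices:

```python
sigma = 12.0   # hypothetical population standard deviation

def standard_error(sigma, n):
    """Standard deviation of the sample means: sigma / sqrt(n)."""
    return sigma / n**0.5

se_25 = standard_error(sigma, 25)    # 12 / 5  = 2.4
se_100 = standard_error(sigma, 100)  # 12 / 10 = 1.2

# Multiplying the sample size by 4 divides the standard error by 2.
```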

Review this material in Introduction to Sampling Distributions.


3h. Compare and evaluate the performance of different estimators based on their sampling distributions

  • What is bias and how does it differ from accuracy? Can an estimator be biased yet accurate? Can it be inaccurate and unbiased?
  • Why is the sample mean considered an unbiased estimator for the population mean?

No estimator is perfect. We use an estimator like a sample mean to give us the best possible estimate of a population mean. But because of sampling error the sample mean will always change for different samples.

What sampling error refers to is that because we do not have access to the entire population, our sample statistics will differ from the population parameter, and from each other. If you have a population of size 1,000 with a population mean of 50, you can take 5 samples of size 20, and because each time you have a different sample, you will get a different sample mean each time, such as {49, 51, 50, 48, 55}.

An unbiased estimator will sometimes estimate too high, sometimes too low, but you will get a good estimate in the long run if you average them. In other words, the sample mean is just as likely to overestimate the population mean by 2 as it is to underestimate by 2.

A biased estimator might be more likely to overestimate than underestimate, or vice versa. Accuracy differs from bias in that it refers to the amount by which a statistic will differ from the parameter. It is important, but not as much as bias. Accuracy is generally better when we have larger samples and there is less variance (or standard deviation) in the population.

Review this material in Characteristics of Estimators.


Unit 3 Vocabulary

  • Bias
  • Central Limit Theorem
  • Estimator
  • Random variable
  • Sampling distribution
  • Standard error
  • Sampling error
  • Unbiased estimator
  • Z distribution

Unit 4: Estimation with Confidence Intervals

4a. Explain the central limit theorem, and use it to construct confidence intervals

  • What role does the Central Limit Theorem have for estimation and sampling distributions?

We discussed the Central Limit Theorem in Unit 3. It is CRITICAL to note that if the distribution is non-normal, we MUST still have a large sample size. In other words, the Central Limit Theorem still applies. If it does not (non-normal distribution AND low sample size) then neither Z nor T will work.

The Central Limit Theorem still requires a normally distributed population or a sample size above 30. Using the T distribution instead of Z does not absolve us of these requirements.

Review the Central Limit Theorem in: 


4b. Compare t-distributions and normal distributions

  • What is the difference between a normal distribution and a T distribution?
  • Where would you need to use a T, instead of a normal (X) or standard normal (Z), distribution?

Refer to learning objective 2k to review the progression from continuous distribution, to symmetric, to bell-shaped, to normal (X) and standard normal (Z).

Now, let's introduce the T distribution, which you can think of as a "brother" to the normal distribution in this hierarchy.

The Student's T distribution is similar to the normal distribution except that it is slightly shorter and flatter, with heavier tails than X or Z. For example, the area to the right of two standard deviations in a normal distribution is about 0.023; in a T distribution it will be larger. The T family of distributions is defined by a single parameter called the degrees of freedom, which your book symbolizes simply as df, although some texts use other symbols like \nu. For the T distribution, the degrees of freedom equals the sample size minus one.

We use the T distribution in the same situations as the Z; however, we must use T instead when we are unsure of the shape of the population distribution, or when we are using an estimate of the standard deviation (s instead of σ).

To find an area bound by the T distribution, we can use technology similar to what we use for the Z distribution. There is also a T distribution table, with one row for each of the most common values of the degrees of freedom. The drawback is that we cannot use the table to find the exact area for an arbitrary T score; we can only find the "T score" corresponding to a particular area.

Note that as the sample size (and thus the degrees of freedom) increases, the T distribution becomes nearly indistinguishable from the Z distribution. In fact, the Z distribution IS the T distribution with "infinite" degrees of freedom! Using the T distribution gives us a larger margin of error to compensate for the fact that the exact standard deviation of the population is not known.

Review this material in T Distribution.


4c. Apply and interpret the central limit theorem for sample averages

  • We saw earlier that the Central Limit Theorem holds if we have either a large sample size OR an underlying normal distribution. Does this affect whether we should use a Z or T distribution for estimation, or whether we can use either at all?

Whenever we perform an estimation, we must use a T distribution when we have a small sample size or an unknown population standard deviation. The formula for using the T distribution to find the margin of error and confidence intervals is the same as the one for Z, except we substitute σ with s, and instead of finding the cutoff value for the Z score, we find the T score. This introduces an extra parameter: in addition to using the alpha value for the tail area, we need to use n − 1 (one less than the sample size) for the parameter df. This will result in a larger margin of error because, as we stated earlier, T distributions have heavier tails. 
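A minimal sketch of a T-based confidence interval for a small hypothetical sample; the critical value 2.262 is the T score for a 95% confidence level with df = 9, taken from a standard t table:

```python
from statistics import mean, stdev

# Hypothetical sample of n = 10 observations.
data = [9.1, 10.4, 9.8, 11.2, 10.0, 9.5, 10.9, 10.3, 9.7, 10.6]
n = len(data)
xbar = mean(data)
s = stdev(data)        # sample standard deviation s (sigma is unknown)

t_crit = 2.262         # t score for 95% confidence, df = n - 1 = 9

margin = t_crit * s / n**0.5
ci = (xbar - margin, xbar + margin)
```

Because the T score (2.262) is larger than the corresponding Z score (1.960), the margin of error is wider than it would be if σ were known.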

Review this material in The Sampling Distribution of the Sample Mean.


4d. Calculate, describe, and interpret confidence intervals for population averages and one population proportions

  • What does the confidence interval for a data sample tell you?
  • Why was it necessary for you to learn about sampling distributions first?

We use inferential statistics to interpret samples and make conclusions about the population of data. The general method for making these interpretations is roughly the same, whether it is for population averages, proportions, averages of two different populations, standard deviations, or any other statistic.

First, we find a point estimate which is usually the sample mean or proportion. Then, given the level of confidence desired, the sample size, and the standard deviation of the population, we can find a margin of error and subtract or add it to the point estimate to get a confidence interval for the population parameter.

When we refer to level of confidence, we mean the likelihood that our confidence interval contains the true population mean or proportion (or other parameter). Common confidence levels are 90%, 95%, and 99%. A higher confidence level gives a higher likelihood that the interval contains the population parameter. However, the price to be paid is that a higher confidence level naturally produces a wider interval.

So, for example, we could say there is a 95 percent probability that the population mean is in the interval 10 ± 2.8, or [7.2, 12.8]. If we want to be 99 percent confident, we have to increase the width of the interval, perhaps to 10 ± 3.2. A 100 percent confidence interval is not possible, since that would require an infinitely wide margin of error. So there is a give and take. Choosing a confidence level can be more art than science: balance your desire for accuracy with the need to keep the margin of error low.

The reason you had to learn about sampling distributions first is because inferential statistics involve predicting the characteristics of a population based on a sample. In order to do so, we must first study how those samples behave.

Use the given formulas to calculate the margin of error for the population mean, given a large sample (this is when you use the standard normal Z distribution). If the sample is small, or the population standard deviation is unknown, you must use the Student's T.

The formulas are very similar, except for the distribution: you would use the inverse Z distribution, or the inverse T distribution with the given degrees of freedom. You also have to plug in the Z or T score corresponding to α/2, where α is the total tail area. If, for example, you are trying to find a 90 percent confidence interval, that leaves 10 percent of the area in the tails, which you divide in half to get α/2 = 0.10/2 = 0.05.
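The α/2 arithmetic can be sketched with statistics.NormalDist, whose inv_cdf method acts as the inverse Z table:

```python
from statistics import NormalDist

Z = NormalDist()

def z_critical(confidence):
    """Z score leaving alpha/2 area in each tail, where
    alpha = 1 - confidence (e.g. 0.90 leaves 0.05 per tail)."""
    alpha = 1 - confidence
    return Z.inv_cdf(1 - alpha / 2)

z90 = z_critical(0.90)   # about 1.645
z95 = z_critical(0.95)   # about 1.960
z99 = z_critical(0.99)   # about 2.576
```

Notice how the critical value (and therefore the margin of error) grows as the confidence level rises.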

Review this material in:


4e. Interpret the student-t probability distribution as the sample size changes

  • How is the student-t distribution related to sample size? Why does this not matter for normal or standard normal distributions?
  • What happens to the student t distribution as the sample size increases? What distribution does it begin to resemble?

The student-t distribution, like the normal distribution, is a family of distributions defined by one or more parameters. A normal distribution is defined by its mean and standard deviation. The student-t distribution is defined by the number of degrees of freedom, which is equal to the sample size minus 1. 

The larger the sample size, the larger the degrees of freedom, and the smaller the margin of error. Also, the larger the degrees of freedom, the more the T distribution resembles the standard normal (Z) distribution. In fact, a Z distribution IS by definition the same as a T distribution with infinite degrees of freedom!

Review this material in:


Unit 4 Vocabulary 

  • Central Limit Theorem
  • Confidence interval
  • Degrees of freedom
  • Level of confidence
  • Margin of error
  • Normal distribution
  • Point estimate
  • Standard normal distribution
  • Student's T distribution

Unit 5: Hypothesis Test

5a. Differentiate between type I and type II errors, and find the probability of these errors

  • What is hypothesis testing? How is it related to confidence intervals?
  • What is a null and alternate hypothesis?
  • What is an error in the context of hypothesis testing? What is the difference between a Type I and Type II error? How do they relate to each other? Which is more serious?
  • How are the Type I and Type II errors calculated or determined?

Hypothesis testing is a form of inferential statistics similar to confidence intervals. We have a null hypothesis (H0) which we assume to be true by default, and an alternative hypothesis (H1 or Ha) which we can prove or fail to prove, based on sample data.

We have three types of alternate hypothesis:

  1. A right-tailed test has an alternate hypothesis that is greater than the null,
  2. A left-tailed test has a lower alternate hypothesis,
  3. A two-tailed test tests for both higher and lower values. While this may seem more convenient, the downside of a two-tailed test is that it reduces the power of the test, making it less likely to detect a change from the null hypothesis.

The conclusion of a hypothesis test is to reject or fail to reject, the null hypothesis. In other words, we assume the null is true, then we either find evidence (based on finding the p value) that it is false, or fail to find evidence that the null is false.

You can think about this situation as a jury trial. The null hypothesis is innocence, and the prosecutor tries to get a guilty verdict by providing evidence which makes the null hypothesis unlikely if all that evidence is true.

No hypothesis test is perfect, nor can it produce results that are absolutely, 100 percent reliable. This is because the samples are not completely like the population.

  • A Type I Error results when the null hypothesis is incorrectly rejected. For our jury example, this means the jury just convicted an innocent defendant.
  • A Type II Error is the opposite: failing to detect a difference from the null and incorrectly failing to reject the null hypothesis. For our jury example, the jury failed to convict a guilty defendant.

Remember: the term error does not necessarily mean the researcher made a mistake in their calculation. An error in statistics occurs when we do not have access to the population, and we may, by random chance, get a sample that does not properly represent the population.

For example, a drug study may show that a drug is ineffective simply because a large percentage of the sample had a genetic tendency that made the drug less effective for them. The drug should have been shown to be more effective; the error was caused not by a mistake, but by an unusual sample. By random chance, the researcher simply chose a group that was less helped by the drug.

Type I and Type II Errors are related in that, all other things being equal, they are inversely related. Researchers choose the Type I Error rate (𝛂) and calculate the Type II Error rate (𝜷) based on possible alternate values for the mean, or whatever else they are estimating. When you decrease one, all else being equal, you inevitably increase the other.

Calculating a Type II Error is beyond the scope of this course. However, this relationship shows why researchers do not simply choose a tiny number for the Type I Error: doing so would increase the Type II Error and make it harder to detect a true difference from the null. Which error is more serious depends on the situation.

For example, in our jury trial scenario we want to err on the side of not sending an innocent person to prison. Since this is represented by Type I Error, we might consider Type I Error to be more serious, and thus lower 𝛂, taking the chance that it will raise 𝜷.

If in a drug study, the null hypothesis is that a drug is safe for consumption, a Type II error would fail to find that the drug is dangerous, so that would be more serious. In this case, you might choose a more conservative value for alpha, even though it increases the probability that a safe drug will be rejected.

Review this material in:


5b. Describe and conduct hypothesis testing, calculate the p-value, and accept or reject the null hypothesis

  • What is the p value in hypothesis testing, what does it represent?
  • What does the p value tell you about whether to accept or reject the null hypothesis?

The p-value of a hypothesis test provides the key to getting the conclusion of the test. The p-value refers to the probability of obtaining a sample value equal to, or more extreme (see note below) than, the one we got, if we assume the null hypothesis is true.

A very low p-value (let's say 0.005) means that "if the defendant really is innocent, the probability that we could have obtained the blood and DNA evidence we did is extremely small". This is why a smaller p-value will cause us to reject the null hypothesis.

A very large p-value (say, greater than 0.10) means, for example: "we assume by default the drug is ineffective ... there is at least a 10 percent chance we could have gotten the results we did even if the drug is ineffective". This is far less impressive, and might lead us to fail to reject the null hypothesis, since we have not found enough evidence to prove the drug effective. 

The proper cutoff (where below would reject the null, and above would fail to reject) is subjective. The standard is 0.05, but can be as low as 0.01 for an aggressive test, or as high as 0.10 for a more conservative test. See above where we talk about cases where a Type I or Type II error is more serious.

Note: The definition of extreme depends on whether we are conducting a right-tail test (the probability of a result higher than the result we got), a left-tail test (lower result), or a two-tail test (higher OR lower).

Review this material in:


5c. Explain how to conduct hypothesis tests for a single population mean and population proportion, when the population standard deviation is unknown; perform this task; and interpret the results

  • How do you know when to use the z or t distributions in a hypothesis test for the mean? What about for a proportion? 
  • Under what circumstances could neither z, nor t be used?

You choose between the Z and T distributions using the same criteria as for a confidence interval. If you know the population standard deviation and you have a reasonably large sample size, use the Z form of the given equations. If the population standard deviation is unknown or the sample size is small, use T. (For a test on a proportion, you use the Z distribution, provided the sample is large enough.)

Remember that the Central Limit Theorem still applies: if the sample size is small, you MUST have a normally distributed population, or else you cannot use either Z or T.

The steps for performing a p-value test are:

  1. Decide (or be given) the value of the Type I error probability you will use (α).
  2. Use the appropriate formula to calculate the test statistic (or test value). The correct formula is determined by the parameter you are testing (mean, proportion, etc.) and, within each, the distribution you are using (Z or T test for the mean).
  3. Use technology or a distribution table to look up the probability of getting a value greater than the test value (right-tailed), less than the test value (left-tailed), or both (two-tailed). For example, if you are running a two-tailed test and get a test value of 1.85, you want to find p(Z > 1.85) + p(Z < −1.85). This probability is your p-value.
  4. Compare the p-value to alpha. If it is lower, reject the null hypothesis; if it is higher, fail to reject it.
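The four steps above can be sketched in code. The following is my own illustration (not from the course): a two-tailed Z test for a mean with known population standard deviation, with hypothetical function and argument names.

```python
import math

def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z test for a population mean (sigma known).
    Returns the test statistic, the p-value, and the decision."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # step 2: test statistic
    p = math.erfc(abs(z) / math.sqrt(2))               # step 3: p(Z > |z|) + p(Z < -|z|)
    decision = "reject H0" if p < alpha else "fail to reject H0"  # step 4
    return z, p, decision

# Example: a sample of n = 64 with mean 52, testing H0: mu = 50, sigma = 8.
print(z_test_mean(52, 50, 8, 64))  # z = 2.0, p ≈ 0.0455 -> reject H0
```

Note that alpha is fixed in step 1, before the data are examined; choosing it after seeing the p-value defeats the purpose of the test.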

Review this material in:


Unit 5 Vocabulary

  • Alternative hypothesis
  • Central Limit Theorem
  • Error
  • Fail to reject the null hypothesis
  • Left-tailed test
  • Null hypothesis
  • P-value
  • Reject the null hypothesis
  • Right-tailed test
  • Test statistic
  • Two-tailed test
  • Type I Error
  • Type II Error

Unit 6: Linear Regression

6a. Discuss and apply basic ideas of linear regression and correlation

  • What is the correlation coefficient and what does it tell us?
  • How is the correlation related to the slope of a regression line? Do they tell us roughly the same thing?

The correlation coefficient is a measure of the linear relationship between two variables x and y. It is a number between −1 and 1, inclusive.

  • 1 means there is a perfect positive correlation. The scatter plot slopes upward in a straight line.
  • −1 means perfect negative correlation. The scatter plot slopes downward in a straight line.
  • 0 means there is no correlation, as if every x value produces a completely random value for y.

In this way, correlation is related to the slope of the regression line in that the two always have the same sign. However, if the points lie on a straight line sloping upward, the correlation will be 1 regardless of how steep that line is. Remember, the slope of a line can be any real number, while the correlation is capped between −1 and +1.
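As a quick sketch (my own Python, not from the course), the correlation coefficient can be computed directly from its definition, and a perfectly linear upward-sloping data set gives r = 1 no matter how steep the line is:

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r between paired lists xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A line with slope 5 still has correlation exactly 1:
print(correlation([1, 2, 3, 4], [5, 10, 15, 20]))  # 1.0
```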

Review this material in:


6b. Identify the assumptions that inferential statistics in regression are based on

  • Why do we call the regression line the "least squares" regression line?
  • What conditions must be true of a sample of points to make the correlation or regression line statistically significant?

We determine whether a correlation or regression line is statistically significant in much the same way we test a sample mean or proportion for significance, as we reviewed in Units 4 and 5.

To find confidence intervals, we use the concepts and formulas the chapter Statistical Inferences about the Slope refers to. We conduct hypothesis testing for the slope in the same way as for any other statistic.

Remember that it is always best to find and interpret the correlation coefficient first. While correlation does not necessarily prove a causative relationship between the two variables, if the correlation is very low, it is unlikely that the regression line will be of any use.

As long as the points do not all lie on a vertical line, you WILL always get a solution for the least-squares regression line. Think of the phrase "garbage in, garbage out": if the slope is not significant, the regression is useless. You can also review the resource Testing the Significance of the Correlation Coefficient to learn how to do a hypothesis test for a correlation coefficient. There is a hypothesis test for just about everything in statistics!

The general idea behind the regression line formula is this: draw a vertical segment between every point on the scatter plot and the candidate line, and make each segment one side of a square. The line that gives the lowest total area of those squares (that is, the lowest sum of squared vertical distances) is considered the best fit; hence the name "least squares" regression line.
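That minimization has a closed-form solution. Here is a minimal Python sketch (my own function names, not from the course) of the standard formulas for the slope and intercept:

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares regression line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx      # slope: minimizes the sum of squared vertical distances
    a = my - b * mx    # intercept: the line passes through (x-bar, y-bar)
    return b, a

print(least_squares_line([1, 2, 3, 4], [2, 3, 5, 6]))  # ≈ (1.4, 0.5)
```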

Review this material in:


6c. Compute the standard error of a slope

  • What does the standard error of a slope tell you?
  • How is the standard error computed?

The standard error of a slope tells you essentially the same thing as any other standard error. Recall from Unit 3, where we first defined standard error: it is the standard deviation of the sampling distribution. The standard error of the mean, for example, is the standard deviation of a set of sample means. It indicates how reliable the samples are by showing how much they vary. A low standard error produces a narrower confidence interval and makes it more likely that we will reject an incorrect null hypothesis.

Carry this logic forward to the interpretation of a slope. The standard error may not tell you much by itself (its computation is more complex than for means and proportions) but it is a component of statistical inference involving the slope of a regression line. The formula for the standard error is complex, but you can find it here: Regression Slope Test.
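For concreteness, here is one common form of that computation sketched in Python (my own code; the resource linked above gives the formula itself): the residual standard error divided by the square root of Sxx.

```python
import math

def slope_standard_error(xs, ys):
    """Standard error of the least-squares slope:
    SE_b = sqrt(SSE / (n - 2)) / sqrt(Sxx)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # squared residuals
    return math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

print(round(slope_standard_error([1, 2, 3, 4], [2, 3, 5, 6]), 4))  # ≈ 0.1414
```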

Review this material in Statistical Inferences about the Slope.


6d. Test a slope for significance

  • How would we test a slope for significance? How does this relate to hypothesis testing?

As stated above, hypothesis testing for the slope or correlation of a regression line works in the same general way as for the mean and proportion: you have a null hypothesis of no relationship (r = 0) and an alternative that is almost always two-tailed (r ≠ 0). You can use the formulas in the resources below to find the T statistic and then use the same methods (as we used in Units 4 and 5) to find the p-value: the combined area of the right and left tails formed by the positive and negative of that T statistic.
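For the correlation version of this test, the T statistic has a simple closed form, with n − 2 degrees of freedom. A minimal sketch (my own, not from the course resources):

```python
import math

def correlation_t_statistic(r, n):
    """T statistic for testing H0: correlation = 0 against a two-tailed
    alternative, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Example: r = 0.8 from a sample of n = 11 points gives t = 4.0,
# which is then compared to the t distribution with 9 degrees of freedom.
print(round(correlation_t_statistic(0.8, 11), 4))  # ≈ 4.0
```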

Review this material in Statistical Inferences about the Slope.


6e. Construct a confidence interval on a slope

  • What should the confidence interval for a slope look like if the slope is significant? Is this similar to the significance test?

Remember, the confidence interval gives the range of values that most likely contains the true parameter. In the case of the slope, we want a confidence interval that does NOT include 0. If the interval is, say, [−0.8, 2.1], then the slope could be positive, negative, or zero, which would lead us to conclude that the slope we found is not significant.
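To make the "does the interval contain 0?" check concrete, here is a small sketch (my own Python; the t critical value 4.303 is a standard table value for 2 degrees of freedom at 95% confidence, used purely as an illustration):

```python
def slope_confidence_interval(b, se, t_crit):
    """Confidence interval for the slope: b +/- t* x SE."""
    return b - t_crit * se, b + t_crit * se

def slope_is_significant(interval):
    """The slope is significant only when the interval excludes 0."""
    lo, hi = interval
    return not (lo <= 0 <= hi)

print(slope_is_significant(slope_confidence_interval(1.4, 0.1414, 4.303)))  # True
print(slope_is_significant((-0.8, 2.1)))  # False: slope could be either sign
```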

Review this material in Statistical Inferences about the Slope.


6f. Calculate and interpret the coefficient of determination and the correlation coefficient

  • What is the coefficient of determination and how is it calculated?
  • How is the correlation coefficient calculated and how is it related to the coefficient of determination?

The coefficient of determination, simply put, is the square of the correlation coefficient, and it can be calculated that way. (There is also a formula in your text for calculating this value directly, in case the correlation coefficient has not already been computed.) What the coefficient tells us, in effect, is the proportion of the variation in y that is explained by the variable(s) x. So if the correlation is 0.8, the coefficient of determination is 0.64, telling us that roughly 64% of the variation in the dependent variable (y) is explained by the independent variable (x).
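As a trivial sketch of that calculation (my own, not from the text):

```python
def coefficient_of_determination(r):
    """r squared: the proportion of the variation in y explained by x."""
    return r ** 2

print(round(coefficient_of_determination(0.8), 2))  # 0.64
```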

Review this material in:


Unit 6 Vocabulary

  • Coefficient of determination
  • Correlation coefficient
  • Least-squares regression
  • Scatter plot
  • Slope
  • Standard error
  • Y-intercept