# BUS204 Study Guide

 Site: Saylor Academy Course: BUS204: Business Statistics (2018.A.01) Book: BUS204 Study Guide
 Printed by: Guest user Date: Wednesday, September 27, 2023, 6:15 PM

## Navigating this Study Guide

#### Study Guide Structure

In this study guide, the sections in each unit (1a., 1b., etc.) are the learning outcomes of that unit.

Beneath each learning outcome are:

• questions for you to answer independently;
• a brief summary of the learning outcome topic; and
• resources related to the learning outcome.

At the end of each unit, there is also a list of suggested vocabulary words.

#### How to Use this Study Guide

1. Review the entire course by reading the learning outcome summaries and suggested resources.
2. Test your understanding of the course information by answering questions related to each unit learning outcome and defining and memorizing the vocabulary words at the end of each unit.

By clicking on the gear button on the top right of the screen, you can print the study guide. Then you can make notes, highlight, and underline as you work.

Through reviewing and completing the study guide, you should gain a deeper understanding of each learning outcome in the course and be better prepared for the final exam!

## Unit 1: Introduction to Statistical Analysis

### 1a. Explain the importance of statistics to business

• Why is it important for business students to study statistics?

Although you will very rarely be performing statistical studies or calculating conclusions from surveys in the business world, data is everywhere, and the business community is no exception. Is the increase in sales last year significant? Which of 5 laptop colors do consumers most prefer? If you're a business manager, you WILL be involved in data-based decisions, and you will likely be working with or supervising marketing or market research staff. In such cases, knowing at least some statistical concepts will be vital for knowing how well those teams are performing.

Also, when reading the results of statistical studies, you should understand concepts such as the p-value (a number that indicates how strong the evidence is against a statistical hypothesis) so that you can properly use those results to make sound business decisions.

In a lot of ways, using statistics in business is similar to doctors running tests on patients. The test results themselves are usually computer-generated. The lab staff run the tests, but it takes a trained physician to interpret the results accurately. That will be similar to your role – you'll have to be able to read statistical results and know how best to use them.

Review Why Do We Need to Study Statistical Analysis as Part of a Business Program? for more details.

### 1b. Explain the differences between quantitative and qualitative data, and identify examples of each type of data

• What is the difference between quantitative and qualitative data? What is another name for each?
• Is a phone number considered quantitative or qualitative? It's made up of numbers so it must be quantitative, right? Same thing for a ZIP code.

In statistics, the difference between various data types is important, since it affects which types of tests and graphs are more useful for certain kinds of questions. Quantitative (also known as numeric) data is data that is mathematical in nature. With this kind of data, we can find the sum, average/mean, and other statistics that involve mathematical operations. Qualitative (also known as categorical) data is non-numeric. This kind of data is made of text, letters, words, or even digits, but in these cases, no mathematical operations are possible or make sense.

For example, consider phone numbers and ZIP codes. Although they are made of numbers (and sometimes dashes or parentheses), it would not make sense to do calculations with them. Of all the phone numbers stored in your phone, you would never want to find an "average phone number" or a "total zip code". For this reason, phone numbers and ZIP codes are considered qualitative data.

If you're wondering about the grammar involved, "data" is a plural term that refers to multiple numbers or points. The singular is "datum", but you only see that word used very rarely, since we are almost always looking at multiple numbers when performing statistical analysis.

Review Definitions of Statistics, Probability, and Key Terms for more details.

### 1c. Define and apply the following terms: data sets, mean, median, mode, standard deviation, variance, population and sample

• What is the difference between a measure of central tendency and a measure of spread?
• In what situations is using the median for a measure of center preferable to using the mean? In what situation is mode the only viable possibility?
• What's the difference between a population and a sample?
• What is the difference between standard deviation and variance?

A set is a collection of any number of items, and a data set is a collection of any number of data points.

Measures of central tendency include the mean, median, and mode, and describe the location of a data set or what a "typical" data point is within that set.

Measures of spread include the variance and standard deviation. They are just as important as measures of center, describing how homogeneous (low spread) or heterogeneous (high spread) the data is. Data with a lower spread is easier to do predictive (inferential) analysis on. Variance is the average of the squared differences between each data point and the mean. We use the square of the difference instead of just the difference so that a data point far above the mean doesn't cancel out a data point far below the mean to show no difference. The square root of the variance is taken to give us the standard deviation. This gets the measure back to the original units of the data. If the original data is in minutes, then the unit of the variance will be "square minutes", which doesn't make sense, so taking the square root gets the unit back to minutes.
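As a rough illustration of the calculation described above, the sketch below works out the variance and standard deviation by hand. The waiting times are made-up numbers, and the data is treated as an entire population, so the sum of squares is divided by the number of data points (for a sample, the usual convention is to divide by one less than that).

```python
import math

# Hypothetical waiting times in minutes, treated here as a whole population.
times = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(times) / len(times)

# Variance: the average of the squared differences from the mean.
squared_diffs = [(t - mean) ** 2 for t in times]
variance = sum(squared_diffs) / len(times)   # units: "square minutes"

# Standard deviation: the square root of the variance, back in minutes.
std_dev = math.sqrt(variance)

# The unsquared differences cancel each other out, which is why we square them.
raw_diff_total = sum(t - mean for t in times)

print(mean, variance, std_dev, raw_diff_total)
```

Notice that the total of the unsquared differences is zero, even though the data clearly has spread.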

In some cases, you can use either the mean or median as a measure of center, but if the data is right or left skewed (there are more outliers at the high or low ends of the distribution), the mean can be misleading since it is greatly affected by outliers. The median is more resistant to outliers, so it is preferable when you have skewed (non-symmetric) data as opposed to a uniform distribution or bell curve.

The mode is the data point or points that occur most frequently. It is the only measure of center that can be used for qualitative data, since it's the only one that does not involve calculation. In an election of two or more candidates decided by plurality, the winner is the candidate who is the mode of all votes cast.
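To see how an outlier pulls the mean but not the median, here is a small sketch using Python's built-in `statistics` module. The salary figures and candidate names are invented for illustration only.

```python
import statistics

# Hypothetical salaries in thousands of dollars; the 250 is an outlier
# that skews the data to the right.
salaries = [32, 35, 35, 38, 40, 41, 45, 250]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

# The mode also works for qualitative data, such as votes for candidates.
votes = ["Lee", "Kim", "Lee", "Park", "Lee", "Kim"]
winner = statistics.mode(votes)

print(mean_salary, median_salary, mode_salary, winner)
```

Seven of the eight salaries are below the mean here, so the median gives a better picture of a "typical" salary.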

The population is the entire set of data that is being studied. In most cases, we don't have access to the population when we do statistical analysis. Because of this, we use one of several sampling methods to find a representative sample of the population. In each case, it is important to define your population and sample. If you want the average price of pizza in a small city and you collect prices from every pizza restaurant in that city, your data is the population, since no restaurant is left out. However, the same data could also serve as a sample if the population being studied is pizza prices in similarly-sized cities in the region.

Review section 2.2, section 2.3, and section 2.4 of the textbook for more details.

### 1d. Summarize and interpret data in a tabular format using frequency distributions and visually with histograms

• What is the difference between a frequency distribution and a relative frequency distribution?

A random variable is a variable that has an unknown value through randomness or experimentation, which differentiates it from an algebraic variable that is unknown until you solve for it in an equation.

A random variable can be quantitative or qualitative. A quantitative random variable can be discrete, which means it has a small fixed list of possible values. A classic example is a six-sided die, whose only possible values are 1, 2, 3, 4, 5, or 6. A quantitative variable can also be continuous, meaning there is an infinite or nearly infinite number of possible values. For example, salaries in a large company might run from $20,000 all the way up to $500,000, meaning we wouldn't be able to conveniently list all possible salaries. Even if there were only 5 people in a company, with salaries ranging from $30,000 to $100,000, there would still be about 70,000 possible whole-dollar values. Even if the salaries were rounded to the nearest $100, we would still have about 700 possible values, which is still far too many to list.

A frequency distribution is a list of all possible values of a discrete random variable, together with the number of times (the frequency) each value occurs in the data. If the variable is continuous, we list intervals of values rather than individual values, since it would not be possible to list all individual values. So, for the salary example above, we could group salaries into intervals of $10,000.

A relative frequency distribution is the same as a frequency distribution, except the total for each data value or interval is divided by the sample size, so that the relative frequencies are expressed as proportions of the whole and thus add to 1.

In the case of continuous variables, it is important to ensure that all intervals are of the same width, that every possible value between the lowest and highest value fits into exactly one interval, and that there are no overlapping intervals.

In a table, if a data value or interval of values has no data points, it must be represented in the table with a frequency of zero.
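The rules above can be sketched in a few lines of Python. The salaries below are hypothetical, the intervals are $10,000 wide, and the code assumes every salary is at least `low` and below `high`. Each interval starts with a count of zero so that empty intervals still show up in the table.

```python
# Hypothetical salaries in dollars (made-up values for illustration).
salaries = [31000, 34500, 38000, 42000, 47500, 55000, 58000, 61000, 95000, 99000]

low, high, width = 30000, 100000, 10000

# One equal-width interval per $10,000: 30000-39999, 40000-49999, ...
# Starting every count at zero keeps empty intervals in the table.
frequency = {start: 0 for start in range(low, high, width)}

for s in salaries:
    start = low + ((s - low) // width) * width  # lower bound of the interval containing s
    frequency[start] += 1

# Relative frequency: each count divided by the sample size.
n = len(salaries)
relative_frequency = {start: count / n for start, count in frequency.items()}

for start in frequency:
    print(f"{start}-{start + width - 1}: {frequency[start]} ({relative_frequency[start]:.2f})")
```

Because each salary lands in exactly one interval, the frequencies add up to the sample size and the relative frequencies add up to 1.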

See section 2.1 of the textbook for more details.

### 1e. Describe data sets, create frequency tables, and draw histograms

• What is the difference between a bar graph and a histogram?
• Why is it important for the bars on a histogram to have equal width and spacing?

A histogram is a specific type of bar graph made for quantitative data, and is the graphical representation of a frequency or relative frequency distribution. In a bar graph, you can have qualitative data, so the bars do not need to be in any particular order.

A histogram, like a frequency distribution, has to follow similar rules: bars must be of equal width, and they must have no spaces between them. If a particular data value or interval doesn't have any data points in it, then a placeholder (or a zero-height bar) must be placed in its location on a histogram. This is done because if you have outliers in your data set, then placing them too close to the rest of the bars on the histogram will make the data seem more homogeneous than it really is. The histogram needs to accurately reflect the shape of the data, so the scale must stay consistent.

A stem-and-leaf plot is similar to a histogram, except the intervals consist of the first digit(s) of a number and are listed down a column. The remaining digits are listed to the right of their corresponding number, at equal intervals for each row.
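If it helps to see one built, here is a minimal sketch of a stem-and-leaf plot for two-digit numbers, where the tens digit is the stem and the ones digit is the leaf. The exam scores are hypothetical.

```python
from collections import defaultdict

# Hypothetical exam scores.
scores = [67, 72, 75, 78, 81, 83, 83, 88, 90, 94]

stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)  # tens digit is the stem, ones digit is the leaf
    stems[stem].append(leaf)

# Print every stem from lowest to highest, even one with no leaves,
# so the shape of the data is not distorted.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Turned on its side, the rows of leaves look like the bars of a histogram with intervals of width 10.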

To review, see Shapes of Distributions. Pay particular attention to the Excel examples, where the option of "set gap width to 0" must be set. Also see section 2.1 of the textbook.

### Unit 1 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• statistics
• measure of center
• mean
• median
• mode
• standard deviation
• variance
• population
• sample
• random variable
• quantitative
• qualitative
• discrete
• continuous
• frequency distribution
• relative frequency distribution
• histogram
• bar graph
• pie chart
• stem-and-leaf plot

## Unit 2: Counting, Probability, and Probability Distributions

### 2a. Identify values of and differentiate between permutations and combinations

• What is the difference between a combination and a permutation?

A combination is a selection of $x$ objects from a larger group of $n$ objects in which the order of the selected objects does not matter.

A permutation is the same, except that order matters. For example: If we are choosing $x=3$ letters from the first five ($n=5$) letters of the alphabet, there are 10 combinations and 60 permutations, since within each of the 10 combinations, the letters can be arranged 6 different ways: ABC, ACB, BAC, BCA, CAB, CBA. But, all six of them represent the same combination.

It is not only the order that matters. If the selected members can be assigned any unique characteristics, such as 3 people being elected President, Vice President, and Treasurer from a club of 20 people, we are looking at a permutation, since for any group of 3 members, it does matter which title each person has, just like it would matter what order they'd be in if they stood in a line.
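The counts in the examples above can be checked with Python's standard library (`math.comb` and `math.perm` need Python 3.8 or later). This is only a sketch for verifying the numbers, not a required method for the course.

```python
import math
from itertools import combinations, permutations

letters = "ABCDE"  # n = 5
x = 3

n_combinations = math.comb(len(letters), x)  # order does not matter
n_permutations = math.perm(len(letters), x)  # order matters

# Listing every selection confirms the counts.
combo_list = list(combinations(letters, x))
perm_list = list(permutations(letters, x))

# Each combination of 3 letters can be ordered in 6 different ways.
orderings_of_abc = ["".join(p) for p in permutations("ABC")]

# President, Vice President, and Treasurer chosen from a club of 20:
# the titles make the selected members distinguishable, so this is a permutation.
officer_slates = math.perm(20, 3)

print(n_combinations, n_permutations, orderings_of_abc, officer_slates)
```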

For more outside of what we've already covered, see Permutations and Combinations.

### 2b. Explain and apply the different methods for determining probability: equally likely outcomes, frequency theory, and subjective theory

• Define probability.
• Describe the method of equally likely outcomes and how it can be used to find probabilities.

An outcome is a single possible result for an experiment. An event is made up of one or more outcomes. For example, {1,2,3,4,5,6} are the possible outcomes for a die roll. The event "Roll greater than a 4" is made of the outcomes {5,6}.

The set of possible outcomes for a probability experiment is called the sample space. A set is made of elements.

The fundamental definition of probability is the number of possible "successful" events or outcomes, divided by the total number of possible events or outcomes. So the probability of rolling a 5 on a 6-sided die is 1/6, because the "success" is rolling a 5 (one possible success) out of 6 possible outcomes. We can use this principle as long as each outcome is equally likely.

The intersection of two (or more) sets is the set whose elements appear in both (or all) sets. The union of two (or more) sets is the set whose elements appear in at least one of those sets. Example: if $A=\{1,2,3,4\}$ and $B=\{2,3,4,6,8\}$, then the union $A \cup B$ is $\{1,2,3,4,6,8\}$ (we do not repeat the 2, 3, or 4), and the intersection $A \cap B$ is $\{2,3,4\}$. The empty set $\emptyset$ has no elements. If $E=\{1,2\}$ and $F=\{3,4\}$, then $E \cap F = \emptyset$.
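Python's built-in `set` type follows the same rules, so it can be used to experiment with these ideas. In the sketch below, the event "roll an even number" is an extra event added only for illustration, and `probability` is a hypothetical helper that applies the equally-likely-outcomes principle.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # outcomes of one die roll
greater_than_4 = {5, 6}            # event "roll greater than a 4"
even = {2, 4, 6}                   # event "roll an even number" (added for illustration)

def probability(event):
    """Equally likely outcomes: successful outcomes divided by all outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

union = greater_than_4 | even         # elements in at least one of the sets
intersection = greater_than_4 & even  # elements in both sets
empty = {1, 2} & {3, 4}               # no shared elements, so the empty set

print(probability({5}), probability(greater_than_4))
print(union, intersection, empty)
```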

To review, see section 3.1 of the textbook.

### 2c. Define and apply the axioms of probability theory

• What is a compound event? How do unions and intersections of events relate to unions and intersections of sets?
• What is the difference between independent and dependent events?
• What is the difference between mutually exclusive and non-mutually exclusive events?
• Why are there two different formulas for $P(A\cap B)$?
• Why are there two different formulas for $P(A\cup B)$?
• Explain why for independent events $A$ and $B$, $P(A | B) = P(A)$.

Compound events happen when two or more events are connected by "AND" (intersection) or "OR" (union). $P(A\cap B)$ is the probability of both events $A$ and $B$ occurring. $P(A\cup B)$ is the probability of either $A$ OR $B$ (or both) occurring. The events can be represented as circles in a Venn diagram, and the outcomes as data points within those events. Note that a circle can also be used to represent a set of data points. $P(A)$, where event $A$ includes outcomes $A_1$, $A_2$, and $A_3$ is the same thing as finding the proportion of the number of outcomes in set $A$ divided by the number of all outcomes in the sample space.

Two events are independent if the occurrence of one does not depend on the occurrence of the other. Examples of independent events are $A=$"Roll a 5 on a die" and $B=$"Flip heads on a coin". The two have no causal relationship. Dependent events are $A=$"The temperature is below freezing" and $B=$"It snows". In this case, event $B$ is much more likely if $A$ occurs than if it doesn't.

Mutually exclusive events (also called disjoint events) are events that cannot both occur. Example: $A=$"Roll greater than a 4 on a six-sided die"; $B=$"Roll less than 3 on a six-sided die". These are mutually exclusive, because they cannot both happen. Symbolically, this is described as $P(A \cap B)=0$.

The notation $P(A | B)$ is pronounced "$A$ given $B$". It means that we are looking for the probability that $A$ occurs given that $B$ occurs. This is called conditional probability.

The addition rule is used to find the probability of an "or" compound event.

• $P(A \cup B)=P(A)+P(B)-P(A\cap B)$
• If $A$ and $B$ are mutually exclusive, then the last term is 0 and the formula reduces to
$P(A \cup B)=P(A)+P(B)$

The multiplication rule is used to find "and" probabilities:

• $P(A \cap B)=P(A)\times P(B|A)$ is the general multiplication rule, and it works for all events.
• $P(A \cap B)=P(A)\times P(B)$ is the special multiplication rule, and it works for independent events. Note that if the events are independent, $P(B|A)=P(B)$.
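One way to convince yourself that these rules work is to check them against a direct count of outcomes. The sketch below does that for a single die roll. The event `B` ("roll an even number") and the helper functions `p` and `p_given` are assumptions introduced for the example.

```python
from fractions import Fraction

space = {1, 2, 3, 4, 5, 6}  # one roll of a six-sided die

def p(event):
    """Probability by counting equally likely outcomes."""
    return Fraction(len(event), len(space))

def p_given(event, condition):
    """P(event | condition): count only the outcomes where the condition holds."""
    return Fraction(len(event & condition), len(condition))

A = {5, 6}     # roll greater than a 4
B = {2, 4, 6}  # roll an even number (extra event for illustration)
C = {1, 2}     # roll less than 3, mutually exclusive with A

# General addition rule: subtract the overlap so it is not counted twice.
p_a_or_b = p(A) + p(B) - p(A & B)

# Mutually exclusive events: the overlap is empty, so the last term is 0.
p_a_or_c = p(A) + p(C)

# General multiplication rule.
p_a_and_b = p(A) * p_given(B, A)

# Special multiplication rule for independent events:
# rolling a 5 on a die and flipping heads on a coin.
p_five_and_heads = Fraction(1, 6) * Fraction(1, 2)

print(p_a_or_b, p_a_or_c, p_a_and_b, p_five_and_heads)
```

Each result matches what you get by counting the outcomes in the union or intersection directly.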

To review, see section 3.2, section 3.3, and section 3.5 of the textbook, and see Addition Rule for Probability.

### 2d. Apply probability distributions and explain the properties of different distributions

• What is a random variable? What is the difference between a discrete and a continuous random variable?
• How is a probability distribution similar or different from a relative frequency distribution?
• What are some uses for probability distributions?
• What is sampling error and why does it occur? Does it imply a mistake in data gathering?

Remember: a random variable is a variable that has its value as a result of an experiment or survey as opposed to solving an equation. A discrete random variable has a fixed number of possible values. A continuous random variable cannot easily have all its possible values listed.

A probability distribution consists of each possible value (or interval of values) of a random variable, and the probability that the variable will take on that value. Probability distributions have many implications in decision making. When you interpret data, you will need to know what probability distribution the value you're trying to estimate follows.

As an example, if you think that 30% of consumers might be interested in a particular product, and sample data from a small group shows 26%, you would conduct a statistical test (see Unit 5) to find how likely it is that a sample proportion would be as far from 30% as 26% is if the true value is 30%. Remember, even if the true population proportion is 30%, because of random sampling error, proportions of samples will likely differ from 30%, so we have to find if the 26% is a significant enough difference to cast doubt on the hypothesis of 30%.

Sampling error occurs whenever statistics are computed from samples, because different samples of the same population will give different sample means, since each sample is unique. For example, if the mean of a population is 10, then the sample means should cluster around 10, but may not be exactly 10. Sampling error is a natural result of random sampling and does not imply a mistake in data gathering.
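A quick simulation can make sampling error visible. In this sketch the population is artificial (it is constructed so that its mean is exactly 10), the sample size of 25 is an arbitrary choice, and the random seed is fixed only so that the run is repeatable.

```python
import random
import statistics

rng = random.Random(204)  # fixed seed so the run is repeatable

# Artificial population of 1,000 values whose mean is exactly 10.
population = [8, 9, 10, 11, 12] * 200
population_mean = statistics.mean(population)

# Draw 1,000 random samples of 25 values each and record each sample mean.
sample_means = [
    statistics.mean(rng.sample(population, 25))
    for _ in range(1000)
]

average_of_sample_means = statistics.mean(sample_means)

print(population_mean)
print(sample_means[:5])            # individual sample means differ from one another
print(average_of_sample_means)     # but together they cluster around the population mean
```

No data was gathered incorrectly here, yet the individual sample means still differ from the population mean. That difference is sampling error.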

To review, see Probability Density Functions and Random Variables.

### 2e. Solve problems using binomial distribution, and explain when it should be used

• What properties must be true of all the experiments in an event to make that a binomial event?
• When calculating binomial probabilities, why is it important to use the formula for calculating the number of combinations?
• Why is it important that the probabilities of success be equal for each experiment?
• Why is it important that the experiments are independent of each other?

A binomial event has the following properties:

• It consists of $n$ experiments, each of which can have 2 possible outcomes.
• All of the experiments succeed or fail independently of each other, with equal probability $P$.

Example: Roll a 6-sided die $n=10$ times. The probability $P$ of rolling a 6 is $\frac{1}{6}$. Because the experiment (die roll) is performed a fixed number of times, each time the probability of "success" (rolling a 6) is equal, there are two possible outcomes (6 or not 6), and each die roll is independent of all the others, this qualifies as a binomial experiment.

Each experiment will result in either a "success" or "failure" (two possible outcomes). Success is simply defined as the result we are looking for, whether it is a good or bad occurrence.

A binomial random variable represents the number of successes $x$ out of $n$ experiments. Example: If $n=10$ experiments, then the random variable $x$ can have values $\{0,1,2,...,9,10\}$.
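The study guide does not print the binomial formula, so the sketch below assumes the standard one: the number of combinations of $x$ successes out of $n$ experiments, times $P^x$, times $(1-P)^{n-x}$. The combinations term is needed because the $x$ successes can fall on any of the $n$ experiments. The function name `binomial_probability` is made up for this example.

```python
from math import comb

def binomial_probability(x, n, p):
    """P(exactly x successes in n independent experiments, each with success probability p)."""
    # comb(n, x) counts the different positions the x successes can occupy.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 1 / 6  # ten die rolls, where "success" means rolling a 6

# Probability of each possible value of the binomial random variable: 0, 1, ..., 10.
distribution = {x: binomial_probability(x, n, p) for x in range(n + 1)}

for x, prob in distribution.items():
    print(x, round(prob, 4))

print("total:", round(sum(distribution.values()), 4))
```

Since the values 0 through 10 cover every possible result, the probabilities add up to 1.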

To review, see section 4.2 of the textbook.

### 2f. Differentiate between discrete and continuous probability distributions

• What is the difference between a discrete and a continuous distribution?
• Is the Binomial Distribution considered discrete or continuous?

Discrete probability distributions can have all possible outcomes listed. The random variable is discrete.

Continuous probability distributions are made up of possible intervals of values for a continuous random variable. The individual values of the random variable cannot all be listed, but the intervals they're grouped into can be.

One property of a continuous random variable $X$ is that $P(X=x) = 0$ for any single value $x$. In other words, when finding probabilities involving continuous distributions, we only find probabilities of $X$ falling within an interval of values, just as a continuous frequency distribution only uses intervals of possible values.

Examples of discrete distributions are uniform-discrete, binomial, and Poisson distributions. Examples of continuous distributions are uniform-continuous and normal distributions.

To review, see Probability Density Functions and Random Variables.

### 2g. Apply expected value and calculate it for various probability distributions

• Why is the mean of a distribution often referred to as its expected value?
• Suppose sets of 20 binomial experiments, each experiment having a 0.25 probability of success, are run over and over, and the number of successes out of 20 is recorded for each set. What should we expect those numbers to average (arithmetic mean) to?

Expected value, also called the mean of a probability distribution, is the expected long-term arithmetic mean of random variables following that distribution. For example, if a binomial event consists of 6 experiments, each with $P=\frac{1}{3}$ chance of success, then the expected value of that distribution will be one third of 6, or 2 successes. For a Binomial distribution, the expected value or mean is equal to $n \times P$.

Calculating the expected value of a distribution will in theory give you the same number as if you took random numbers with that distribution and calculated the arithmetic mean. In the example above, where $n=6$ and $P=\frac{1}{3}$, if you conduct the 6 experiments over and over again, the number of successes might be 1,1,1,2,2,2,2,2,2,3,5. The arithmetic mean of these 11 numbers is about 2.09, which is close to the expected value of 2.
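As a check on the $n \times P$ shortcut, the sketch below computes the expected value the long way, by weighting each possible number of successes by its probability. It assumes the standard binomial probability formula, which this guide does not print.

```python
from math import comb

n, p = 6, 1 / 3  # six experiments, each with a one-in-three chance of success

# Probability of each possible number of successes (standard binomial formula).
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

# Expected value: each value weighted by its probability, then summed.
expected_value = sum(x * prob for x, prob in pmf.items())

# Shortcut for a binomial distribution.
shortcut = n * p

# The same shortcut for 20 experiments with a 0.25 probability of success.
expected_for_20 = 20 * 0.25

print(round(expected_value, 6), round(shortcut, 6), expected_for_20)
```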

To review, see section 3.1 of the textbook. This passage refers to the long term probability of an event. Expected value is the long-term mean of random variables occurring with that probability.

### Unit 2 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• combination
• permutation
• set
• outcome
• event
• probability
• union
• intersection
• compound events
• mutually exclusive events
• independent events
• conditional probability
• multiplication rule
• random variable
• continuous random variable
• discrete random variable
• probability distribution
• binomial random variable
• discrete probability distribution
• continuous probability distribution
• expected value
• sampling error

## Unit 3: The Normal Distribution

### 3a. Define and apply the properties of the normal distribution while understanding real-world implications and applications of the normal distribution

• What are the properties of a normal distribution?
• Why do you think these distributions have the name "normal"?
• We've stated before that the probability of any continuous random variable EQUALING x is effectively 0. Knowing the properties of the normal distribution, why does this have to be the case?

Every data point in a distribution has a corresponding Z-score, which is the number of standard deviations that the data point lies above (positive) or below (negative) the mean.

Note first that the normal distribution is not a single distribution, but a family of distributions that share the properties listed below. Every combination of mean and standard deviation is considered its own distribution.

A normal distribution is a continuous distribution with the following properties:

• The distribution is symmetric and the histogram is bell-shaped.
• The mean, median, and mode are all equal and at the center of the distribution and bell curve.
• The location, or center, is defined by the mean; the spread, or variation, is defined by the standard deviation. Each normal distribution is defined by its mean $\mu$ and standard deviation $\sigma$.

The data is roughly distributed as follows:

• 68% of all data points are within one standard deviation of the mean, or have a Z-score between -1 and +1.
• 95% of all the data are within two standard deviations of the mean, or have a Z-score between -2 and +2.
• 99.7% of all the data are within three standard deviations of the mean, or have a Z-score between -3 and +3.

These guidelines are known as the empirical rule.
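You can confirm these percentages with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The mean of 100 and standard deviation of 15 below are arbitrary, hypothetical values. Any other pair would give the same areas, because the rule applies to every normal distribution.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical mean and standard deviation
dist = NormalDist(mu, sigma)

# Area under the curve within k standard deviations of the mean.
within = {
    k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```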

By definition, the total area under a normal curve is 1. Finding the probability that a normal random variable $x$ is between two values (remember that a property of continuous distributions is that the probability of $x$ having any single value is effectively 0) is done by finding the area under the normal distribution curve between those two values. This is a more intuitive explanation of why the probability of a single value has to be zero.

Many real-life populations are normally distributed: heights of people, for example, or heights of other animals within their own species. In a large enough class, exam scores can be expected to follow a roughly normal distribution. Even some non-scientific data tends toward normality.

As with any real-life case, a perfectly normal distribution is rare.

To review, see Qualitative Sense of Normal Distributions.

### 3b. Use the normal distribution to estimate the probability of an event occurring

• What is the difference between a normal distribution and the standard normal distribution?
• Why does the original distribution you want to find probabilities for first need to be converted to the standard normal before using a table to look up values?
• Most standard normal distribution tables give the area to the left of a particular Z-score. How would you find the area to the right of that Z-score? What about between two Z-scores if you are only given area to the left of each one?

Finding the probability that a normal random variable $x$ is between two values (remember that a property of continuous distributions is that the probability of $x$ having any single value is effectively 0) is done by finding the area under the normal distribution curve between those two values.

Finding the area is done using a computer or tables. In order to use the tables, the probability distribution being considered must be converted to the standard normal distribution.

The standard normal distribution (now we can say "the") is the normal distribution with a mean $\mu=0$ and a standard deviation $\sigma=1$.

Convert the endpoints of the interval whose area you're seeking to Z-scores, and then use either a table of standard normal values or a calculator or app to calculate the probability.

This process can be reverse-engineered to find the percentile of a distribution (What percent of values fall below that value?). Find the Z-score that would have that area to its right or left, then convert that Z-score to the original distribution.
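Here is a sketch of the whole process, with Python's `statistics.NormalDist` standing in for the printed table. The exam-score distribution (mean 70, standard deviation 10) and the interval from 60 to 85 are hypothetical, and the conversion used is the usual one: subtract the mean, then divide by the standard deviation.

```python
from statistics import NormalDist

mu, sigma = 70, 10  # hypothetical exam scores
a, b = 60, 85       # find the probability that a score falls between a and b

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Convert each endpoint to a Z-score.
z_a = (a - mu) / sigma
z_b = (b - mu) / sigma

# cdf gives the area to the left of a Z-score, just like most tables.
area_left_of_b = standard_normal.cdf(z_b)
area_right_of_b = 1 - area_left_of_b
area_between = standard_normal.cdf(z_b) - standard_normal.cdf(z_a)

# Reverse direction: find the score at the 90th percentile.
z_90 = standard_normal.inv_cdf(0.90)  # Z-score with 90% of the area to its left
score_90 = mu + z_90 * sigma          # convert back to the original distribution

print(z_a, z_b)
print(round(area_left_of_b, 4), round(area_right_of_b, 4), round(area_between, 4))
print(round(z_90, 4), round(score_90, 2))
```

The area to the right is 1 minus the area to the left, and the area between two Z-scores is the difference of their left-hand areas.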

To review, see section 6.2 of the textbook.

### 3c. Explain how the normal distribution relates to the central limit theorem

• What is the difference between the following problems: "Find the probability that $x$ falls between $a$ and $b$" and "Find the probability that the mean of a sample taken from that distribution falls between $a$ and $b$".
• If a population distribution is not normal, yet you want to find the probability involving the mean of a sample taken from that distribution, what must be true about your sample?

The sampling distribution of the mean is the distribution of the means of samples drawn from a larger distribution, and it has its own mean and standard deviation. If you sample $n$ items from a distribution repeatedly and record the sample means, those sample means themselves will be approximately normally distributed with a mean equal to the population mean and a standard deviation (also called the standard error) equal to the population standard deviation divided by the square root of $n$, provided that one of the following two conditions holds:

• The underlying population distribution is normal
• The sample size is sufficiently large ("large" usually means greater than 30, but this can be flexible if the underlying distribution is close to normal).

The rule above is referred to as the Central Limit Theorem. If you want to find the probability that the mean of a sample falls between two values, you convert those values to Z-scores (but make sure you replace the standard deviation $\sigma$ with the standard error $\frac{\sigma}{\sqrt{n}}$).

${\mu_{\bar{x}}} \approx \mu$

$\sigma_{\bar{x}} \approx \frac{\sigma}{\sqrt{n}}$
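The sketch below contrasts the two kinds of problems: the probability that a single value falls in an interval, and the probability that a sample mean does. All of the numbers (population mean 50, standard deviation 12, sample size 36, interval 48 to 53) are hypothetical, and the single-value calculation assumes the population itself is normal.

```python
import math
from statistics import NormalDist

mu, sigma, n = 50, 12, 36  # hypothetical population mean, standard deviation, sample size
a, b = 48, 53

# Standard error: the standard deviation of the sample means.
standard_error = sigma / math.sqrt(n)

# Probability that ONE value falls between a and b (assumes a normal population).
single = NormalDist(mu, sigma)
p_single = single.cdf(b) - single.cdf(a)

# Probability that the MEAN of a sample of n values falls between a and b.
means = NormalDist(mu, standard_error)
p_mean = means.cdf(b) - means.cdf(a)

print(standard_error, round(p_single, 4), round(p_mean, 4))
```

Sample means vary less than individual values do, so the sample mean is much more likely to land in an interval around the population mean.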

To review, see section 7.2 and section 7.3 of the textbook, as well as Sampling Distribution of the Sample Mean.

### Unit 3 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• normal distribution
• empirical rule
• standard normal distribution
• sampling distribution of the mean
• central limit theorem
• mean
• standard deviation
• standard error

## Unit 4: Sampling and Sampling Distributions

### 4a. Differentiate the population from a sample

• What is the difference between the population and a sample? Why are samples necessary?
• If you're a fan of a sport, you're used to reading about statistics from a particular player or team. Suppose you're looking at a baseball player's batting average for a season. Would this be considered a parameter or a statistic? Should a football game's halftime statistics really be called "halftime parameters"? Why or why not?

The population is the group being studied and the sample is a (hopefully) representative portion of that population. Rarely will we have access to an entire population so we must gather descriptive statistics (mean, standard deviation, etc.) from the sample and use them to infer properties of the population. This last part is called inferential statistics.

Numerical summaries of a population (mean, etc.) are referred to as parameters; numerical summaries of a sample are referred to as statistics.

To review, see section 7.1 of the textbook.

### 4b. Define and apply simple random sampling

• What is simple random sampling and how does it differ from other methods?
• Are there situations where simple random sampling may not be the best sampling method?

Simple random sampling from a population means that each member of a population has an equal chance of being randomly selected.

The probability of $x$ taking on a certain value can be found using equally likely outcomes. If there are 3 red marbles, 2 white and 5 blue, the probability of selecting a red marble at random is 3 out of 10.
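As a small illustration, the marble example can be set up in Python. The sample size of 4 and the seed are arbitrary choices for this sketch. `random.sample` draws without replacement and gives every marble the same chance of being picked.

```python
import random
from fractions import Fraction

# The bag from the example: 3 red, 2 white, and 5 blue marbles.
marbles = ["red"] * 3 + ["white"] * 2 + ["blue"] * 5

# Equally likely outcomes: red marbles divided by all marbles.
p_red = Fraction(marbles.count("red"), len(marbles))

# A simple random sample of 4 marbles (seed fixed so the run is repeatable).
rng = random.Random(1)
sample = rng.sample(marbles, 4)

print(p_red, sample)
```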

To review, see section 1.1 of the textbook.

### 4c. Determine different types of selection bias and sampling errors, and explain how to avoid these errors in survey sampling, such as selection and estimation errors

• Why is it important to be careful to make sure that your sample is representative of the population?
• What types of errors or biases must a researcher be aware of when selecting a sample and running the survey?

There are several types of response bias that you have to be careful about when designing surveys and obtaining samples:

• Order of Questions: The order in which the choices are presented can change the results. For example: "Would you say traffic contributes more or less to air pollution than industry?"
• Results: Traffic, 45%; Industry, 27%
• When the order is reversed, we get different results:
• Results: Industry, 57%; Traffic, 24%
• Misleading Conclusions: Concluding that one variable causes the other when in fact the variables are only correlated or associated. Two variables that may seem linked are smoking and heart rate, but we cannot conclude that one causes the other. Correlation does not imply causality.
• Small Samples: Conclusions should not be based on samples that are far too small.
• Example: Basing a school suspension rate on a sample of only three students
• Loaded Questions: If survey questions are not worded carefully, the results of a study can be misleading.
• 97% yes: "Should the President have the line item veto to eliminate waste?"
• 57% yes: "Should the President have the line item veto, or not?"
• Leading Questions: The wording of the question may be loaded in some way to unduly favor one response over another. For example, a satisfaction survey may ask the respondent to indicate whether she is satisfied, dissatisfied, or very dissatisfied. By giving the respondent one response option to express satisfaction and two response options to express dissatisfaction, this survey question is biased toward getting a dissatisfied response.
• Social Desirability: Most people like to present themselves in a favorable light, so they will be reluctant to admit to unsavory attitudes or illegal activities in a survey, particularly if the survey results are not confidential. Instead, their responses may be biased toward what they believe is socially desirable.

To review, see section 1.2 of the textbook. If you would like additional help, try watching Techniques for Random Sampling and Avoiding Bias.

### 4d. Describe and identify the different sampling methods, including systematic, stratified random, cluster, convenience, panel, and quota sampling, and identify an example of each

• Why would you need to use a method other than simple random sampling when gathering your sample?
• Explain the difference between stratified and cluster sampling. When would you use each?
• Why is it important to choose a good sampling method for your situation? What’s wrong with informal methods like Facebook polls?

Simple random sampling is, as the name suggests, the easiest way of drawing a sample from a population. Each member of the population has an equal chance of being selected. It is often referred to as the "picking names out of a hat" method. Select a piece of paper randomly from a hat, or use a random number generator. There are cases, however, when using a simple random sample can bias your sample.

Stratified sampling is used when your population is non-homogenous and you want to make sure that various groups in the population are proportionally represented in the sample. Suppose you are conducting a survey at your college and you know that males and females will have very different opinions yet you want your sample to be representative of the entire population. You divide the population up into these groups, or strata, and take a simple random sample from each group. If the student population is 60% female and 40% male, and you want to sample 50 students, you should randomly select 30 women and 20 men.
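As a rough sketch of proportional stratified sampling, the 60%/40% example might look like this in Python. The function name `stratified_sample`, the student ID lists, and the population of 1,000 are hypothetical. Rounding each stratum's share is a simplification that will not always add up exactly to the requested total.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Draw a proportional stratified sample.

    strata maps each stratum name to the list of its members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: the stratum's share of the population
        # times the total sample size (rounded to a whole person).
        k = round(total_n * len(members) / population_size)
        # Simple random sample within the stratum.
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical college of 1,000 students: 60% female, 40% male.
strata = {
    "female": [f"F{i}" for i in range(600)],
    "male": [f"M{i}" for i in range(400)],
}
picked = stratified_sample(strata, total_n=50, seed=0)
print({name: len(members) for name, members in picked.items()})
# {'female': 30, 'male': 20}
```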

Cluster sampling is similar to stratified sampling, except it is used when you have several subgroups of a population that are each already heterogeneous and representative of the population. Then you select a simple random sample not of individuals from the entire population, but of entire clusters. Example: A large apartment complex has 1,000 residents living in 10 buildings of 100 people each. If you want to select a sample of 200 residents, and you want to minimize the walking up and down stairs, you randomly select 2 of the 10 buildings and sample every person in them.
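The apartment example could be sketched as follows. The resident labels and the fixed seed are made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed, used only to make the sketch repeatable

# Hypothetical complex: 10 buildings with 100 residents each.
buildings = {b: [f"B{b}-R{r}" for r in range(100)] for b in range(1, 11)}

# Randomly choose 2 whole buildings (the clusters)...
chosen = rng.sample(sorted(buildings), 2)

# ...and include every resident who lives in them.
cluster_sample = [resident for b in chosen for resident in buildings[b]]
print(chosen, len(cluster_sample))
```

Note the contrast with stratified sampling: here the randomness is in which groups are picked, not in who is picked within each group.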

Systematic sampling samples every kth item. This is useful when you know what sample size you want but can only approximate the population size. A classic example is doing a quality inspection on every 20th item to come off an assembly line. This is also sometimes called the "shopping mall" method because of people standing in shopping centers selecting every 10th customer that walks in.
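A minimal sketch of "every kth item" is shown below. It assumes a random starting position within the first k items, which is a common convention rather than something this guide specifies. The function name and the 200-item list are hypothetical.

```python
import random

def systematic_sample(items, k, seed=None):
    """Take every k-th item, starting at a random position among the first k."""
    start = random.Random(seed).randrange(k)
    return items[start::k]

# Hypothetical run of 200 items coming off an assembly line.
line_items = list(range(1, 201))
inspected = systematic_sample(line_items, k=20, seed=3)
print(inspected)  # 10 items, each 20 apart
```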

Convenience sampling is not really a valid method, but is mentioned here for illustration. Convenience means something like saying "whatever, I'll poll my Facebook friends". Any time there is no diversity in the sample, or when the sample is self-selected, it is prone to bias. An example would be if you took a poll to see who should be the next President of the United States, and you sampled people overwhelmingly from one gender, age group, profession, or state, or asked members of a political club you belong to.

If you would like additional help, try watching Techniques for Random Sampling and Avoiding Bias.

### 4e. Use a point estimator from a sample to estimate the entire population

• What is a point estimator for a population parameter? Which statistic is often used?
• What is the purpose of a point estimator?

The point estimator is the starting point when estimating a parameter from the population based on sample statistics, or sample data. In most cases, it is equal to the sample statistic. For example, if you are trying to estimate the population mean, a confidence interval would be based on the point estimate plus or minus a margin of error. This course covers how to calculate these margins of error for finding means and other parameters in Unit 5. The point estimator can also be used to generate a p-value, which gives you a clue on whether or not to reject a particular hypothesis about the population.

If you would like additional help, try watching Confidence Intervals and Margin of Error.

### Unit 4 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• population
• sample
• parameter
• statistic
• descriptive statistics
• inferential statistics
• simple random sampling
• equally likely outcomes
• response bias
• stratified sampling
• cluster sampling
• systematic sampling
• point estimator

## Unit 5: Estimation and Hypothesis Testing

### 5a. Estimate intervals over which the population parameter could exist

• What is a point estimate and how is it used to calculate a confidence interval?
• What is the purpose of a confidence interval?
• What is meant by, say, a 90% confidence interval as opposed to a 99% confidence interval? If we can choose, why not a 100% confidence interval? What would that require?
• Is a confidence interval always symmetric?

Inferential statistics are procedures used to infer properties of the population (parameters) from a sample (statistics) of the data.

A confidence interval is a range of values in which the population parameter is likely to occur. For many statistics, it is made up of the sample statistic plus or minus some margin of error.

Confidence levels can vary, and typically we will refer to 90%, 95%, or 99% confidence intervals. A 95% confidence level, technically speaking, means that if you repeatedly took samples of a particular size and computed the associated confidence interval from each one, about 95% of those intervals would contain the true population mean. Some people interpret a confidence interval as "there's a 95% probability that the population mean is between x and y". That is pretty close in practice, and probably good enough for our purposes, but not technically true.

The first step is to designate a point estimate (such as a sample mean), and then (for means and proportions) to calculate a margin of error and add and subtract that from the point estimate to get your confidence interval.
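As a sketch of those steps for a mean, the code below assumes the population standard deviation $\sigma$ is known, so the margin of error is $z \cdot \sigma/\sqrt{n}$. The function name, the data, and $\sigma = 6$ are all invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(data, sigma, confidence=0.95):
    """Confidence interval for a population mean when sigma is known."""
    n = len(data)
    point_estimate = mean(data)  # the sample mean is the point estimate
    # Critical z value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # margin of error
    return point_estimate - margin, point_estimate + margin

# Hypothetical sample of 36 values; sigma is assumed known to be 6.
data = [98, 102, 101, 97, 105, 99] * 6
low, high = z_confidence_interval(data, sigma=6, confidence=0.95)
print(round(low, 2), round(high, 2))  # 98.37 102.29
```

Raising the confidence level to 99% with the same data gives a wider interval, which is the price of being more confident.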

For other parameters, such as variance, the sampling distribution is not symmetric, so the low and high ends of the interval are not equally distant from the point estimate, so each must be calculated separately.

### 5b. Determine and differentiate between the null and alternative hypotheses in hypothesis testing

1. What is the difference between a null and alternate hypothesis? Which one are we trying to prove is correct/incorrect?
2. What are the three directions that an alternative hypothesis can run in?

A null hypothesis is the default statement in running a hypothesis test. It is signified using the symbol $H_0$ and an equals sign. It is believed to be true, unless there is evidence (in the form of a p-value or critical/test value pair) that it is not.

An alternate hypothesis ($H_a$) is what the tester is trying to show to prove the null hypothesis false. It is generally stated as being not equal to ≠, less than < , or greater than > the null hypothesis.

A good analogy is the adage "innocent until proven guilty". A person is assumed innocent ($H_0$) until proven guilty ($H_a$). "Proof of innocence" is not a standard, and in the same way, we can never prove the null hypothesis true.

A hypothesis test is right-tailed if the alternate hypothesis is greater than the null, left-tailed if the alternate hypothesis is less than the null, and two-tailed if it is different from the null. By "different" we mean significantly different. For example, you might have a null hypothesis where the mean is 100, and the test data might give you a sample mean of 100.5. That is different, but in the context of hypothesis testing, we would ask "is the difference large enough that it cannot reasonably be due to random sampling error?" A two-tailed test combines the greater-than and less-than cases, so its p-value is double the right or left tail area.

To review, see section 9.1 of the textbook.

### 5c. Identify when to use the z- and t-distributions, and use these distributions to find probabilities

• How do you decide whether to use a z-distribution or a t-distribution? What is the difference in the distributions and the processes used? Is more information needed when using a t-distribution? When can you not use either one?
• If the sample size is small, what must be true about the population distribution regardless of what test you use?
• What significance does the Central Limit Theorem have in hypothesis testing?
• What is a degree of freedom and where is it used?

When running a hypothesis test for the mean of a single population, part of the process involves finding a test value, and possibly a critical value.

The standard normal (z) distribution or the "Student's t" (or just t) distribution is used, depending on two factors: whether the population's standard deviation $\sigma$ is known, and what the sample size is. The rule of thumb is to use the t-distribution whenever the standard deviation is unknown and the sample size is small. The generally accepted definition of "small" is less than 30, but you can get away with a slightly smaller sample size if you are certain that the population distribution is normal.

Recall the Central Limit Theorem. The sampling distributions of the mean are normal if you have a large enough sample or the underlying population distribution is normal. Just use caution, since small sample sizes still require a normal distribution. If the population distribution is non-normal or unknown and the sample size is under 30, then neither z nor t may be used!

If a t-distribution is used, we need another parameter known as degrees of freedom. For a one-sample test of the mean, the degrees of freedom equal $n-1$; that is, one less than the sample size. (Proportion tests use the z-distribution, so they have no degrees of freedom.)

Some resources will tell you to use the t-distribution whenever $\sigma$ is unknown, regardless of sample size, because it gives a more conservative estimate. However, as the sample size increases, the difference between z-distributions and t-distributions becomes insignificant anyway. In fact, it can be mathematically shown that the z-distribution is the t-distribution with infinite degrees of freedom. If you look at a t-distribution table where there are infinite degrees of freedom, you will see the critical values used for z-distributions. For sample sizes above 30, the difference between using z-distributions and t-distributions will not be very substantial.

If you truly can't decide between a z-distribution or a t-distribution for a one- or two-means test, then go with a t-distribution. It is always the more conservative option, and will always work as long as the conditions for Central Limit Theorem hold.
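The decision rule in this section can be summarized as a small function. This is one possible reading of the rule of thumb, not an official procedure: when $\sigma$ is unknown and the sample is large, it returns the conservative `"t"` option, even though the text notes that z gives nearly the same answer. The function name and the cutoff parameter are hypothetical.

```python
def choose_distribution(n, sigma_known, population_normal, small_cutoff=30):
    """Return 'z', 't', or 'neither' for a test of a single mean."""
    if n < small_cutoff and not population_normal:
        # Small sample from a non-normal or unknown population:
        # the Central Limit Theorem does not help, so neither applies.
        return "neither"
    if sigma_known:
        return "z"
    # Sigma unknown: t is the conservative choice (close to z for large n).
    return "t"

print(choose_distribution(n=50, sigma_known=True, population_normal=False))   # z
print(choose_distribution(n=12, sigma_known=False, population_normal=True))   # t
print(choose_distribution(n=12, sigma_known=False, population_normal=False))  # neither
```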

To review, see section 9.3 of the textbook.

### 5d. Test hypotheses of the population mean and population proportion using one or two samples

• If you're using the p-value method for a two-tailed test, what must you do (to find the p-value) with the area to the right of the critical value?
• In which instances can you use the Z distribution, in which cases should it be a T distribution, and in what cases can you use neither?
• What is the difference between the two modes of hypothesis testing (critical value vs. p-value)? Which one is more frequently used in real situations? Why do you think this might be?

There are two general methods for running a hypothesis test:

• The critical value method requires finding and comparing the critical value (defined by the significance level $\alpha$, and the number of degrees of freedom if applicable) and the test value (computed from a test-specific formula). The null hypothesis is rejected if the test value falls above the critical value (right-tailed), below it (left-tailed), or if its absolute value exceeds the critical value (two-tailed). If not, it isn't rejected.
• The p-value method (the one used most often in real life) requires finding the test value as described above, then finding the probability (called the p-value) of getting a value above the test value (right-tailed), below it (left-tailed), or farther from zero than its absolute value in either direction (two-tailed). If the p-value is less than $\alpha$, the null hypothesis is rejected; if not, then we fail to reject the null hypothesis.
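Both methods can be seen side by side in a sketch of a one-sample z-test, which assumes $\sigma$ is known. The function name `z_test` and the numbers in the example are hypothetical. The two methods always reach the same decision.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """One-sample z-test; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()
    z = (sample_mean - mu0) / (sigma / sqrt(n))  # test value

    if tail == "right":
        critical = std_normal.inv_cdf(1 - alpha)
        reject_by_critical = z > critical
        p_value = 1 - std_normal.cdf(z)
    elif tail == "left":
        critical = std_normal.inv_cdf(alpha)
        reject_by_critical = z < critical
        p_value = std_normal.cdf(z)
    else:
        # Two-tailed: split alpha between the tails and double the tail area.
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject_by_critical = abs(z) > critical
        p_value = 2 * (1 - std_normal.cdf(abs(z)))

    reject_by_p = p_value < alpha
    return z, critical, p_value, reject_by_critical, reject_by_p

# Hypothetical test: H0 says the mean is 100; sample mean 103, sigma 10, n = 64.
z, critical, p_value, reject_cv, reject_p = z_test(103, 100, 10, 64)
print(round(z, 2), round(critical, 2), round(p_value, 4), reject_cv, reject_p)
# 2.4 1.96 0.0164 True True
```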

When testing for a single population mean, use z- or t-distributions as described above.

When testing for one or two proportions, use the z-distribution. The sampling distribution of the sample proportions can be assumed normal if $np_0 > 5$ and $n(1-p_0) > 5$, where $n$ is the sample size and $p_0$ is the hypothesized population proportion. For two proportions, this must be true for each sample, and each sample must be independent of the other. The t-distribution does not apply to proportion tests.
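A sketch of a two-tailed, one-proportion z-test follows, including the normality check described above. The test statistic $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$ is the usual textbook form. The function name and the survey numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-tailed z-test for one proportion; returns (z, p_value)."""
    # Normality check from the text: n*p0 > 5 and n*(1 - p0) > 5.
    if not (n * p0 > 5 and n * (1 - p0) > 5):
        raise ValueError("sample too small for the normal approximation")
    p_hat = successes / n  # sample proportion (the point estimate)
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical survey: 60 of 100 customers prefer a new design; H0: p = 0.5.
z_prop, p_value_prop = one_proportion_z_test(60, 100, 0.5)
print(round(z_prop, 2), round(p_value_prop, 4))  # 2.0 0.0455
```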

When testing for the difference of means, we typically use the t-distribution for all cases. The z-distribution could be used if both sample sizes are above 30 and the standard deviations are known, but the t-distribution is usually better and more conservative.

For the difference of independent means test, the degrees of freedom has a very complicated formula, but if both sample sizes are large enough, you can generally go with $n_1 + n_2 - 2$. The formula for test value depends on whether or not you assume the standard deviations in the two populations are equal. If they are, you find the pooled (combined) standard deviation and use that instead of the individual values.
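For the equal-standard-deviation case, the pooled calculation can be sketched as below, using the simplified degrees of freedom $n_1 + n_2 - 2$. The pooled formula is the standard textbook one, and the function name and summary statistics are hypothetical.

```python
from math import sqrt

def pooled_t_statistic(mean1, s1, n1, mean2, s2, n2):
    """Test value for two independent means, assuming equal population SDs."""
    df = n1 + n2 - 2
    # Pooled SD: a weighted combination of the two sample variances.
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (mean1 - mean2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    return t, df, pooled_sd

# Hypothetical summary statistics for two independent samples of 40 each.
t_value, df, pooled_sd = pooled_t_statistic(52, 4, 40, 50, 4, 40)
print(round(t_value, 3), df, pooled_sd)  # 2.236 78 4.0
```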

To review, see section 9.4, section 10.1, section 10.3, and section 10.4 of the textbook.

### 5e. Define and apply the significance level, and explain its importance to hypothesis testing

• What is the difference between Type I and Type II error? Can you name examples where each one would be the one considered more serious? How would this affect the level of alpha chosen?
• If the researcher is free to select the significance level ($\alpha$), why not just make it as small as possible? Why not make it zero?

A hypothesis test has two possible conclusions: We either reject the null hypothesis (prove it false) or fail to reject the null hypothesis (fail to prove it false).

A Type I Error (also called alpha-level error) occurs when we incorrectly reject the null hypothesis – that is, the null hypothesis is true, and by random sampling error we happen to get unusual data that "disproves" it. The probability of a Type I error is designated by the experimenter before the data is collected, and signified by the Greek letter alpha, $\alpha$.

A Type II Error occurs when we incorrectly fail to reject the null hypothesis. In other words, there is a difference between the null hypothesis and reality, but the test fails to find that difference. The probability of a Type II error is represented by the Greek letter beta, $\beta$.

The probability of correctly rejecting the null hypothesis, $1-\beta$, is known as the power of the test. A real-life use is testing for a drug's effectiveness. The null hypothesis is usually no effect, and finding an effective drug requires rejecting the null hypothesis. If the drug is effective, the power tells us the probability of finding effectiveness.

The alpha-level is selected beforehand, and has the standard values of 0.01, 0.05, and 0.10. Each of these is the probability of rejecting the null hypothesis when it is actually true. $\alpha$ and $\beta$ are complementary in the sense that lowering the level of $\alpha$ will necessarily raise the level of $\beta$, and vice versa, all else being equal. So, you would typically choose to lower $\alpha$ if a Type I error is more serious, but raise it if Type II error is more serious. This is a balancing act.
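One way to see what $\alpha$ means is a small simulation: draw many samples from a population where the null hypothesis really is true, run a two-tailed z-test on each, and count how often the test wrongly rejects. The population values, sample size, number of trials, and seed below are all arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=42):
    """Fraction of two-tailed z-tests that reject H0 when H0 is true."""
    rng = random.Random(seed)
    mu0, sigma = 100, 15  # hypothetical population; H0: mu = 100 is TRUE
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > critical:
            rejections += 1  # a Type I error: rejecting a true H0
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # should land near alpha = 0.05
```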

$\beta$ is calculated given $\alpha$, the standard deviation, and a particular alternate value for the parameter. Calculation of $\beta$, often expressed as a function of the alternate parameter value, is beyond the scope of this course.

To review, see section 9.2 of the textbook.

### 5f. Compute a test statistic and determine a region of acceptance based on a test statistic

• How would you find a test statistic? How would you find a critical value?
• How would you find the p-value for a hypothesis test by using the test statistic? What special precautions must you take in a two-tailed test?

A test statistic is a value computed from the sample data using a set formula depending on the parameter being tested and distribution used.

A critical value is the cutoff point that has an area $\alpha$ to its left (left-tailed test) or right (right-tailed test). If the test is two-tailed, there are two critical values, one with $\frac{\alpha}{2}$ area to its left and one with $\frac{\alpha}{2}$ area to its right. It is necessary to split $\alpha$ in half because a two-tailed test keeps the same overall significance level but divides it between both tails instead of one. The areas found to the right, left, or outside of the critical values are called the critical regions.
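For the z-distribution, the critical values for each kind of test can be looked up with the inverse of the cumulative distribution function. A short sketch, with a hypothetical function name:

```python
from statistics import NormalDist

def z_critical_values(alpha, tail):
    """Return the z critical value(s) for a 'left', 'right', or 'two' tailed test."""
    std_normal = NormalDist()
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    # Two-tailed: alpha / 2 in each tail, so there are two cutoffs.
    return (std_normal.inv_cdf(alpha / 2), std_normal.inv_cdf(1 - alpha / 2))

for tail in ("left", "right", "two"):
    print(tail, [round(c, 3) for c in z_critical_values(0.05, tail)])
# left [-1.645]
# right [1.645]
# two [-1.96, 1.96]
```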

The critical value hypothesis test will reject the null hypothesis if the test statistic falls inside the critical region(s). This tells us that the sample data is far enough away from the hypothesized mean that there is a good probability that the two are different. If this is not the case (that is, if the test value is closer to the hypothesized mean and thus not in the critical region), we fail to reject the null hypothesis.

The p-value hypothesis test (the one used most often in real life) requires finding the test value as described above, then finding the probability (called the p-value) of getting a value above the test value (right-tailed), below it (left-tailed), or farther from zero than its absolute value in either direction (two-tailed). If the p-value is less than $\alpha$, then the null hypothesis is rejected; if it is not less than $\alpha$, then we fail to reject the null hypothesis. If you use the p-value method in a two-tailed test, find the area to the right of the absolute value of the test value and double it, because you have to test in both tails.

To review, see section 9.4 of the textbook.

### Unit 5 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• point estimate
• confidence interval
• null hypothesis
• standard normal distribution
• T distribution
• degrees of freedom
• pooled (combined) standard deviation
• Type I Error
• Type II Error
• alpha (significance) level
• test statistic
• critical value
• critical value hypothesis test
• p-value hypothesis test

## Unit 6: Correlation and Regression

### 6a. Identify the dependent and independent variables in the linear regression model

• Why would the $x$ variable be called independent and the $y$ value be called dependent?

The independent variable is the input (usually $x$) variable in a linear regression equation. It is called independent because it can be selected by the researcher.

The dependent variable is the output (usually $y$) variable in a linear regression equation. It is called dependent because it cannot be selected and is the result of an equation.

To decide which is which, ask yourself which variable you are trying to predict from the other. The variable being predicted is the dependent variable; the variable you'd have to input is the independent variable.

To review, see section 13.1 of the textbook and Examples of Univariate and Multivariate Data.

### 6b. Calculate the equation of the regression line and plot it

• What does the regression line drawn on a scatter plot tell you?
• What must you do before calculating the regression line? Why is this necessary?

The regression line (usually called the least-squares regression line) is a linear equation in the form $y=a+bx$ that represents the best possible linear fit to the data.

The regression equation will always give you the best fit line, assuming that there is actually a linear fit between $x$ and $y$. You need to calculate the linear correlation coefficient to determine if there is a linear relationship. If the line isn't a good fit, the relationship isn't linear. A relationship between two variables is linear if the dots in the resulting scatter plot make up roughly a straight line, and if you can reasonably use a linear equation $y=a+bx$ to predict $y$ from $x$.

There are formulas given in the book to calculate the correlation coefficient and the regression line slope, but typically you will do this using computer software, apps, websites, or graphing calculators.

To review, see section 13.4 of the textbook.

### 6c. Describe the importance of the correlation coefficient and r-squared, and apply these concepts

• How does the correlation coefficient relate to the slope of the regression line? What do they have in common? Will a correlation of 1 always give you the same sloped line?
• What is the relationship between the correlation coefficient and r-squared? What does each tell you?
• If the points in the scatter plot make a perfect semicircular shape, there is a clear relationship between $x$ and $y$. Does this mean that the correlation r should be close to 1? Why or why not?

A scatter plot is an x-y diagram where all of the data pairs are plotted.

The correlation coefficient ($\rho$) is the measure of linear relationship between $x$ and $y$. The sample correlation coefficient is calculated from the data and represented by the letter $r$. The correlation will always be a number between and including -1 and 1. The closer to 1, the better positive fit (positive sloped line). The closer to -1, the fit is still strong, but is a negative fit (negative sloped line). A correlation of 0 means there is no linear relationship at all. Values of exactly -1, 0, and 1 are very rare in practice. Usually even randomly selected $x$ and $y$ will get you around ±0.1 to 0.2. You will only see a +1 or a -1 if all dots in the scatter plot are perfectly aligned.

R-squared is the square of the correlation coefficient. The interpretation is that this is the proportion of the variation in the dependent variable $y$ that can be explained by the independent variable $x$. So if the correlation $r = 0.6$, then about 36% of the variation in $y$ can be explained by variation in $x$.

To review, see The Correlation Coefficient.

### 6d. Define outlier, identify examples of outliers, and describe what an outlier can do to summaries of data

• What is an outlier? What should be done when encountering one?
• What does it mean for a statistic to be resistant to outliers?
• What is the difference between the mean and median and which should be used when outliers are present?

An outlier is a data point that doesn't seem to fit with the rest. This can occur with univariate (single variable) and bivariate (this unit) data. Outliers can happen for various reasons. They may be true data, or possibly the result of a miscalculation, a mis-measurement, or a data entry error, such as entering a loss of 95 pounds instead of 9.5 in a weight loss study. They cannot simply be discarded. If the statistician is unsure, he/she should contact the researcher or double-check with the data entry person.

Outliers in data will severely affect the mean. In other words, the mean is not resistant to outliers. An unusually high data value will pull the mean above the median and skew the data out to the right, which is why we refer to that data as right-skewed. The same thing happens in reverse if the outlier is well below the rest of the data: the mean will be much lower than the median, in which case the data is left-skewed. In symmetric (not necessarily normally distributed) data, the mean and median are roughly the same and there are either no outliers, or in the case of a bell curve, the outliers are evenly on both sides.
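A quick numeric illustration, using made-up weight-loss values in which the last entry was mistyped as 95 instead of 9.5:

```python
from statistics import mean, median

# Hypothetical weight-loss data (pounds).
correct = [8.0, 10.5, 9.0, 11.0, 9.5]
with_typo = [8.0, 10.5, 9.0, 11.0, 95]  # 9.5 entered as 95

mean_ok, median_ok = mean(correct), median(correct)
mean_bad, median_bad = mean(with_typo), median(with_typo)

print(mean_ok, median_ok)    # 9.6 9.5
print(mean_bad, median_bad)  # 26.7 10.5
```

The single bad entry nearly triples the mean but moves the median by only one pound, which is what it means for the median to be resistant to outliers.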

Because the least-squares regression line is based on distance between data points and the mean, it will not be resistant to outliers, and an outlier can severely affect the equation. In this case, it is best to investigate the outlying data point and whether it's true. If it is a false entry, it may be discardable. If it is a true entry, it may be ignorable. This depends on the study and the researcher.

To review, see section 6.6 of the textbook and Linear Regression and Correlation.

### 6e. Estimate a regression line and identify the effect of the independent variable on the dependent variable

• When finding the correlation coefficient $r$ for a group of bi-variate data, how can we know whether that $r$ value is significant – that is, that there is a good linear fit for the data?
• What is a good way to estimate the line of best fit?

You can estimate the equation of a regression line by eyeballing a line that best fits the data, finding two points on that line, and then using the rules of algebra to find the equation of a line ($y=mx+b$).

The significance of correlation coefficient can be found through a T-test where the test statistic is $T=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ and the degrees of freedom is $n-2$.
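The test statistic above is simple to compute directly. In this sketch the values $r = 0.6$ and $n = 27$ are invented, and the function name is hypothetical.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """Return (T, degrees of freedom) for testing whether r is significant."""
    t = r * sqrt(n - 2) / sqrt(1 - r**2)
    return t, n - 2

t_stat, dof = correlation_t_statistic(r=0.6, n=27)
print(round(t_stat, 2), dof)  # 3.75 25
```

You would then compare this value with the critical value from a t-table for 25 degrees of freedom (about 2.06 for a two-tailed test at $\alpha = 0.05$); since 3.75 is larger, this hypothetical correlation would be judged significant.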

To review, see section 13.2 of the textbook.

### 6f. Draw a scatter plot, find the equation of a least-squares line, and draw a least-squares line

• What is the first step you must do in the process of finding the regression line?
• Why is it called "least squares" regression?
• Could non-linear data still generate a believable regression line? If so, how can we avoid finding a regression line for data that is not linear?

A scatter plot is an x-y diagram where all of the data pairs are plotted. If the data is linear in shape, we can estimate a line of best fit $y=a+bx$, also called a least squares line. We draw a vertical line between each point and the regression line and make that into the side of a square. The equation of the line will be the one for which the sum of the areas of those squares is as small as possible, thus the "least squares" regression line.

There are formulas to calculate the correlation coefficient and the slope of a regression line,

$r = \frac{n\sum xy -\sum x \sum y}{\sqrt{n\sum x^2-(\sum x)^2}\sqrt{n\sum y^2 - (\sum y)^2}}$

$b=r\frac{s_{y}}{s_{x}}$

but usually these tasks will be done using computer software, apps, websites, or graphing calculators.

We will use the formula to calculate the slope, then use the properties of linear equations in algebra to plug in the mean values of $x$ and $y$, as well as the slope $b$, into the equation $\bar{y}=a+b\bar{x}$ to find $a$.
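The whole calculation can be sketched with only the Python standard library, following the formulas in this section: $r$ from the sums, $b = r\frac{s_y}{s_x}$, and $a$ from $\bar{y}=a+b\bar{x}$. The function name and the five data pairs are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Correlation coefficient from the sums.
    r = (n * sum_xy - sum_x * sum_y) / (
        sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)
    )
    b = r * stdev(ys) / stdev(xs)   # slope: b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)     # intercept from y-bar = a + b * x-bar
    return a, b, r

# Hypothetical data pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2), round(r, 4), round(r**2, 2))
# 2.2 0.6 0.7746 0.6
```

For this made-up data the line is $y = 2.2 + 0.6x$, and about 60% of the variation in $y$ is explained by $x$.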

To review, see section 13.4 of the textbook.

### Unit 6 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.

Try to think of the reason why each term is included.

• independent variable
• dependent variable
• regression line
• correlation coefficient
• R-squared