BUS204 Study Guide
Unit 1: Introduction to Statistical Analysis
1a. Explain the importance of statistics to business
- Why is it important for business students to study statistics?
Although you will rarely perform statistical studies or calculate conclusions from surveys yourself in the business world, data is everywhere, and the business community is no exception. Is the increase in sales last year significant? Which of five laptop colors do consumers most prefer? If you're a business manager, you WILL be involved in data-based decisions, and you will likely be working with or supervising marketing or market research staff. In such cases, knowing at least some statistical concepts will be vital for judging how well those teams are performing.
Also, when reading the results of statistical studies, you should understand concepts such as the p-value (the probability of seeing results at least as extreme as those observed if the hypothesis being tested were true, which tells you how strong the evidence against that hypothesis is) so that you can properly use those results to make sound business decisions.
In a lot of ways, using statistics in business is similar to doctors running tests on patients. The test results themselves are usually computer-generated. The lab staff run the tests, but it takes a trained physician to interpret the results accurately. That will be similar to your role – you'll have to be able to read statistical results and know how best to use them.
Review Why Do We Need to Study Statistical Analysis as Part of a Business Program? for more details.
1b. Explain the differences between quantitative and qualitative data, and identify examples of each type of data
- What is the difference between quantitative and qualitative data? What is another name for each?
- Is a phone number considered quantitative or qualitative? It's made up of numbers so it must be quantitative, right? Same thing for a ZIP code.
In statistics, the difference between various data types is important, since it affects which types of tests and graphs are more useful for certain kinds of questions. Quantitative (also known as numeric) data is data that is mathematical in nature. With this kind of data, we can find the sum, average/mean, and other statistics that involve mathematical operations. Qualitative (also known as categorical) data is non-numeric. This kind of data is made of text, letters, words, or even digits, but in these cases, no mathematical operations are possible or make sense.
For example, consider phone numbers and ZIP codes. Although they are made of digits (and sometimes dashes or parentheses), it would not make sense to do calculations with them. Of all the phone numbers stored in your phone, you would never want to find an "average phone number" or a "total ZIP code". For this reason, phone numbers and ZIP codes are considered qualitative data.
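The ZIP code point can be checked with a short Python sketch. The ZIP codes below are made-up sample data, and the variable names are only illustrative. Counting how often each ZIP code appears gives a useful answer, while averaging them gives a number with no useful interpretation.

```python
from collections import Counter

# Hypothetical customer ZIP codes, stored as text so leading zeros are kept.
zip_codes = ["02139", "90210", "02139", "60614", "90210", "02139"]

# Counting categories is meaningful for qualitative data.
counts = Counter(zip_codes)
most_common_zip, times_seen = counts.most_common(1)[0]
print(most_common_zip, times_seen)

# Arithmetic is possible only by forcing the labels into numbers,
# and the resulting "average ZIP code" does not describe anything useful.
average_zip = sum(int(z) for z in zip_codes) / len(zip_codes)
print(average_zip)
```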
If you're wondering about the grammar involved, "data" is a plural term that refers to multiple numbers or points. The singular is "datum", but you only see that word used very rarely, since we are almost always looking at multiple numbers when performing statistical analysis.
Review Definitions of Statistics, Probability, and Key Terms for more details.
1c. Define and apply the following terms: data sets, mean, median, mode, standard deviation, variance, population and sample
- What is the difference between a measure of central tendency and a measure of spread?
- In what situations is using the median for a measure of center preferable to using the mean? In what situation is mode the only viable possibility?
- What's the difference between a population and a sample?
- What is the difference between standard deviation and variance?
A set is a collection of any number of items, and a data set is a collection of any number of data points.
Measures of central tendency include the mean, median, and mode, and describe the location of a data set or what a "typical" data point is within that set.
Measures of spread include the variance and standard deviation. They are just as important as measures of center, describing how homogeneous (low spread) or heterogeneous (high spread) the data is. Data with a lower spread is easier to do predictive (inferential) analysis on. Variance is the average of the squared differences between each data point and the mean. We use the square of the difference instead of just the difference so that a data point far above the mean doesn't cancel out a data point far below the mean and make it look as if there were no spread. The square root of the variance is taken to give us the standard deviation. This gets the measure back to the original units of the data. If the original data is in minutes, then the unit of the variance will be "square minutes", which doesn't make sense, so taking the square root gets the unit back to minutes.
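As a rough illustration of the variance and standard deviation calculation, here is a minimal Python sketch. The wait times are made-up numbers and the variable names are only illustrative. It divides by the number of data points, which treats the list as the whole population. Sample formulas divide by one less than the number of data points instead.

```python
import math

# Hypothetical wait times in minutes (made-up data for illustration).
wait_times = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(wait_times) / len(wait_times)

# Raw differences from the mean cancel each other out...
raw_diffs = [x - mean for x in wait_times]

# ...so each difference is squared before averaging.
squared_diffs = [(x - mean) ** 2 for x in wait_times]

# Population variance: average of the squared differences (unit: square minutes).
variance = sum(squared_diffs) / len(wait_times)

# Standard deviation: square root of the variance (unit: minutes again).
std_dev = math.sqrt(variance)

print(mean, sum(raw_diffs), variance, std_dev)  # prints 5.0 0.0 4.0 2.0
```

Notice that the raw differences add up to zero, which is exactly why they cannot be used on their own to measure spread.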
In some cases, you can use either the mean or the median as a measure of center, but if the data is right- or left-skewed (there is a long tail of extreme values at the high or low end of the distribution), the mean can be misleading since it is greatly affected by outliers. The median is more resistant to outliers, so it is preferable when you have skewed (non-symmetric) data, as opposed to a symmetric distribution such as a uniform distribution or bell curve.
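To see how much a single outlier can move the mean, consider this small Python sketch using the standard `statistics` module. The salaries are made-up values chosen so that one of them is far above the rest.

```python
import statistics

# Hypothetical annual salaries in dollars; the last value is an outlier.
salaries = [40_000, 42_000, 45_000, 47_000, 50_000, 400_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values

print(f"mean:   {mean_salary:,.0f}")    # mean:   104,000
print(f"median: {median_salary:,.0f}")  # median: 46,000
```

Five of the six salaries are at or below $50,000, so the median of $46,000 describes a "typical" salary far better than the mean of $104,000.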
The mode is the data point or points that occur most frequently. It is the only measure of center that can be used for qualitative data, since it's the only one that does not involve calculation. In an election with two or more candidates decided by plurality, the winner is the candidate who is the mode of all votes cast, that is, the name that appears most often on the ballots.
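The election example can be sketched in Python as follows. The candidate names and ballots are made up for illustration. The code tallies the votes and picks the most frequent name, which is the mode of this qualitative data set.

```python
import statistics
from collections import Counter

# Hypothetical ballots; the candidate names are made up.
votes = ["Rivera", "Chen", "Rivera", "Okafor", "Chen", "Rivera"]

# Tally how many times each name appears.
tally = Counter(votes)
winner, winning_count = tally.most_common(1)[0]

# statistics.mode works on non-numeric data and gives the same answer here.
mode_of_votes = statistics.mode(votes)

print(winner, winning_count, mode_of_votes)  # prints Rivera 3 Rivera
```

If two candidates were tied for the most votes, the data would have two modes. In that case `statistics.multimode(votes)` returns all of them.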
The population is the entire set of data that is being studied. In most cases, we don't have access to the whole population when we do statistical analysis. Because of this, we use one of several sampling methods to find a representative sample of the population. In each case, it is important to define your population and sample. If you find the average price of pizza in a small city by collecting prices from every pizza restaurant in that city, those restaurants are the population, and the average you compute describes the whole population. However, the same data could also serve as a sample if the population being studied is pizza prices in similarly sized cities in the region.
1d. Summarize and interpret data in a tabular format using frequency distributions and visually with histograms
- What is the difference between a frequency distribution and a relative frequency distribution?
A random variable is a variable whose value is unknown until it is determined by a random process or experiment, which differentiates it from an algebraic variable, whose value is unknown only until you solve for it in an equation.
A random variable can be quantitative or qualitative. A quantitative random variable can be discrete, which means its possible values can be listed individually, often as a small fixed list. A classic example is a six-sided die, whose only possible values are 1, 2, 3, 4, 5, and 6. A quantitative variable can also be continuous, meaning there is an infinite or nearly infinite number of possible values. For example, salaries in a large company might run from $20,000 all the way up to $500,000, meaning we wouldn't be able to conveniently list all possible salaries. Even if there were only 5 people in a company whose salaries ranged from $30,000 to $100,000, there would still be about 70,000 possible whole-dollar values. Even if the salaries were rounded to the nearest $100, we would still have about 700 possible values, which is still far too many to list.
A frequency distribution is a table that lists each possible value of a discrete random variable together with its frequency, the number of times that value occurs in the data. If the variable is continuous, we list intervals of values rather than individual values, since it would not be possible to list all individual values. So, for the salary example above, we could group salaries into intervals of $10,000.
A relative frequency distribution is the same as a frequency distribution, except the total for each data value or interval is divided by the sample size, so that the relative frequencies are expressed as proportions of the whole and thus add to 1.
In the case of continuous variables, it is important to ensure that all intervals are of the same width, that every possible value between the lowest and highest value fits into exactly one interval, and that there are no overlapping intervals.
In a table, if a data value or interval of values has no data points, it must be represented in the table with a frequency of zero.
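The rules above can be sketched in Python. The salaries below are made-up data, and the interval bounds and variable names are assumptions chosen for illustration. Each interval is $10,000 wide, includes its lower bound but not its upper bound so that no value fits into two intervals, and starts with a frequency of zero so that empty intervals still appear in the table.

```python
# Hypothetical salaries in dollars (made-up data for illustration).
salaries = [31_000, 34_500, 38_000, 42_000, 47_500,
            48_000, 52_000, 71_000, 78_000, 95_000]

width = 10_000
low, high = 30_000, 100_000  # the intervals cover 30,000 up to (not including) 100,000

# Start every interval at zero so that empty intervals are still listed.
frequency = {start: 0 for start in range(low, high, width)}
for salary in salaries:
    start = low + ((salary - low) // width) * width
    frequency[start] += 1

# Relative frequency: each frequency divided by the sample size.
n = len(salaries)
relative = {start: count / n for start, count in frequency.items()}

for start in frequency:
    label = f"[{start:,}, {start + width:,})"
    print(f"{label:<20} {frequency[start]:>3} {relative[start]:>6.2f}")
```

In this data set the intervals starting at 60,000 and 80,000 have a frequency of zero but are still printed, and the relative frequencies add to 1.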
See section 2.1 of the textbook for more details.
1e. Describe data sets, create frequency tables, and draw histograms
- What is the difference between a bar graph and a histogram?
- Why is it important for the bars on a histogram to have equal width and spacing?
A histogram is a specific type of bar graph made for quantitative data, and is the graphical representation of a frequency or relative frequency distribution. In a bar graph, you can have qualitative data, so the bars do not need to be in any particular order.
A histogram follows rules similar to those for a frequency distribution: bars must be of equal width, and they must have no spaces between them. If a particular data value or interval doesn't have any data points in it, then a placeholder (or a zero-height bar) must be placed in its location on the histogram. This is done because if you have outliers in your data set, then placing them too close to the rest of the bars on the histogram will make the data seem more homogeneous than it really is. The histogram needs to accurately reflect the shape of the data, so the scale must stay consistent.
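Here is a minimal Python sketch of a text histogram, turned on its side so that each row is one bar. The delivery times are made-up data with one outlier, and the interval bounds are assumptions chosen for illustration. The empty intervals are kept as zero-length bars so the gap before the outlier stays visible.

```python
# Hypothetical delivery times in minutes; 58 is an outlier.
times = [11, 12, 14, 15, 17, 18, 21, 22, 23, 26, 58]

width = 10
low, high = 10, 60  # equal-width intervals: 10-19, 20-29, ..., 50-59

# One counter per interval, including intervals that end up empty.
counts = [0] * ((high - low) // width)
for t in times:
    counts[(t - low) // width] += 1

for i, count in enumerate(counts):
    start = low + i * width
    bar = "#" * count
    print(f"{start}-{start + width - 1} | {bar}")
```

The rows for 30-39 and 40-49 print with no `#` marks. Leaving them out would place the outlier right next to the main group and hide how far away it really is.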
A stem-and-leaf plot is similar to a histogram, except the intervals are defined by the first digit(s) of each number (the stem), which are listed down a column. The remaining digit of each data point (the leaf) is listed to the right of its stem, with equal spacing in every row so that the length of a row shows how many data points fall in that interval.
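A stem-and-leaf plot can be built with a few lines of Python. The exam scores are made-up data, and this sketch assumes two-digit whole numbers, so the tens digit is the stem and the ones digit is the leaf.

```python
# Hypothetical exam scores (made-up, two-digit whole numbers).
scores = [58, 62, 67, 71, 74, 74, 78, 85, 89, 91]

# Create a row for every stem from the lowest to the highest,
# so a stem with no data would still appear as an empty row.
lowest_stem = min(scores) // 10
highest_stem = max(scores) // 10
stems = {stem: [] for stem in range(lowest_stem, highest_stem + 1)}

# Sort first so the leaves in each row are in increasing order.
for score in sorted(scores):
    stems[score // 10].append(score % 10)

for stem, leaves in stems.items():
    row = " ".join(str(leaf) for leaf in leaves)
    print(f"{stem} | {row}")
```

For this data the 7 row holds the leaves 1, 4, 4, and 8, making it the longest row, just as the 70-79 bar would be the tallest on a histogram.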
Unit 1 Vocabulary
This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course.
Try to think of the reason why each term is included.
- measure of center
- measure of spread
- standard deviation
- random variable
- frequency distribution
- relative frequency distribution
- bar graph
- pie chart
- stem-and-leaf plot