Topic outline

  • Unit 5: Common Statistical Functions

    The most significant advantage of R is probably the availability of functions for statistical analysis. The ultimate goal of most R courses is to give the learners access to this toolbox. Data and derived inference help shape our decisions; hence it is imperative to do the analysis right. This unit introduces built-in R functions for statistical analysis, from summarizing data and applying simple statistical tests to regression analysis. We will also see how to find additional R functions (packages) for certain types of analysis.

    Completing this unit should take you approximately 3 hours.

    • Upon successful completion of this unit, you will be able to:

      • describe commonly used built-in statistical functions;
      • calculate the mean, median, and quantiles of a sample;
      • obtain statistical summaries of data;
      • test for a statistically significant difference between two means using a t-test;
      • test for statistical difference among multiple means using one-way ANOVA; and
      • implement regression analysis in R.
    • 5.1: Single-Sample Summaries

      Single-sample summaries help us to quantify general patterns in the data. We can now represent something we have observed in a histogram or another plot with numbers. For example, the center of a distribution can be estimated as its mean (that is the statistical jargon for "average") or median, while the spread of the distribution can be quantified by its standard deviation or inter-quartile range (difference between the third and first quartiles). This section combines these numeric estimates with plots, so you can better understand what each of those summary statistics means.

      • After plotting the data, the mean and variance are among the basic summaries we want to know. R provides the built-in functions mean, sd, var, and median for these calculations. This video demonstrates the calculations and, using the plot, shows how the results relate to the sample data.
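
        As a minimal sketch of these base-R functions (the sample values below are invented for illustration):

        ```r
        # A small numeric sample (values invented for illustration)
        x <- c(2.1, 3.5, 4.0, 4.4, 5.2, 6.8, 7.3)

        mean(x)                             # arithmetic mean
        median(x)                           # middle value
        sd(x)                               # standard deviation
        var(x)                              # variance (sd squared)
        quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
        IQR(x)                              # inter-quartile range
        ```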

      • Even one variable can tell a story. For example, sample data on personal incomes might show distinct clusters of high- and low-paid workers, and time series of average temperatures may show trends and seasonal cycles. Here you will learn R tools for working with such data by combining your experience with plots and simple statistical summaries.

      • As you already know, for each base-R operation, there are user-contributed alternatives. This video demonstrates the function describe from the package psych, which outputs more statistics than the standard function summary. (You already know how to install and load the package to your R environment.) Be careful, as user-contributed packages might use the same names for their functions. For example, the package Hmisc also has a function describe that produces a different output.
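
        A sketch of the name-clash point, assuming both packages are already installed; the pkg::fun syntax makes explicit which describe() is meant:

        ```r
        # psych::describe outputs richer statistics than base summary().
        # Hmisc also exports a function named describe() with a different
        # layout, so qualifying the call with pkg:: avoids ambiguity.
        if (requireNamespace("psych", quietly = TRUE)) {
          print(psych::describe(mtcars$mpg))  # n, mean, sd, skew, kurtosis, se, ...
        }
        # Hmisc::describe(mtcars$mpg)  # same name, different package, different output
        ```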

      • Functions for individual quantities like mean or median are convenient when we want to use that specific number in further analysis or visualizations, but the function summary and its alternatives are great for exploratory analysis. In the exercise, you can practice both approaches. This exercise does not count toward your grade. It is just for practice!
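
        A short sketch contrasting the two approaches, using the built-in mtcars dataset:

        ```r
        # summary() for a quick overview versus individual statistics for reuse
        data(mtcars)

        summary(mtcars$mpg)          # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

        center <- mean(mtcars$mpg)   # single numbers are easier to reuse,
        spread <- sd(mtcars$mpg)     # e.g. abline(v = center) on a histogram
        c(center = center, spread = spread)
        ```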

      • Finally, the function table counts the number of observations per group. It is most useful when applied to factors, integers, logical values, or strings. It allows you to study group counts and proportions and to identify outliers. This section demonstrates the function and shows how it can be applied to more than one variable.
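
        A sketch of table() on one and two variables, again using the built-in mtcars dataset:

        ```r
        # Counting observations per group with table()
        data(mtcars)

        table(mtcars$cyl)               # cars counted by number of cylinders
        prop.table(table(mtcars$cyl))   # the same counts as proportions
        table(mtcars$cyl, mtcars$gear)  # two-way table: cylinders by gears
        ```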

    • 5.2: The t-test

      The Student's t-test is one of the most frequently used tests in statistics. It is applied to infer whether the population mean differs from some reference value or whether two populations differ. Moreover, the population parameter can be not just the overall average but some other parameter like a regression coefficient quantifying the relationships between variables – this greatly extends the applicability of the test beyond comparing just the population means. For example, we will use the t-test again in the linear regression section when assessing the regression coefficients' statistical significance.

      • The t-test is quite simple, and the base-R functionality will likely be sufficient for all your related calculations. This section introduces the plots and testing functions that help us to conduct the inference based on the t-test and its nonparametric alternative, the Wilcoxon (or Mann-Whitney) test.
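
        A minimal sketch of these base-R tests, using the built-in sleep dataset (not necessarily the example used in the section):

        ```r
        # t-test and its nonparametric alternative on the built-in sleep data
        data(sleep)
        g1 <- sleep$extra[sleep$group == 1]
        g2 <- sleep$extra[sleep$group == 2]

        t.test(g1, mu = 0)    # one-sample: does the mean of group 1 differ from 0?
        t.test(g1, g2)        # two-sample: do the two group means differ?
        wilcox.test(g1, g2)   # Wilcoxon (Mann-Whitney) nonparametric alternative
        ```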

      • Fortunately, the t-test calculations can be modified for the cases when the assumption of equal variances across groups is violated. In other words, Welch's version of the t-test accounts for unequal variances. This video demonstrates the test application in R and the relevant options for implementing it.
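
        A sketch of the relevant option, with simulated data so the unequal variances are known by construction:

        ```r
        # The var.equal argument switches between Welch's and Student's t-test
        set.seed(1)                     # for a reproducible illustration
        x <- rnorm(30, mean = 0.0, sd = 1)
        y <- rnorm(30, mean = 0.5, sd = 3)

        t.test(x, y)                    # default: Welch's test (var.equal = FALSE)
        t.test(x, y, var.equal = TRUE)  # classic test assuming equal variances
        ```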

      • The greater the difference between the compared quantities and the more observations we have, the more confident we (and the t-test) can be that the observed differences are not just due to random chance but are true, statistically significant differences. Even if the means of two populations differ, the t-test might not detect it if the difference is small or the sample is small. The probability that the t-test will detect the difference for a given sample size and variability is the power of the test, and it can be calculated in R. We prefer high power and often use the desired power, confidence level, and expected variability to identify the required sample size.
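
        A sketch of both directions of the calculation with the base-R function power.t.test (the effect size and variability below are illustrative):

        ```r
        # Power analysis for a two-sample t-test
        power.t.test(n = 20, delta = 0.5, sd = 1)       # power at n = 20 per group
        power.t.test(power = 0.8, delta = 0.5, sd = 1)  # n per group for 80% power
        ```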

      • In this exercise, you will use the t-test and Wilcoxon test to compare the Examination rates across the two groups. This exercise does not count toward your grade. It is just for practice!

    • 5.3: One-Way ANOVA

      We use the analysis of variance (ANOVA) method to compare means across more than two groups. Pairwise tests can be valid for many groups only when an adjustment for multiple testing is used (for example, see the function p.adjust). ANOVA avoids the multiple testing problem by applying the global F-test for any difference among group means. Functions in this section apply to the cases when, again, we have a grouping variable coded in R as a factor. In statistical texts, this variable is also called a factor (comprising several levels or, in the case of a single factor, treatments), meaning that this variable contributes to the differences in means across the groups. For example, physical activity could be a factor affecting the blood pressure (response variable) of a person measured in a clinical study, with the factor levels "No exercise", "Minor exercise", and "Intensive exercise".

      • This section introduces base-R functionality for the one-way ANOVA. "One-way" means that only one factor variable is used, such as in the case of BloodPressure ~ ExerciseLevel. Be aware that when two or more factors are used, contributed functions like car::Anova are preferred because they offer different types of the F-test and conduct inference that does not depend on the order in which the factors appear in the R formula.
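
        A minimal base-R sketch using the built-in PlantGrowth dataset, where the grouping variable is already a factor:

        ```r
        # One-way ANOVA with base R: plant weight by treatment group
        data(PlantGrowth)

        fit <- aov(weight ~ group, data = PlantGrowth)
        summary(fit)   # global F-test for any difference among the group means
        ```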

      • This video shows the implementation of ANOVA in the packages car and afex. It probably makes sense to start using one of these packages for ANOVA analysis instead of the base-R functions aov and anova, even if you have only one factor variable to start with. The video also covers a range of post-hoc tests used to identify between which groups the statistically significant differences occur. These tests are useful, and the analyst does not need a significant global ANOVA test to start using them – just remember to use an adjustment for multiple testing.
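
        As a sketch of post-hoc comparisons with adjustment for multiple testing, shown here with base-R functions (car and afex offer related functionality, especially for multi-factor designs):

        ```r
        # Post-hoc pairwise comparisons after a one-way ANOVA
        data(PlantGrowth)
        fit <- aov(weight ~ group, data = PlantGrowth)

        TukeyHSD(fit)   # all pairwise differences, adjusted for multiple testing
        pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "holm")  # pairwise t-tests, Holm adjustment
        ```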

      • In this practice exercise, you will use the built-in dataset iris to test whether the Sepal.Length differs by Species. This exercise does not count toward your grade. It is just for practice!

    • 5.4: Linear Regression

      This section introduces the concepts of statistical modeling and linear regression. Most statistical models fit within this general regression framework, including the t-tests and ANOVA models from the previous sections. Learn this framework well; it will serve as the basis for your more complex models and methods.

      • Models are simplified representations of reality based on available observations. Both the observations and our assumptions about the form of the existing relationships affect the model we get as an outcome of the analysis. Here you will learn the general approach for specifying and estimating a linear model in statistics.
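
        A minimal sketch of specifying and estimating a linear model in R, using the built-in mtcars dataset (not necessarily the section's own example):

        ```r
        # Fitting a simple linear model: fuel efficiency as a function of weight
        data(mtcars)

        fit <- lm(mpg ~ wt, data = mtcars)  # response ~ predictor
        summary(fit)                        # coefficients, their t-tests, R-squared
        coef(fit)                           # intercept and slope as numbers
        ```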

      • While R makes the model fitting process extremely easy, several steps or implicit decisions go into it. For example, one may choose to keep or remove extreme observations (outliers) and select the optimization algorithm. This exercise demonstrates the effects of these decisions on the modeling outcomes.

      • One of the best tools to check the quality of a model is to plot things. This section shows how to visualize modeling results and the unmodeled remainder (residuals) to diagnose the model. Remember that residuals should not have any remaining pattern and should look randomly scattered. If there is a remaining pattern, try to include it in your model (that is, respecify the model), then reestimate the model and visualize the new residuals.
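
        A sketch of the basic residual diagnostics, again assuming the illustrative mtcars model:

        ```r
        # Visual diagnostics for a fitted linear model
        data(mtcars)
        fit <- lm(mpg ~ wt, data = mtcars)

        plot(fitted(fit), resid(fit))  # residuals vs fitted: no pattern should remain
        abline(h = 0, lty = 2)

        par(mfrow = c(2, 2))           # base R's four standard diagnostic plots
        plot(fit)
        par(mfrow = c(1, 1))
        ```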

      • You should get used to checking model quality visually. Look for inconsistencies between the data cloud pattern and the fitted lines, for patterns in the residuals, and for outlying observations. These exercises give examples and suggest R functions you can use for these tasks. They do not count toward your grade. They are just for practice!

      • Formulas are the R versions of statistical equations passed to the R functions for estimation. We use formulas to specify the models, such as what terms the model will have and their transformations. This section introduces various options for specifying a model using formulas. Pay attention to specifying the intercept and interactions of variables.
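
        As a sketch of common formula patterns (x, z, and y below are placeholder names, not variables from the course data):

        ```r
        # Common formula patterns
        y ~ x                 # intercept plus x
        y ~ x + z             # two predictors
        y ~ x * z             # main effects plus interaction; same as x + z + x:z
        y ~ x - 1             # remove the intercept
        y ~ log(x) + I(x^2)   # transformations; I() protects arithmetic operators
        ```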

      • We often keep the intercept in the model even if it is not statistically significant because our main focus is usually on the effect of other variables expressed in their coefficients. However, there are cases when we need to remove the intercept to obtain the so-called "regression through the origin". Also, we might need to model the combined effect of two factors using the interaction term (for example, to model how light and water conditions affect plant growth). These exercises let you practice these cases and suggest comparing alternative models. They do not count toward your grade and are just for practice!
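
        A sketch of both cases, using mtcars as a stand-in dataset (the light-and-water example above would follow the same pattern):

        ```r
        # Regression through the origin and an interaction of two factors
        data(mtcars)

        fit0 <- lm(mpg ~ wt - 1, data = mtcars)  # intercept removed
        fit1 <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)  # interaction
        anova(fit1)  # does the interaction contribute beyond the main effects?
        ```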

    • Unit 5 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count toward your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.