  • Unit 2: Basic Object Types and Operations in R

    Data surround us in every shape and form in our daily life: not just as numbers we may see in a weather report but also as sounds, text, and images. By representing data in a standard way in R, we can apply various R functions for data analysis. This unit introduces commonly used data types and explains how the data can be organized as objects in our coding environment. We also discuss how to subset and join several such objects and change the object type.

    Completing this unit should take you approximately 3 hours.

    • Upon successful completion of this unit, you will be able to:

      • describe different types of objects in R (vectors, matrices and arrays, lists, and data frames);
      • identify the type of an object in R and its dimensions (size);
      • create, concatenate, and subset (index) different R objects; and
      • convert objects from one type to another.
    • 2.1: Data Types

      R's variety of data types most likely covers all possible statistical analysis needs. This section introduces the common types and provides more details on strings and factors.

      • These are the data types we encounter in everyday work in R. You should learn about their differences and how to access their basic attributes. Most often, we need to know whether the data are in the correct format (such as numeric instead of character) and the size of the R object (use functions length(x) and dim(x) for that).
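
        As a minimal sketch of these checks (the object names x, s, and m are made up for illustration), the type and size of an object can be inspected like this:

        ```r
        x <- c(1.5, 2, 3.25)          # numeric (double) vector
        s <- c("a", "b")              # character vector
        m <- matrix(1:6, nrow = 2)    # integer matrix with 2 rows and 3 columns

        class(x)        # "numeric"
        typeof(x)       # "double"
        class(s)        # "character"
        is.numeric(s)   # FALSE: text must be converted before doing arithmetic

        length(x)       # 3 elements
        length(m)       # 6 elements in total
        dim(m)          # 2 3 (rows, columns)
        dim(x)          # NULL: a plain vector has no dim attribute
        ```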

      • Now let's try a quick practice exercise creating variables. This exercise does not count toward your grade. It is just for practice!

      • Although not all of us are linguists or text analysts, R functions for operating with text strings are still useful. They will come in handy when you need to match records in the data or select a portion of a textual record (for example, only the first name and not the surname). This section covers the basics of these operations.
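
        A short sketch of such string operations in base R, using made-up records (full_name is a hypothetical example, not course data), might look like this:

        ```r
        full_name <- c("Ada Lovelace", "Alan Turing")   # hypothetical records

        nchar(full_name)                  # number of characters: 12 11
        substr(full_name, 1, 3)           # first three characters: "Ada" "Ala"

        # Keep only the first name by deleting everything from the first space on
        first_name <- sub(" .*$", "", full_name)
        first_name                        # "Ada" "Alan"

        # Match records: which names contain "Turing"?
        grepl("Turing", full_name)        # FALSE TRUE

        # Split each record into its words (returns a list)
        parts <- strsplit(full_name, " ")
        parts[[2]]                        # "Alan" "Turing"
        ```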

      • This practice aims to make you more familiar with the string format and operations we often need to match or subset strings while preparing data for analysis. Try to solve these string operations tasks. This exercise does not count toward your grade. It is just for practice!

      • Factors are how R stores categorical variables. For example, treatment levels in ANOVA (analysis of variance) are factors, and months or quarters of the year can be represented as factors for modeling seasonality. You should learn how to create factors and how to rename and reorder factor levels, both for convenience and for correct analysis (for example, the control treatment should usually be the first level of a factor because, by default, linear models compare the other levels to the first one).
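
        The following sketch illustrates creating a factor, moving a reference level to the first position, and renaming a level. The treatment labels are invented for the example:

        ```r
        treatment <- c("drugA", "placebo", "drugB", "placebo", "drugA")  # made-up data

        f <- factor(treatment)
        levels(f)                 # sorted by default: "drugA" "drugB" "placebo"

        # Make the reference treatment the first level
        f_ref <- relevel(f, ref = "placebo")
        levels(f_ref)             # "placebo" "drugA" "drugB"

        # Rename a level; the data values follow the new label
        levels(f_ref)[levels(f_ref) == "placebo"] <- "control"
        levels(f_ref)             # "control" "drugA" "drugB"

        # Alternatively, set the full level order when creating the factor
        f_ord <- factor(treatment, levels = c("placebo", "drugA", "drugB"))
        ```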

      • In these exercises, you will practice the factor operations needed to run an analysis of variance (ANOVA) or draw a boxplot. Also, try applying the function table(X) to some factor X in your R environment: it quickly counts the number of occurrences of each level of X. These exercises do not count toward your grade. They are just for practice!
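
        For instance, with a small invented factor (season is a hypothetical name), table() could be used as follows:

        ```r
        season <- factor(c("winter", "summer", "winter", "spring", "winter"))

        counts <- table(season)   # one count per factor level
        counts

        names(counts)             # "spring" "summer" "winter"
        counts[["winter"]]        # 3
        sum(counts)               # 5, the total number of observations
        ```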

    • 2.2: Vectors

      A vector is the simplest data structure in R and a building block for more complex objects. You should learn how to create a vector and how to subset (select) its elements.

      • This section introduces the basic operations on vectors, most of which are done element-wise. Please pay attention to the recycling of vectors (recycling is silent when the longer vector's length is a multiple of the shorter one's, so unintended recycling is easy to miss), missing values (NA), and logical vectors, which are often used for data subsetting.
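
        A minimal sketch of these three points, with made-up vectors a, b, and v:

        ```r
        a <- 1:6
        b <- c(10, 20)
        a + b                      # b is recycled silently: 11 22 13 24 15 26
        # a + c(10, 20, 30, 40) would still run, but with a warning,
        # because 6 is not a multiple of 4.

        v <- c(4, NA, 7, 1)
        mean(v)                    # NA: missing values propagate
        mean(v, na.rm = TRUE)      # 4: NA removed before averaging
        is.na(v)                   # FALSE TRUE FALSE FALSE

        # Logical vector used for subsetting: non-missing values greater than 2
        keep <- !is.na(v) & v > 2
        v[keep]                    # 4 7
        ```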

      • The type of your data in R can be changed. Sometimes a function you apply changes the type internally while the data object you supplied remains unaffected. For example, if x is a character object, lm(y ~ x) will treat x as a factor, but x will remain of type character in the R environment. Similarly, to count the total or proportion of certain instances using a logical vector LV, you can apply sum(LV) or mean(LV), knowing that these functions treat the logical values TRUE and FALSE as 1 and 0. Please pay attention to these coercion rules.
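
        These coercion rules can be tried out with a few lines of R. The data below are invented for illustration:

        ```r
        LV <- c(TRUE, FALSE, TRUE, TRUE)
        sum(LV)                    # 3: number of TRUE values
        mean(LV)                   # 0.75: proportion of TRUE values

        # Mixing types in one vector coerces everything to the most general type
        mixed <- c(1, "a", TRUE)
        mixed                      # "1" "a" "TRUE"

        # Explicit conversion; text that is not a number becomes NA (with a warning)
        nums <- suppressWarnings(as.numeric(c("3.5", "10", "oops")))
        nums                       # 3.5 10.0 NA

        # lm() treats a character predictor as a factor internally,
        # but the object x itself is left unchanged
        x <- c("a", "b", "a", "b")
        y <- c(1, 2, 1.5, 2.5)
        fit <- lm(y ~ x)
        names(coef(fit))           # "(Intercept)" "xb"
        class(x)                   # still "character"
        ```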

      • This exercise shows how easy it is to work with vectors in R and modify and reorganize the data in vectors. In the exercise, you will create and manipulate a vector, then save elements 5-10 of your vector as a new (separate) vector. This exercise does not count toward your grade. It is just for practice!

    • 2.3: Arrays and Matrices

      Matrices have two dimensions, while arrays can have any number of dimensions (a matrix is simply a two-dimensional array). For example, a digital color image of 100×100 pixels can be represented by an array with dimensions 100×100×3, where 3 corresponds to the red, green, and blue (RGB) color channels. All data in a matrix or array are of one type (possibly with missing values, NA), which makes these objects efficient to work with.
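
      As a rough sketch of the image example, the array below is an all-zero placeholder rather than a real image, and the names m and img are illustrative:

      ```r
      m <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3 matrix, filled column by column
      dim(m)                                 # 2 3
      is.matrix(m)                           # TRUE
      is.array(m)                            # TRUE: a matrix is a 2-D array

      img <- array(0, dim = c(100, 100, 3))  # placeholder "image": rows x columns x RGB
      dim(img)                               # 100 100 3

      img[, , 1] <- 1                        # set the whole first (red) channel to 1
      img[10, 20, ]                          # the three channel values of one pixel: 1 0 0
      dim(img[, , 1])                        # 100 100: a single channel is a matrix
      ```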

      • The video demonstrates the differences between matrices and arrays and how the elements of these objects can be accessed or subset.

      • An array can be considered as a multiply subscripted collection of data entries. This section provides details on the construction and manipulation of arrays.

      • This section provides details on the construction and manipulation of matrices, including matrix facilities (such as matrix multiplication) that differ from the typical element-wise operations.
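
        To illustrate the difference between element-wise and matrix operations, here is a small sketch with a made-up 2×2 matrix A:

        ```r
        A <- matrix(1:4, nrow = 2)   # filled by column: rows are (1, 3) and (2, 4)

        elementwise <- A * A         # element-wise product: rows (1, 9) and (4, 16)
        matprod     <- A %*% A       # matrix product:       rows (7, 15) and (10, 22)

        t(A)                         # transpose: rows (1, 2) and (3, 4)
        A_inv <- solve(A)            # matrix inverse (A is invertible: det(A) is -2)
        A %*% A_inv                  # identity matrix, up to rounding error
        ```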

      • These exercises test your knowledge of creating, accessing, and manipulating arrays and matrices. These exercises do not count toward your grade. They are just for practice!

    • 2.4: Lists and Data Frames

      List objects in R allow us to combine data of different types and sizes. The list is the most flexible R object; however, the user typically pays for this flexibility with reduced performance (slower computations) or the excessively verbose code needed to refer to a specific element in a list. This section introduces the base-R list implementations (list and data.frame) and contributed modifications of the data.frame format (tibble and data.table). These modifications address the issues of computing inefficiency and code repetitiveness. The goal is to get familiar with each format (data.frame, tibble, and data.table).

      • Lists are used to hold elements of different sizes and types, such as outputs of a regression model fit or results of a statistical test. However, if we restrict list elements to vectors of the same length, we get a data.frame. The data.frame structure is in between a matrix (a data.frame has columns and rows and can be indexed as a matrix) and a list (each column in a data.frame is a list element and can be indexed accordingly). The data.frame structure is convenient for holding typical spreadsheet data, where each column can be of a different type, for example, Date (Date class), Location (character type), and Temperature (numeric type).
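
        The sketch below builds a list and a data.frame with the three columns mentioned above and indexes them both ways. All values are invented for illustration:

        ```r
        # A list can hold elements of different types and lengths
        res <- list(name = "test", estimate = 1.8, values = c(2, 4, 6))
        res$values                     # 2 4 6
        res[["estimate"]]              # 1.8

        # A data.frame is a list of equal-length columns
        weather <- data.frame(
          Date        = as.Date(c("2024-01-01", "2024-01-02")),
          Location    = c("Oslo", "Lima"),
          Temperature = c(-3.5, 24.1),
          stringsAsFactors = FALSE     # keep Location as character in any R version
        )

        dim(weather)                   # 2 3
        sapply(weather, class)         # "Date" "character" "numeric"

        weather[2, "Temperature"]      # matrix-style indexing: 24.1
        weather$Location               # list-style indexing: "Oslo" "Lima"
        weather[["Temperature"]][1]    # list-style, then element: -3.5
        is.list(weather)               # TRUE
        ```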

      • We'll practice with the data frame format, which is the usual format for storing information on different variables. We'll practice the extensions of this format later. Use the cats data frame to solve these challenges:

        cats <- data.frame(coat = c("calico", "black", "tabby"),
                            weight = c(2.1, 5.0, 3.2),
                            likes_string = c(1, 0, 1))

      • The tibble format comes from the tidyverse family of packages and attempts to make operations with data.frame-like structures more user-friendly. The tidyverse conveniently bundles several popular packages, such as ggplot2 for plotting and dplyr for data manipulation. You can convert a data.frame to a tibble and back if needed.
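
        A minimal sketch of the conversion in both directions, assuming the tibble package is installed (the code is skipped otherwise) and reusing the cats data from the practice above:

        ```r
        cats <- data.frame(coat = c("calico", "black", "tabby"),
                           weight = c(2.1, 5.0, 3.2),
                           likes_string = c(1, 0, 1))

        if (requireNamespace("tibble", quietly = TRUE)) {
          cats_tbl <- tibble::as_tibble(cats)   # data.frame -> tibble
          print(class(cats_tbl))                # "tbl_df" "tbl" "data.frame"
          print(cats_tbl)                       # compact printout with column types

          cats_df <- as.data.frame(cats_tbl)    # tibble -> data.frame
          print(class(cats_df))                 # "data.frame"
        }
        ```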

      • Try working with tibbles in these exercises. Remember, tibbles are just another form of the data.frame format. Do you find tibbles more convenient to use? This exercise does not count toward your grade. It is just for practice!

      • The data.table format also helps shorten code when working with data.frame structures. Most importantly, data.table handles big data very efficiently. You can convert a data.frame to a data.table and back if needed.
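
        A comparable sketch for data.table, assuming the data.table package is installed (the code is skipped otherwise) and that it runs at the top level of an R session. The filter on weight is just an illustrative query:

        ```r
        cats <- data.frame(coat = c("calico", "black", "tabby"),
                           weight = c(2.1, 5.0, 3.2),
                           likes_string = c(1, 0, 1))

        if (requireNamespace("data.table", quietly = TRUE)) {
          cats_dt <- data.table::as.data.table(cats)   # data.frame -> data.table
          print(class(cats_dt))                        # "data.table" "data.frame"

          # Column names can be used directly inside [ ]: rows with weight above 3
          heavy <- cats_dt[weight > 3]
          print(heavy)                                 # the black and tabby cats

          cats_df <- as.data.frame(cats_dt)            # data.table -> data.frame
          print(class(cats_df))                        # "data.frame"
        }
        ```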

      • This is the final practice of data frame formats. Now you should be familiar with the three main ones: data.frame, data.table, and tibble. Repeat the practice tasks for tibbles now using the data.table format instead. This exercise does not count toward your grade. It is just for practice!

    • Unit 2 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count toward your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.