The Data Science Lifecycle

To close this section, we present the data science pipeline from a slightly different angle. No single workflow fits every data science problem, so seeing another perspective on how such problems are solved can round out your understanding of the field.

Simulation and Data Design

Exercises

  1. In cluster sampling, the population is divided into non-overlapping subgroups, which tend to be smaller than strata. The sampling method is to take a simple random sample of the clusters and include all of the units in each chosen cluster in the sample. Use our urn analogy to express cluster sampling. As a simple example, suppose our population of 7 starship prototypes is placed into 4 clusters as follows: (A,B) (C,D) (E,F) (G). Suppose we take an SRS of 2 clusters.
  1. List all of the possible samples that might result.
  2. What is the chance that A is in a sample?
  3. What is the chance that A, C and E are in the sample?

Cluster sampling has the distinct advantage of making sample collection easier. For example, it is much easier to poll 100 homes of 2-4 people each than to poll 300 individuals. However, because people in a cluster tend to be similar to each other, we need to keep the sampling procedure in mind when we generalize from the sample to the population.
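One way to explore the cluster sampling exercise is to enumerate the urn draws directly. The sketch below is only an illustration, not a prescribed solution: it assumes the urn holds one marble per cluster, draws 2 marbles without replacement, and keeps every unit in each drawn cluster. The helper names `cluster_samples` and `chance` are hypothetical and introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Clusters from the exercise: 7 prototypes grouped into 4 clusters.
clusters = [("A", "B"), ("C", "D"), ("E", "F"), ("G",)]


def cluster_samples(clusters, n_clusters):
    """Enumerate every equally likely cluster sample.

    Each combination of n_clusters clusters is one possible SRS of
    clusters; the sample consists of all units in the drawn clusters.
    """
    samples = []
    for drawn in combinations(clusters, n_clusters):
        units = sorted(unit for cluster in drawn for unit in cluster)
        samples.append(tuple(units))
    return samples


def chance(samples, units):
    """Chance that all of `units` appear together in one sample,
    assuming every sample in `samples` is equally likely."""
    hits = sum(1 for sample in samples if set(units) <= set(sample))
    return Fraction(hits, len(samples))


samples = cluster_samples(clusters, 2)
for sample in samples:
    print(sample)

print("P(A in sample):", chance(samples, ["A"]))
print("P(A, C and E in sample):", chance(samples, ["A", "C", "E"]))
```

Because every pair of clusters is equally likely under an SRS of clusters, counting the samples that contain the units of interest and dividing by the total number of samples gives the chance exactly.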

  1. Systematic sampling is another popular technique. To start, the population is ordered, and the first unit is selected at random from the first k elements. Then, every k^{th} unit after that is placed in the sample. As a simple example, suppose our population of 7 prototypes is ordered alphabetically and we select one from the first two, A and B, at random, and then every second element after that.
  1. List all of the possible samples that might result.
  2. What is the chance that A is in the sample?
  3. What is the chance that A and B are in the sample? A and C?

An intercept survey is a popup window that asks you to complete a brief questionnaire. If every k^{th} visitor to a website is asked to complete a brief survey, then we have a systematic sample. Here the population consists of visits to the site, and the ordering for systematic sampling is the order of the visits. It seems reasonable to imagine that this ordering wouldn't introduce a selection bias in the sampling process.
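The systematic sampling procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the population is the 7 alphabetically ordered prototypes from the exercise, and k = 2 is assumed from "every second element". The helper names `systematic_samples`, `draw_systematic_sample`, and `chance` are hypothetical.

```python
import random
from fractions import Fraction

# Population from the exercise, already in alphabetical order.
population = list("ABCDEFG")


def systematic_samples(population, k):
    """Enumerate the k possible systematic samples, one per starting position."""
    return [tuple(population[start::k]) for start in range(k)]


def draw_systematic_sample(population, k, rng=random):
    """Draw one systematic sample: pick a random start among the
    first k units, then take every k-th unit after it."""
    start = rng.randrange(k)
    return tuple(population[start::k])


def chance(samples, units):
    """Chance that all of `units` appear together in one sample,
    assuming every sample in `samples` is equally likely."""
    hits = sum(1 for sample in samples if set(units) <= set(sample))
    return Fraction(hits, len(samples))


samples = systematic_samples(population, 2)  # k = 2 is an assumption
for sample in samples:
    print(sample)

print("P(A in sample):", chance(samples, ["A"]))
print("P(A and B in sample):", chance(samples, ["A", "B"]))
print("P(A and C in sample):", chance(samples, ["A", "C"]))
print("One random draw:", draw_systematic_sample(population, 2))
```

Only the starting position is random, so there are just k possible samples no matter how large the population is. The same slicing idea applies to an intercept survey, where the ordered population would be the sequence of site visits.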