The Data Science Lifecycle

To bring this section to a close, we present the data science pipeline from a slightly different angle. No single "one size fits all" approach works for every data science problem, so it helps to see another perspective on the process of solving one. This rounds out your understanding of the field.

Simulation and Data Design

In this chapter, we develop the theory behind the chance processes introduced in the previous chapter. This theory makes the concepts of bias and variation more precise. We continue to motivate the accuracy of our data through the abstraction of an urn model that was first introduced in the previous chapter, and we use simulation studies to help us understand the data and make decisions based on it.

We begin with an artificial example of a small population; it's so small that we can list all the possible samples that can be drawn from the population. Then, we consider simple variations on drawing marbles from the urn to extend the urn model to more complex sampling designs, such as those used in complex surveys.
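
To make the idea of listing every possible sample concrete, here is a minimal sketch in Python. The five-value population and the sample size of 2 are made up for illustration and are not the example used later in the chapter.

```python
from itertools import combinations
from statistics import mean

# Hypothetical tiny population (illustrative values only).
population = [1, 2, 3, 4, 5]
sample_size = 2

# Every possible sample of size 2 drawn without replacement, ignoring order.
samples = list(combinations(population, sample_size))

# The sample mean for each possible sample.
sample_means = [mean(s) for s in samples]

# In a simple random sample, each of these samples is equally likely,
# so the expected sample mean is the plain average over all samples.
expected_sample_mean = mean(sample_means)
population_mean = mean(population)

print(len(samples), expected_sample_mean, population_mean)
```

Because every sample is equally likely, averaging the sample means over the full list gives the expected value of the sample mean, which here matches the population mean.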

Next, we use the urn model as a technical framework to design and run simulation studies to understand larger and more complex situations. We return to some of the examples from the previous chapter and, for example, dive deeper into understanding how the pollsters might have gotten the 2016 Presidential Election predictions wrong. We use the actual votes cast in Pennsylvania to simulate the sampling variation for a poll of 1,400 from six million voters. This simulation helps us uncover how response bias can skew polls and convinces us that collecting a lot more data would not have helped the situation (another example of big data hubris).
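
A simulation of this kind might be sketched as follows. The vote counts below are hypothetical placeholders, not the actual Pennsylvania totals, and the 5% non-response rate, the candidate labels, and the helper name simulate_poll are assumptions made only for illustration. The sketch treats the voters as marbles in an urn and draws each poll without replacement using NumPy's hypergeometric sampler.

```python
import numpy as np

def simulate_poll(n_a, n_b, sample_size, reps, rng):
    """Return the proportion for candidate A in each of `reps` simulated polls.

    Each poll draws `sample_size` voters without replacement from an urn
    holding n_a marbles for candidate A and n_b marbles for candidate B.
    """
    draws = rng.hypergeometric(n_a, n_b, sample_size, size=reps)
    return draws / sample_size

rng = np.random.default_rng(42)

# Hypothetical urn of six million voters (NOT the real vote counts).
n_a, n_b = 3_030_000, 2_970_000

# Sampling variation alone: polls of 1,400 voters.
unbiased = simulate_poll(n_a, n_b, 1_400, 10_000, rng)

# Hypothetical response bias: 5% of A's supporters never answer the poll,
# so they are effectively missing from the urn being sampled.
willing_a = round(n_a * 0.95)
biased = simulate_poll(willing_a, n_b, 1_400, 10_000, rng)

# A poll ten times larger drawn from the same biased urn.
biased_large = simulate_poll(willing_a, n_b, 14_000, 10_000, rng)

print(unbiased.mean(), biased.mean(), biased_large.mean())
print(unbiased.std(), biased_large.std())
```

In this sketch the larger poll has a smaller spread, but it remains centered on the same biased value, which is the sense in which more data does not fix response bias.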

In a second simulation study, we examine the efficacy of a COVID-19 vaccine. A designed experiment for the vaccine was carried out on over 50,000 volunteers. Abstracting the experiment to an urn model gives us a tool for studying assignment variation in randomized controlled experiments. Through simulation, we find the expected outcome of the clinical trial. Our simulation, along with careful examination of the data scope, debunks claims of vaccine ineffectiveness.
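
As a rough sketch of how assignment variation can be simulated with an urn, suppose the vaccine had no effect at all, so the volunteers who fall ill would do so regardless of which group they land in. Random assignment then amounts to drawing the treatment group from an urn of volunteers. All counts below (group sizes, number of cases, and the "observed" value) are hypothetical and are not the trial's actual figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial: 50,000 volunteers, half assigned to the vaccine.
n_volunteers = 50_000
n_treatment = 25_000
total_cases = 200  # hypothetical number who fall ill in either group

# Under the "no effect" assumption, the number of cases that land in the
# treatment group follows a hypergeometric distribution: draw n_treatment
# marbles from an urn with total_cases "ill" marbles and the rest "well".
sim_treatment_cases = rng.hypergeometric(
    total_cases, n_volunteers - total_cases, n_treatment, size=10_000
)

# Expected outcome under no effect: cases split roughly evenly.
print(sim_treatment_cases.mean())

# Hypothetical observed result: far fewer cases among the vaccinated.
observed_treatment_cases = 60
fraction_as_extreme = np.mean(sim_treatment_cases <= observed_treatment_cases)
print(fraction_as_extreme)
```

If chance assignment alone almost never produces a split as lopsided as the one observed, the "no effect" explanation becomes hard to defend.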

In addition to sampling variation and assignment variation, we also cast measurement error in terms of an urn model. We use multiple measurements from different times of the day to estimate the accuracy of an air quality sensor. Later, we provide a more comprehensive treatment of measurement error and instrument calibration for air quality sensors.
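
One way to picture measurement error as an urn, sketched under made-up assumptions: each reading equals the true value plus an error drawn with replacement from an urn of possible errors. The true value and the contents of the error urn below are hypothetical and do not come from any sensor data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true value and hypothetical urn of measurement errors.
true_value = 10.0
error_urn = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Each measurement draws one error from the urn, with replacement.
errors = rng.choice(error_urn, size=5_000, replace=True)
measurements = true_value + errors

# Repeated measurements let us estimate the instrument's bias and spread.
estimated_bias = measurements.mean() - true_value
estimated_sd = measurements.std(ddof=1)

print(estimated_bias, estimated_sd)
```

With real data the true value is unknown, so the bias cannot be computed this way, but the spread of repeated measurements still estimates the size of the measurement error.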