The Data Science Lifecycle

To close this section, we present the data science pipeline from a slightly different angle. No single process fits every data science problem, so it helps to see another perspective on how such problems are solved. In this way, you can round out your understanding of the field.

Simulation and Data Design

Summary

In this chapter, we used the analogy of drawing marbles from an urn to model random sampling from populations, random assignment of subjects to treatments in experiments, and measurement error. This framework enables us to run simulation studies for a hypothetical survey, experiment, or other chance process in order to study its behavior. We found the expected outcome of a clinical trial for a vaccine under the assumption that the treatment was not effective, and we studied the support for Clinton and Trump using samples drawn from the actual votes cast in the election. These simulation studies enabled us to quantify the typical deviations in the chance process and to approximate the distribution of summary statistics. That is, simulation studies can reveal the sampling distribution of a statistic and help us answer questions about the likelihood of observing results like ours under the urn model for variation.
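As a reminder of what such a simulation study looks like, here is a minimal sketch in Python with NumPy. The urn size, the share of marbles marked 1, the sample size, and the number of repetitions are all made-up values chosen for illustration; they are not the figures from the vaccine trial or the election examples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical urn: every number below is an assumption for illustration.
n_marbles = 10_000   # marbles in the urn
n_ones = 5_200       # marbles marked 1; the rest are marked 0
n_draws = 1_000      # marbles drawn per sample, without replacement
n_reps = 5_000       # number of simulated samples

# Each hypergeometric value is the count of 1s in one sample
# drawn without replacement from the urn.
counts = rng.hypergeometric(n_ones, n_marbles - n_ones, n_draws, size=n_reps)
proportions = counts / n_draws

# Summaries of the simulated sampling distribution of the sample proportion.
center = proportions.mean()
spread = proportions.std()

# How often does a sample show 50% or fewer 1s, even though the urn has 52%?
frac_at_most_half = np.mean(proportions <= 0.5)

print(f"center of sample proportions: {center:.4f}")
print(f"typical deviation:            {spread:.4f}")
print(f"share of samples <= 0.5:      {frac_at_most_half:.3f}")
```

The array `proportions` approximates the sampling distribution of the statistic, and `frac_at_most_half` is the kind of likelihood question the urn model lets us answer by simulation.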

The urn model reduces to a few basics: the number of marbles in the urn; what is written on each marble; the number of marbles to draw from the urn; and whether or not they are replaced between draws. From there we can simulate more and more complex data designs. However, the crux of the urn's usefulness is the mapping from the data design to the urn. If samples are not randomly drawn and subjects are not randomly assigned to treatments, then this framework can't help us understand our data and make decisions. On the other hand, we also need to remember that the urn is a simplification of the actual data collection process. If, in reality, there is bias in data collection, then the randomness we observe in the simulation doesn't capture the complete picture. Too often, data scientists wave these annoyances away and address only the variability described by the urn model. That was one of the main issues in the surveys predicting the outcome of the 2016 US Presidential election.
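The four basics of the urn can be written down directly as the inputs to a small helper. The sketch below assumes NumPy; the function name `draw_from_urn` and the six-marble urn are hypothetical, introduced only to show how each ingredient maps to an argument.

```python
import numpy as np

def draw_from_urn(marbles, n_draws, replace, rng):
    """Draw n_draws marbles from the urn and return the values drawn.

    marbles : array of values, one per marble (its length is the urn size)
    n_draws : number of marbles to draw
    replace : whether each marble is returned to the urn before the next draw
    """
    return rng.choice(marbles, size=n_draws, replace=replace)

rng = np.random.default_rng(0)

# Hypothetical urn: six marbles, three marked 1 and three marked 0.
urn = np.array([1, 1, 1, 0, 0, 0])

# Without replacement, drawing all six marbles must return every marble once.
without_replacement = draw_from_urn(urn, n_draws=6, replace=False, rng=rng)

# With replacement, the same marble can appear more than once.
with_replacement = draw_from_urn(urn, n_draws=6, replace=True, rng=rng)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Note that this helper captures only the chance variation in the draws. It says nothing about whether the urn itself matches the population we care about, which is where bias in data collection enters.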