The Data Science Lifecycle


Description

To bring this section to a close, we will present the data science pipeline slightly differently. There is no "one size fits all" approach to a given data science problem, so it is worth seeing a slightly different perspective on the process of solving one. In this way, you can round out your understanding of the field.

The Data Science Lifecycle

Data science is a rapidly evolving field. At the time of this writing people are still trying to pin down exactly what data science is, what data scientists do, and what skills data scientists should have. What we do know, though, is that data science uses a combination of methods and principles from statistics and computer science to draw insights from data. We use these insights to make all sorts of important decisions. Data science helps assess whether a vaccine works, filter out spam from our email inboxes, and advise urban planners where to build new housing.

This book covers fundamental principles and skills that data scientists need to perform analyses. To help you remember the bigger picture, we've organized these topics around a workflow for analysis that we call the data science lifecycle. This chapter introduces the data science lifecycle. It also provides a map for the rest of the book by showing you where each chapter fits into the lifecycle. Unlike other books that focus on one part of the lifecycle, this book covers the entire cycle from start to finish. We explain theoretical concepts and show how they work in practical case studies. Throughout the book, we rely on real data from analyses by other data scientists, not made-up data, so you can learn how to perform your own data analyses and draw sound conclusions.

Fig. 1.1 This diagram of the data science lifecycle shows its four high-level steps. The arrows show how the steps lead into one another.


Figure 1.1 shows the data science lifecycle. It's split into four stages: asking a question, obtaining data, understanding the data, and understanding the world. We've made these stages very broad on purpose. In our experience, the mechanics of a data analysis change frequently. Computer scientists and statisticians continue to build new software packages and programming languages for analysis, and they develop new analysis techniques that are more accurate and specialized. Despite these changes, we've found that almost every data analysis follows the four steps in our lifecycle. The first is to ask a question.

Asking a Question

Asking questions lies at the heart of data science because different kinds of questions require different kinds of analyses. For example, "How have house prices changed over time?" is very different from "How will this new law affect house prices?". Understanding our research question tells us what data we need, the patterns to look for, and how we should interpret our results. In this book, we focus on three broad categories of questions: exploratory, inferential, and predictive.

Exploratory questions aim to find out information about the data that we have. For example, we can use environmental data to ask: have average global temperatures risen in the past 40 years? The key part of an exploratory question is that it aims to summarize and interpret trends in the data without quantifying whether these trends will hold in data that we don't have. "How many people voted in the last election?" is an exploratory question. "How many people will vote in the next election?" is not an exploratory question.

Inferential questions, on the other hand, do quantify whether trends found in our data will hold in unseen data. Let's say we have data from a sample of hospitals across the US. We can ask whether air pollution is correlated with lung disease for the individuals in our sample - this is an exploratory question. We can also ask whether air pollution is correlated with lung disease for the entire US - this is an inferential question, since we're using our sample to infer a correlation for the entire US.

Note

Be careful not to confuse an inferential question with a question about causality. An inferential question asks whether a correlation exists. "Are people who are exposed to more air pollution more likely to develop lung disease?" is an inferential question. "Does air pollution cause lung disease?" is causal, not inferential. We typically cannot answer causal questions unless we have a randomized experiment (or assume one).

Predictive questions, like inferential questions, aim to quantify trends for unseen data. While inferential questions look for trends in the population, predictive questions aim to make predictions for individuals. An inferential question could ask: "What factors increase voter turnout in the US?" A predictive question could instead ask: "Given a person's income and education, how likely are they to vote?"

As we do a data analysis, we often change and refine our research questions. Each time we do so, it's important to consider what kind of question we want to answer.

In the next section, we'll talk about how our question affects the data we want to obtain.

Obtaining Data

In this step of the data science lifecycle, we obtain our data and understand how the data were collected. One of our goals in this stage is to understand what kinds of research questions we can answer using the data that we have. In our lifecycle, data analyses can begin with asking a question (the previous stage) or with obtaining data (this stage). When data are expensive and hard to gather, we define a precise research question first and then collect the exact data we need to answer the question. Other times, data are cheap and easily accessed. This is especially true for online data sources. For example, the Twitter website lets people quickly download millions of data points. When data are plentiful, we can also start an analysis by obtaining data, exploring it, and then asking research questions.

When we obtain data, we write down how the data were collected and what information the data contain. This isn't just for bookkeeping - the types of research questions we can answer depend greatly on the way the data were collected.


Understanding the Data

After obtaining data, we want to understand the data we have. A key part of understanding the data is doing exploratory data analysis, where we often create plots to uncover interesting patterns and summarize the data visually. We also look for problems in the data. Most real-world datasets have missing values, weird values, or other anomalies that we need to account for.

In our experience, this stage of the lifecycle is highly iterative. Understanding the data can lead to any of the other stages in the data science lifecycle. As we understand the data more, we often revise our research questions, or realize that we need to get data from a different source.

This stage incorporates both programming and statistical knowledge. To manipulate data, we write programs that clean data, transform data, and create plots. To find patterns and trends in the data, we use summary statistics and statistical models.

When our research questions are purely exploratory, we are only concerned about patterns in the data. In these cases, our analysis can end at this stage of the lifecycle. When our research questions are inferential or predictive, however, we proceed to the next stage of the lifecycle: understanding the world.


Understanding the World

In this stage of the lifecycle, we draw conclusions about our larger population, and sometimes even the world. The challenge is that our data are typically a relatively small sample compared to our population. To draw conclusions about the population, we use statistical techniques like A/B testing, confidence intervals, and simulation. We also use models like linear or logistic regression. This stage is relevant when our research questions are inferential or predictive. Our goal in this stage of the lifecycle is to quantify how well we think the trends we find in our sample can generalize.


Source: Sam Lau, Joey Gonzalez, and Deb Nolan, https://www.textbook.ds100.org/intro.html
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

Questions and Data Scope

As data scientists we use data to answer questions, and the quality of the data collection process can significantly impact the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general approach for understanding data collection and evaluating the usefulness of the data in addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, our data do not contain complete information (the scope is restricted in some way), yet we want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.

To motivate the need to think about these issues, we begin with an example of the power of big data and what can go wrong. We then provide a framework that can help you connect the goal of your study (your question) with the data collection process. We refer to this as the data scope, and provide terminology to help describe data scope, along with examples from surveys, government data, scientific instruments, and online resources. Later in this chapter, we consider what it means for data to be accurate. There, we introduce different forms of bias and variation, and describe conditions where they can arise. Throughout, the examples cover the spectrum of the sorts of data that you may use as a data scientist; these examples are from science, political elections, public health, and online communities.

Big Data and New Opportunities

The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like how traditional beat reporters hunt for news stories. The data life cycle for the data journalist begins with the search for existing data that might have an interesting story, rather than beginning with a research question and looking for how to collect new or use existing data to address the question.

Citizen science projects are another example. They engage many people (and instruments) in data collection. Collectively, these data are often made available to researchers who organize the project and often they are made available in repositories for the general public to further investigate.

The availability of administrative/organizational data creates other opportunities. Researchers can link data collected from scientific studies with, say medical data that have been collected for healthcare purposes; in other words, administrative data collected for reasons that don't directly stem from the question of interest can be useful in other settings. Such linkages can help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, posts on social media, and online network of friends and acquaintances, and can be quite complex.

When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional smaller research studies. We might even consider these large datasets as a replacement for scientific studies or essentially a census. This over-reach is referred to as the "big data hubris". Data with a large scope does not mean that we can ignore foundational issues of how representative the data are, nor can we ignore issues with measurement, dependency, and reliability. One well-known example is the Google Flu Trends tracking system.


Example: Google Flu Trends

Digital epidemiology, a new subfield of epidemiology, leverages data generated outside the public health system to study patterns of disease and health dynamics in populations. The Google Flu Trends (GFT) tracking system was one of the earliest examples of digital epidemiology. In 2007, researchers found that counting the searches people made for flu-related terms could accurately estimate the number of flu cases. It made headlines and got researchers excited about the possibilities of big data. However, GFT did not live up to expectations and was abandoned in 2015.

What went wrong with GFT? After all, it used millions of digital traces from online queries for terms related to influenza to predict flu activity. Despite initial success, in the 2011–2012 flu season, Google's data scientists found that GFT was not a substitute for more traditionally collected data from the Centers for Disease Control (CDC) surveillance reports, collected from laboratories across the United States. In comparison, GFT overestimated the CDC numbers for 100 out of 108 weeks (see Figure 2.1). Week after week, GFT came in too high for the cases of influenza, even though it was based on big data.

Fig. 2.1 Google Flu Trends (GFT) weekly estimates for influenza-like illness. Over 108 weeks, GFT (solid line) overestimated the actual CDC reports (dashed line) 100 times. Also plotted are predictions from a model based on 3-week-old CDC data and seasonal trends (dotted line).


Data scientists found that the GFT was not a substitute for more traditionally collected data from the CDC. A simple model built from past CDC reports that used 3-week-old CDC data and seasonal trends did a better job of predicting flu prevalence than GFT. That is, the GFT overlooked considerable information that could be extracted by basic statistical methods. This does not mean that big data captured from online activity is useless. In fact, researchers have shown that the combination of GFT data with CDC data can substantially improve on both GFT predictions and the CDC-based model [Lazer et al., 2014, Lazer, 2015]. It is often the case that combining different approaches leads to improvements over individual methods.

The GFT example shows us that even when we have tremendous amounts of information, the connections between the data, the topic of investigation, and the question being asked are paramount. Understanding this framework can help us avoid answering the wrong question, applying inappropriate methods to the data, and overstating our findings.

In the age of big data, we are tempted to collect more and more data. After all, a census gives us perfect information, so shouldn't big data be nearly perfect? A key factor to keep in mind is the scope of the data. What population do we want to study? How can we access information about that population? Who or what are we actually studying? Answers to these questions help us see potential gaps in our approach. This is the topic of the next section.

Target Population, Access Frame, Sample

An important initial step in the data life cycle is to express the question of interest in the context of the subject area and consider the connection between the question and the data collected to answer that question. It's good practice to do this before even thinking about the analysis or modeling steps because it may uncover a disconnect where the question of interest cannot be directly addressed with these data. As part of making the connection between the data collection process and the topic of investigation, we identify the population, the means of accessing the population, instruments of measurement, and additional protocols used in the collection process. These concepts help us understand the scope of the data, whether we aim to gain knowledge about a population, scientific quantity, physical model, social behavior, etc.

The target population consists of the collection of elements that you ultimately intend to describe and draw conclusions about. By element we mean those individuals that make up our population. The element may be a person in a group of people, a voter in an election, a tweet from a collection of tweets, or a county in a state. We sometimes call an element a unit or an atom.

The access frame is the collection of elements that are accessible to you for measurement and observation. These are the units by which you can access the target population. Ideally, the frame and population are perfectly aligned, meaning they consist of the exact same elements. However, the units in an access frame may be only a subset of the target population; additionally, the frame may include units that don't belong to the population. For example, to find out how a voter intends to vote in an election, you might call people by phone. Someone you call may not be a voter, so they are in your frame but not in the population. On the other hand, a voter who never answers a call from an unknown number can't be reached, so they are in the population but not in your frame.

The sample is the subset of units taken from the access frame to measure, observe, and analyze. The sample gives you the data to analyze to make predictions or generalizations about the population of interest.

The contents of the access frame, in comparison to the target population, and the method used to select units from the frame to be in the sample are important factors in determining whether or not the data can be considered representative of the target population. If the access frame is not representative of the target population, then the data from the sample is most likely not representative either. And, if the units are sampled in a biased manner, problems with representativeness also arise.

You will also want to consider time and place in the data scope. For example, the effectiveness of a drug trial tested in one part of the world where a disease is raging might not compare as favorably with a trial in a different part of the world where background infection rates are lower. Additionally, data collected for the purpose of studying changes over time, like the monthly measurements of CO2 in the atmosphere and the weekly reporting of Google searches for predicting flu trends, have a temporal structure that we need to be mindful of as we examine the data. At other times, there might be spatial patterns in the data. For example, the environmental health data, described later in this section, are reported for each census tract in the State of California, and we might, say, make maps to look for spatial correlations.

And, if you didn't collect the data, you will want to consider who did and for what purpose. This is especially relevant now since more data is passively collected instead of collected with a specific goal in mind. Taking a hard look at found data and asking yourself whether and how these data might be used to address your question can save you from making a fruitless analysis or drawing inappropriate conclusions.

For each of the following examples, we begin with a general question, narrow it to one that can be answered with data, and in doing so, we identify the target population, access frame, and sample. These concepts are represented by circles in a diagram of data scope, and the configuration of their overlap helps reveal key aspects of the scope. Also in each example, we describe relevant temporal and spatial features of the data scope.

What makes members of an online community active? Content on Wikipedia is written and edited by volunteers who belong to the Wikipedia community. This online community is crucial to the success and vitality of Wikipedia. In trying to understand how to incentivize members of online communities, researchers carried out an experiment with Wikipedia contributors as subjects. A narrowed version of the general question asks: do informal awards increase the activity of Wikipedia contributors? For this experiment, the target population is the collection of active contributors - those who made at least one contribution to Wikipedia in the month before the start of the study. Additionally, the target population was further restricted to the top 1% of contributors. The access frame eliminated anyone in the population who had received an informal incentive that month. The access frame purposely excluded some of the contributors in the population because the researchers wanted to measure the impact of an incentive, and those who had already received one incentive might behave differently. (See Figure 2.2).

Fig. 2.2 The access frame does not include the entire population because the experiment included only those contributors who had not already received an incentive. The sample is a randomly selected subset from the frame.


The sample is a randomly selected set of 200 contributors from the frame. The sampled contributors were observed for 90 days, and digital traces of their activities on Wikipedia were collected. Notice that the contributor population is not static; there is regular turnover. In the month prior to the start of the study, more than 144,000 volunteers produced content for Wikipedia. Selecting top contributors from among this group limits the generalizability of the findings, but given the size of the group of top contributors, if they can be influenced by an informal reward to maintain or increase their contributions, that is a valuable finding.

In many experiments and studies, we don't have the ability to include all population units in the frame. It is often the case that the access frame consists of volunteers who are willing to join the study/experiment.

Who will win the election? The outcome of the US presidential election in 2016 took many people and many pollsters by surprise. Most pre-election polls predicted Clinton would beat Trump by a wide margin. Political polling is a type of public opinion survey held prior to an election that attempts to gauge who people will vote for. Since opinions change over time, the focus is reduced to a "horse-race" question, where respondents are asked for whom they would vote in a head-to-head race if the election were tomorrow: Candidate A or Candidate B.

Polls are conducted regularly throughout the presidential campaign, and the closer we get to election day, the better we expect the polls to be at predicting the outcome, as preferences stabilize. Polls are also typically conducted statewide and later combined to make predictions for the overall winner. For these reasons, the timing and location of a poll matters. The pollster matters too; some have consistently been closer to the mark than others.

In these pre-election surveys, the target population consists of those who will vote in the election, which in this example was the 2016 US presidential election. However, pollsters can only guess at whether someone will vote in the election so the access frame consists of those deemed to be likely voters (this is usually based on their past voting record, but other factors may be used to determine this), and since people are contacted by phone, the access frame is limited to those who have a landline or mobile phone. The sample consists of those people in the frame who are chosen according to a random dialing scheme. (See Figure 2.3).

Fig. 2.3 This representation is typical of many surveys. The access frame does not cover all of the population and includes some who are not in the population.


Later, we discuss the impact on the election predictions of people's unwillingness to answer their phone or participate in the poll.

How do environmental hazards impact an individual's health? To address this question, the California Environmental Protection Agency (CalEPA), the California Office of Environmental Health Hazard Assessment (OEHHA), and the public developed the CalEnviroScreen project. The project studies connections between population health and environmental pollution in California communities using data collected from several sources, including demographic summaries from the U.S. census, health statistics from the California Office of Statewide Health Planning and Development, and pollution measurements from air monitoring stations around the state maintained by the California Air Resources Board.

Ideally, we want to study the people of California and assess the impact of these environmental hazards on an individual's health. However, in this situation, the data can only be obtained at the level of a census tract. The access frame consists of groups of residents living in the same census tract. So, the units in the frame are census tracts, and the sample is a census (all of the tracts), since data are provided for every tract in the state. (See Figure 2.4).

Fig. 2.4 The grid in the access frame represents the census tracts. The population, frame, and sample cover all Californians, but the grid limits measurements to the level of census tract.


Unfortunately, we cannot disaggregate the information in a tract to examine what happens to an individual person. This aggregation impacts the questions we can address and the conclusions that we can draw.

These examples have demonstrated some of the configurations a target, access frame, and sample might have, and the exercises at the end of this chapter provide a few more examples. When a frame doesn't reach everyone, we should consider how this missing information might impact our findings. Similarly, we ask what might happen when a frame includes those not in the population. Additionally, the techniques for drawing the sample can affect how representative the sample is of the population. When you think about the generalizability of your data findings, you also want to consider the quality of the instruments and procedures used to collect the data. If your sample is a census that matches your target, but the information is poorly collected, then your findings will be of little value. This is the topic of the next section.

Instruments and Protocols

When we consider the scope of the data, we also consider the instrument being used to take the measurements and the procedure for taking measurements, which we call the protocol. For a survey, the instrument is typically a questionnaire that an individual in the sample answers. The protocol for a survey includes how the sample is chosen, how nonrespondents are followed up, interviewer training, protections for confidentiality, etc.

Good instruments and protocols are important to all kinds of data collection. If we want to measure a natural phenomenon, such as the speed of light, we need to quantify the accuracy of the instrument. The protocol for calibrating the instrument and taking measurements is vital to obtaining accurate measurements. Instruments can go out of alignment and measurements can drift over time leading to poor, highly inaccurate measurements.

Protocols are also critical in experiments. Ideally, any factor that can influence the outcome of the experiment is controlled. For example, temperature, time of day, confidentiality of a medical record, and even the order of taking measurements need to be consistent to rule out potential effects from these factors getting in the way.

With digital traces, the algorithms used to support online activity are dynamic and continually re-engineered. For example, Google's search algorithms are continually tweaked to improve user service and advertising revenue. Changes to the search algorithms can impact the data generated from the searches, which in turn impact systems built from these data, such as the Google Flu Trend tracking system. This changing environment can make it untenable to maintain data collection protocols and difficult to replicate findings.

Many data science projects involve linking data together from multiple sources. Each source should be examined through this data-scope construct and any difference across sources considered. Additionally, matching algorithms used to combine data from multiple sources need to be clearly understood so that populations and frames from the sources can be compared.

Measurements from an instrument taken to study a natural phenomenon can also be cast in the scope-diagram of a target, access frame, and sample. This approach is helpful in understanding their accuracy.

Measuring Natural Phenomenon

The scope-diagram introduced for observing a target population can be extended to the situation where we want to measure a quantity such as the count of particles in the air, the age of a fossil, the speed of light, etc. In these cases we consider the quantity we want to measure as an unknown value. (This unknown value is often referred to as a parameter.) In our diagram, we shrink the target to a point that represents this unknown. The instrument's accuracy acts as the frame, and the sample consists of the measurements taken by the instrument within the frame. You might think of the frame as a dart board, where the instrument is the person throwing the darts. If they are reasonably good, the darts land within the circle, scattered around the bullseye. The scatter of darts corresponds to the measurements taken by the instrument. The target point is not seen by the dart thrower, but ideally it coincides with the bullseye.

To illustrate the concepts of measurement error and the connection to sampling error, we examine the problem of calibrating air quality sensors.

How accurate is my air quality monitor? Across the US, sensors to measure air pollution are widely used by individuals, community groups, and state and local air monitoring agencies. For example, on two days in September, 2020, approximately 600,000 Californians and 500,000 Oregonians viewed PurpleAir's map as fire spread through their states and evacuations were planned. PurpleAir creates air quality maps from crowdsourced data that streams in from their sensors.

Fig. 2.5 This representation is typical of many measurement processes. The access frame represents the measurement process, which reflects the accuracy of the instrument.


We can think of the data scope as follows: at any location and point in time, there is a true particle composition in the air surrounding the sensor; this is our target. Our instrument, the sensor, takes many measurements, in some cases a reading every second. These form a sample contained in the access frame, the dart board. If the instrument is working properly, the measurements are centered around the bullseye, and the target coincides with the bullseye. Researchers have found that low humidity can distort the readings so that they are too high.

We continue the dart board analogy in the next section to introduce the concepts of bias and variation, describe common ways in which a sample might not be representative of the population, and draw connections between accuracy and the protocol.

Accuracy

In a census, the access frame matches the population, and the sample captures the entire population. In this situation, if we administer a well-designed questionnaire, then we have complete and accurate knowledge of the population and the scope is complete. Similarly in measuring air quality, if our instrument has perfect accuracy and is properly used, then we can measure the exact value of the air quality. These situations are rare, if not impossible. In most settings, we need to quantify the accuracy of our measurements in order to generalize our findings to the unobserved. For example, we often use the sample to estimate an average value for a population, infer the value of a scientific unknown from measurements, or predict the behavior of a new individual. In each of these settings, we also want a quantifiable degree of accuracy. We want to know how close our estimates, inferences, and predictions are to the truth.

The analogy of darts thrown at a dart board that was introduced earlier can be useful in understanding accuracy. We divide accuracy into two basic parts: bias and variance (also known as precision). Our goal is for the darts to hit the bullseye on the dart board and for the bullseye to line up with the unseen target. The spray of the darts on the board represents the variance in our measurements, and the gap from the bullseye to the unknown value that we are targeting represents the bias. Figure 2.6 shows combinations of low and high bias and variance.

Fig. 2.6 In each of these diagrams, the dots represent the measurements taken. They form a scattershot within the access frame represented by the dart board. When the bullseye of the access frame is roughly centered on the targeted value (top row), the measurements are scattered around it and bias is low. The larger dart boards (right column) indicate a wider spread in the measurements.


Representative data puts us in the top row of the diagram, where there is low bias, meaning that the bullseye and the unseen target are in alignment. Ideally our instruments and protocols put us in the upper left part of the diagram, where the variance is also low. The pattern of points in the bottom row systematically miss the targeted value. Taking larger samples will not correct this bias.
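
To make the dart board analogy concrete, here is a minimal simulation sketch; the true value, bias, and spread below are made-up numbers chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(42)

true_value = 12.0   # the unseen target (say, a true pollutant concentration)
bias = 3.0          # gap between the instrument's bullseye and the target
spread = 0.5        # instrument precision (standard deviation of one reading)

# Measurements scatter around (true_value + bias), not around the target itself
small_sample = rng.normal(true_value + bias, spread, size=10)
large_sample = rng.normal(true_value + bias, spread, size=100_000)

# The average of the larger sample is far less variable, but the gap of about 3 remains
print(small_sample.mean() - true_value)
print(large_sample.mean() - true_value)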

Types of Bias

Bias comes in many forms. We describe some classic types here and connect them to our target-access-sample framework.

  • Coverage bias occurs when the access frame does not include everyone in the target population. For example, a survey based on cell-phone calls cannot reach those with only a landline or no phone. In this situation, those who cannot be reached may differ in important ways from those in the access frame.
  • Selection bias arises when the mechanism used to choose units for the sample tends to select certain units more often than they should. As an example, a convenience sample chooses the units most easily available. Problems can arise when those who are easy to reach differ in important ways from those harder to reach. Another example of selection bias can happen with observational studies and experiments. These studies often rely on volunteers (people who choose to participate), and this self-selection has the potential for bias, if the volunteers differ from the target population in important ways.
  • Non-response bias comes in two forms: unit and item. Unit non-response happens when someone selected for a sample is unwilling to participate, and item non-response occurs when, say, someone in the sample refuses to answer a particular survey question. Non-response can lead to bias if those who choose not to participate or to not answer a particular question are systematically different from those who respond.
  • Measurement bias happens when an instrument systematically misses the target in one direction. For example, low humidity can systematically give us incorrectly high measurements of air pollution. In addition, measurement devices can become unstable and drift over time and so produce systematic errors. In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.

Each of these types of bias can lead to situations where the data are not centered on the unknown targeted value. Often we cannot assess the potential magnitude of the bias, since little to no information is available on those who are outside of the access frame, less likely to be selected for the sample, or disinclined to respond. Protocols are key to reducing these sources of bias. Chance mechanisms to select a sample from the frame or to assign units to experimental conditions can eliminate selection bias. A non-response follow-up protocol to encourage participation can reduce non-response bias. A pilot survey can improve question wording and so reduce measurement bias. Procedures to calibrate instruments and protocols to take measurements in, say, random order can reduce measurement bias.

In the 2016 US Presidential Election, non-response bias and measurement bias were key factors in the inaccurate predictions of the winner. Nearly all voter polls leading up to the election predicted Clinton would win over Trump. Trump's upset victory came as a surprise. After the election, many polling experts attempted to diagnose where things went wrong in the polls. The American Association for Public Opinion Research found that the predictions were flawed for two key reasons:

  • Over-representation of college-educated voters. College-educated voters are more likely to participate in surveys than those with less education, and in 2016 they were more likely to support Clinton. Non-response biased the sample and over-estimated support for Clinton.
  • Voters were undecided or changed their preferences a few days before the election. Since a poll is static and can only directly measure current beliefs, it cannot reflect a shift in attitudes.

It's difficult to figure out whether people held back their preference or changed their preference and how large a bias this created. However, exit polls have helped polling experts understand what happened, after the fact. They indicate that in battleground states, such as Michigan, many voters made their choice in the final week of the campaign, and that group went for Trump by a wide margin.

Bias does not need to be avoided under all circumstances. If an instrument is highly precise (low variance) and has a small bias, then that instrument might be preferable to another with higher variance and no bias. As an example, biased studies are potentially useful to pilot a survey instrument or to capture useful information for the design of a larger study. Many times we can at best recruit volunteers for a study. Given this limitation, it can still be useful to enroll these volunteers in the study and use random assignment to split them into treatment groups. That's the idea behind randomized controlled experiments.

Whether or not bias is present, data typically also exhibit variation. Variation can be introduced purposely by using a chance mechanism to select a sample, and it can occur naturally through an instrument's precision. In the next section, we identify three common sources of variation.


Types of Variation

Variation that results from a chance mechanism has the advantage of being quantifiable.

  • Sampling variation results from using chance to take a sample. We can in principle compute the chance a particular sample is selected.
  • Assignment variation results from using chance to assign units to treatment groups in a controlled experiment. If we split the units up differently, then we can get different results from the experiment. This randomness allows us to compute the chance of a particular group assignment.
  • Measurement error for instruments results from the measurement process; if the instrument has no drift and a reliable distribution of errors, then when we take multiple measurements on the same object, we get variations in measurements that are centered on the truth.

The Urn Model is a simple abstraction that can be helpful for understanding variation. This model examines a container (an urn) full of identical marbles that have been labeled, and we use the simple action of drawing marbles from the urn to reason about sampling schemes, randomized controlled experiments, and measurement error. For each of these types of variation, the urn model helps us estimate the size of the variation using either probability or simulation. The example of selecting Wikipedia contributors to receive an informal award provides two examples of the urn model.

Recall that for the Wikipedia experiment, a group of 200 contributors was selected at random from 1,440 top contributors. These 200 contributors were then split, again at random, into two groups of 100 each. One group received an informal award and the other didn't. Here's how we use the urn model to characterize this process of selection and splitting:

  • Imagine an urn filled with 1,440 marbles that are identical in shape and size, and written on each marble is one of the 1,440 Wikipedia usernames. (This is the access frame).
  • Mix the marbles in the urn really well, select one marble and set it aside.
  • Repeat the mixing and selecting of the marbles to obtain 200 marbles.

The marbles drawn form the sample. Then, to determine which of the 200 contributors receives the informal award, we work with another urn.

  • In a second urn, put in the 200 marbles from the above sample.
  • Mix these marbles well and select one marble and set it aside.
  • Repeat. That is, choose 100 marbles, one at a time, mixing in between, and setting the chosen marble aside.

The 100 drawn marbles are assigned to the treatment group and correspond to the contributors who receive the award. The 100 left in the urn form the control group and receive no award.

Both the selection of the sample and the choice of award recipients use a chance mechanism. If we were to repeat the first sampling activity again, returning all 1,440 marbles to the original urn, then we would most likely get a different sample. This variation is the source of sampling variation. Likewise, if we were to repeat the random assignment process again (keeping the sample of 200 from the first step unchanged), then we would get a different treatment group. Assignment variation arises from this second chance process.
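
As a minimal sketch (not the original study's code), we can mimic this two-stage chance process with numpy, using placeholder labels for the 1,440 usernames.

import numpy as np

rng = np.random.default_rng(0)

# The access frame: 1,440 top contributors (placeholder labels stand in for usernames)
frame = np.array(["user_" + str(i) for i in range(1440)])

# First urn: draw the sample of 200 contributors without replacement
sample = rng.choice(frame, size=200, replace=False)

# Second urn: from those 200, draw 100 to receive the informal award (treatment);
# the 100 left behind form the control group
treatment = rng.choice(sample, size=100, replace=False)
control = np.setdiff1d(sample, treatment)

print(len(sample), len(treatment), len(control))  # 200 100 100

Rerunning the first draw produces a different sample (sampling variation); rerunning only the second draw, with the sample fixed, produces a different treatment group (assignment variation).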

The Wikipedia experiment provided an example of both sampling and assignment variation. In both cases, the researcher imposed a chance mechanism on the data collection process. Measurement error can at times also be considered a chance process that follows an urn model. We characterize the measurement error in the air quality sensors in this way in the following example.

If we can draw an accurate analogy between variation in the data and the urn model, the urn model provides us the tools to estimate the size of the variation. This is highly desirable because then we can give concrete values for the variation in our data. However, it's vital to confirm that the urn model is a reasonable depiction of the source of variation. Otherwise, our claims of accuracy can be seriously flawed. Knowing as much as possible about the data scope, including the instruments, protocols, and chance mechanisms used in data collection, is needed to apply urn models.

Summary

No matter the kind of data you are working with, before diving into cleaning, exploration, and analysis, take a moment to look into the data's source. If you didn't collect the data, ask yourself:

  • Who collected the data?
  • Why were the data collected?

Answers to these questions can help determine whether these found data can be used to address the question of interest to you.

Consider the scope of the data. Questions about the temporal and spatial aspects of data collection can provide valuable insights:

  • When were the data collected?
  • Where were the data collected?

Answers to these questions help you determine whether your findings are relevant to the situation that interests you, or whether your situation may not be comparable to this other place and time.

Core to the notion of scope are answers to the following questions:

  • What is the target population (or unknown parameter value)?
  • How was the target accessed?
  • What methods were used to select samples/take measurements?
  • What instruments were used and how were they calibrated?

Answering as many of these questions as possible can give you valuable insights as to how much trust you can place in your findings and how far you can generalize your findings.

This chapter has provided you with a terminology and framework for thinking about and answering these questions. The chapter has also outlined ways to identify possible sources of bias and variance that can impact the accuracy of your findings. To help you reason about bias and variance, we have introduced the following diagrams and notions:

  • Scope diagram to indicate the overlap between target population, access frame, and sample;
  • Dart board to describe an instrument's bias and variance; and
  • Urn model for situations when a chance mechanism has been used to select a sample from an access frame, divide a group into experimental treatment groups, or take measurements from a well calibrated instrument.

These diagrams and models boil down key concepts for understanding how to identify limitations and judge the usefulness of your data in answering your question. The next chapter continues the development of the urn model to more formally quantify accuracy and design simulation studies.

Simulation and Data Design

In this chapter, we develop the theory behind the chance processes introduced in the previous chapter. This theory makes the concepts of bias and variation more precise. We continue to motivate the accuracy of our data through the abstraction of an urn model that was first introduced in the previous chapter, and we use simulation studies to help us understand the data and make decisions based on them.

We begin with an artificial example of a small population; it's so small that we can list all the possible samples that can be drawn from the population. Then, we consider simple variations on drawing marbles from the urn to extend the urn model to more complex sampling designs, like those used in complex surveys.

Next, we use the urn model as a technical framework to design and run simulation studies to understand larger and more complex situations. We return to some of the examples from the previous chapter and, for example, dive deeper into understanding how the pollsters might have gotten the 2016 Presidential Election predictions wrong. We use the actual votes cast in Pennsylvania to simulate the sampling variation for a poll of 1,400 from six million voters. This simulation helps us uncover how response bias can skew polls, and convinces us that collecting a lot more data would not have helped the situation (another example of big data hubris).
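
As a rough preview of that kind of simulation, the sketch below draws many polls of 1,400 voters from a made-up population of six million; the true support share of 50.5% for the eventual winner is assumed purely for illustration.

import numpy as np

rng = np.random.default_rng(1)

n_voters = 6_000_000
true_support = 0.505   # assumed share for the eventual winner (illustrative only)
n_polled = 1_400

# Each simulated poll counts supporters among 1,400 voters drawn without replacement
polls = rng.hypergeometric(ngood=int(n_voters * true_support),
                           nbad=int(n_voters * (1 - true_support)),
                           nsample=n_polled,
                           size=10_000) / n_polled

# Sampling variation alone: poll estimates spread about 1.3 points around the truth,
# so a single poll of this size can easily miss a razor-thin margin
print(polls.mean(), polls.std())

The chapter's actual simulation goes further and incorporates response bias, which, unlike this sampling variation, does not shrink as the sample grows.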

In a second simulation study, we examine the efficacy of a COVID-19 vaccine. A designed experiment for the vaccine was carried out on over 50,000 volunteers. Abstracting the experiment to an urn model gives us a tool for studying assignment variation in randomized controlled experiments. Through simulation, we find the expected outcome of the clinical trial. Our simulation, along with careful examination of the data scope, debunks claims of vaccine ineffectiveness.

In addition to sampling variation and assignment variation, we also cast measurement error in terms of an urn model. We use multiple measurements from different times of the day to estimate the accuracy of an air quality sensor. Later, we provide a more comprehensive treatment of measurement error and instrument calibration for air quality sensors.

The Urn Model

The urn model is a simple abstraction of the chance mechanism for drawing indistinguishable marbles from a container, an urn. The randomness in the selection process of drawing marbles from an urn can be extended to many chance processes in real-life examples, and we can simulate this random behavior and use our findings to better understand the accuracy of our data. To explain the urn model, we use a small example with seven marbles. The urn is small enough that we can list all possible outcomes that might result from drawing marbles from the urn.

Let's use the SpaceX Starship prototypes to make up a small example. The prototypes are called SN1, SN2, …, where SN stands for "serial number", and in the first half of 2020, seven of these prototypes were built. Before deploying them, a few were pressure tested. Suppose we want to select three of the seven Starship prototypes for pressure testing. (While this example is made up, the context is based on the actual SpaceX program; pressure tests were made on the Starship prototypes SN1, SN2, …, SN7.)

We set up the urn model as follows: write a unique label on each of seven marbles, place all the marbles in the urn, mix them well, and draw three without looking and without replacing marbles between draws. The urn is small enough that we can list all possible samples of three marbles that can be drawn:

ABC  ABD  ABE  ABF  ABG  ACD  ACE
ACF  ACG  ADE  ADF  ADG  AEF  AEG
AFG  BCD  BCE  BCF  BCG  BDE  BDF
BDG  BEF  BEG  BFG  CDE  CDF  CDG
CEF  CEG  CFG  DEF  DEG  DFG  EFG

We use the labels A, B, etc. rather than SN1, SN2, etc. because they are shorter and easier to distinguish. Our list shows that we could wind up with any one of the 35 unique sets of three from the seven marbles.

We draw an analogy to data scope from the previous chapter: a set of marbles drawn from the urn is a sample, and the collection of all marbles placed in the urn is the population. This particular urn model prescribes a particular selection method, called the Simple Random Sample (SRS). We describe the SRS and other sampling techniques based on the SRS in the next section.


Sampling Designs

The urn model for the SpaceX prototypes reduced to a few basics. We specified: the number of marbles in the urn; what is written on each marble; the number of marbles drawn from the urn; whether or not they were replaced between draws. This process is equivalent to a Simple Random Sample. Our example is a SRS of three draws from a population of seven, and by design, each of the 35 samples is equally likely to be chosen because the marbles are indistinguishable and well mixed. This means the chance of any one particular sample must be 1/35,

\mathbb{P}(\textrm{ABC}) = \mathbb{P}(\textrm{ABD}) = \cdots = \mathbb{P}(\textrm{EFG}) = \frac{1}{35}

We use the special symbol \mathbb{P} to stand for "probability" or "chance", and we read the statement \mathbb{P}(\textrm{ABC}) as "the chance the sample contains the marbles labeled A, B, and C".

We now have a more formal definition of "representative data" that is very useful: In a Simple Random Sample, every sample has the same chance of being selected.

Note

Many people mistakenly think that the defining property of a SRS is that every unit has an equal chance of being in the sample. However, this is not the case. A SRS of n units from a population of N means that every possible subset of n units has the same chance of being selected.

We can use the enumeration of all of the possible samples from the urn to answer additional questions about this chance process. For example, to find the chance that marble A is in the sample, we can add up the chance of all samples that contain A. There are 15 of them so the chance is:

\mathbb{P}(\textrm{A is in the sample}) = \frac{15}{35} = \frac{3}{7}
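
For an urn this small, we can also check both of these calculations in code by enumerating every sample with Python's itertools; this is a quick sketch rather than part of the original example.

from itertools import combinations

marbles = ["A", "B", "C", "D", "E", "F", "G"]
samples = list(combinations(marbles, 3))

print(len(samples))                                        # 35 equally likely samples
print(sum("A" in s for s in samples), "of", len(samples))  # 15 of 35 contain A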

When it's too difficult to list and count all of the possible samples, we can use simulation to help understand this chance process.

The SRS (and its corresponding urn) is the main building block for more complex survey designs. We briefly describe two of the more widely used designs.

  • Stratified Sampling Divide the population into non-overlapping groups, called strata (one group is called a stratum and more than one are strata), and then take a simple random sample from each (see the sketch after this list). This is like having a separate urn for each stratum and drawing marbles from each urn, independently. The strata do not have to be the same size, and we need not take the same number of units from each.
  • Cluster Sampling Divide the population into non-overlapping subgroups (these tend to be smaller than strata), take a simple random sample of the clusters, and include all of the units in a cluster in the sample. We can think of this as a SRS from one urn that contains large marbles that are themselves containers of small marbles. When opened, the sample of large marbles turns into the sample of small marbles.
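
As a rough illustration of stratified sampling, the sketch below takes an independent simple random sample from each of two made-up strata (the labels and sizes are arbitrary).

import numpy as np

rng = np.random.default_rng(7)

# Two made-up strata of different sizes: one urn per stratum
north = np.arange(0, 60)     # 60 units labeled 0-59
south = np.arange(60, 100)   # 40 units labeled 60-99

# Take a simple random sample from each stratum, independently;
# the number drawn from each stratum need not be the same
sample_north = rng.choice(north, size=6, replace=False)
sample_south = rng.choice(south, size=4, replace=False)

stratified_sample = np.concatenate([sample_north, sample_south])
print(stratified_sample)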

Often, we are interested in a summary of the sample; that is, some statistic. For any sample, we can calculate the statistic, and the urn model helps us find the distribution of possible values that statistic may take on. In the next section, we examine the distribution of a statistic for our example.


Sampling Distribution of a Statistic

Suppose we are interested in whether or not the prototypes can pass a pressure test. It's expensive to carry out the pressure test, so we first test only a sample of prototypes. We can use the urn model to choose the prototypes to be pressure tested, and then we can summarize our test results by, say, the proportion of prototypes that fail the test. The urn model provides us the knowledge that each of the 35 possible samples has the same chance of being selected, and so the pressure test results are representative of the population.

For concreteness, suppose prototypes A, B, D, and F would fail the pressure test, if chosen. For each sample of three marbles, we can find the proportion of failures according to how many of these four defective prototypes are in the sample. Below are a few examples of this calculation.

Sample        ABC    BCE    BDF    CEG
Proportion    2/3    1/3    1      0


Since we are drawing three marbles from the urn, the only possible sample proportions are 0, 1/3, 2/3, and 1, and for each triple, we can calculate its corresponding proportion. For example, there are 4 samples that give us all failed tests (a sample proportion of 1). These are: ABD, ABF, ADF, BDF, so the chance of observing a sample proportion of 1 is 4/35. Below we have summarized the distribution of values for the sample proportion in a table.

Proportion of Fails    No. of Samples    Fraction of Samples
1                      4                 4/35
2/3                    18                18/35
1/3                    12                12/35
0                      1                 1/35
Total                  35                1
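
As a quick check, we can compute these counts directly with binomial coefficients using Python's math.comb.

from math import comb

total = comb(7, 3)  # 35 possible samples of three prototypes

# k failing prototypes in the sample come from the 4 that fail;
# the other 3 - k come from the 3 that pass
for k in range(3, -1, -1):
    count = comb(4, k) * comb(3, 3 - k)
    print(k, "fails:", count, "of", total, "samples")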


While these calculations are relatively straightforward, we can approximate them through a simulation study. To do this, we take samples of three from our population over and over, say 10,000 times. For each sample, we calculate the proportion of failures. That gives us 10,000 simulated sample proportions. The table of the simulated proportions should match the distribution table above. We confirm this with a simulation study.


Simulating the Sampling Distribution

Our original urn had seven marbles marked A through G. However, since we care only whether the prototype fails or passes the test, we can re-label each marble as 'fail' or 'pass'. We create this revised urn as an array.

urn = ['fail', 'fail', 'fail', 'fail', 'pass', 'pass', 'pass']

We simulate the draw of three marbles from our urn, without replacement between draws, using numpy's random.choice method as follows.

import numpy as np

np.random.choice(urn, size=3, replace=False)

array(['fail', 'pass', 'fail'], dtype='<U4')

Let's take a few more samples from our urn to see what the results might look like.

[np.random.choice(urn, size = 3, replace = False) for i in range(10)] 
[array(['pass', 'pass', 'fail'], dtype='<U4'),
 array(['pass', 'fail', 'pass'], dtype='<U4'),
 array(['fail', 'pass', 'fail'], dtype='<U4'),
 array(['fail', 'pass', 'fail'], dtype='<U4'),
 array(['fail', 'fail', 'fail'], dtype='<U4'),
 array(['pass', 'fail', 'pass'], dtype='<U4'),
 array(['fail', 'pass', 'fail'], dtype='<U4'),
 array(['pass', 'fail', 'fail'], dtype='<U4'),
 array(['pass', 'fail', 'pass'], dtype='<U4'),
 array(['pass', 'pass', 'fail'], dtype='<U4')]
Since we simply want to count the number of failures in the sample, it's easier if the marbles are labeled 1 for fail and 0 for pass. This way, we can sum the results of the three draws to get the number of failures in the sample. We re-label the marbles in the urn again, and compute the fraction of fails in a sample.

urn = [1, 1, 1, 1, 0, 0, 0]
sum(np.random.choice(urn, size=3, replace=False))/3
0.6666666666666666

We have streamlined the process, and we're now ready to carry out the simulation study. Let's repeat the process 10,000 times.

simulations = [sum(np.random.choice(urn, size=3, replace=False)) / 3
               for i in range(10000)]

We can study these 10,000 sample proportions and match our findings against what we calculated already using the table based on the enumeration of all 35 possible samples. We expect the simulation results to be close to our earlier calculations because we have repeated the sampling process many, many times. That is, we want to compare the fraction of the 10,000 sample proportions that are 0, 1/3, 2/3, and 1 to those in the table. These fractions should be, approximately, 1/35, 12/35, 18/35, and 4/35, or about 0.03, 0.34, 0.51, and 0.11.

unique_els, counts_els = np.unique(np.array(simulations), return_counts=True)
np.array((unique_els, counts_els/10000))
array([[0.  , 0.33, 0.67, 1.  ],
       [0.03, 0.35, 0.51, 0.11]])



The simulation results closely match the table.

This simulation study does not prove, say, that we expect 18/35 of the samples to have two fails, but it does give us excellent approximations to our earlier calculations, which is reassuring. More importantly, when we have a more complex setting where it might be difficult to list all possibilities, a simulation study can offer valuable insights.

Note
A simulation study repeats a random process many, many times. A summary of the patterns that result from the simulation can approximate the theoretical properties of the chance process. This summary is not the same as proving the theoretical properties, but often the guidance we get from the simulation is adequate for our purposes.

Drawing marbles from an urn with 0s and 1s is such a popular framework for understanding randomness that this chance process has been given a formal name: the hypergeometric. And most software provides the functionality to rapidly carry out simulations of the hypergeometric. We redo our simulation using the hypergeometric to complete this section.


The Hypergeometric

The version of the urn model where we count the number of marbles of a certain type (in our case, 'fail' marbles) is so common that there is a chance process named for it: the hypergeometric. Instead of using random.choice, we can use numpy's random.hypergeometric to simulate drawing marbles from the urn and counting the number of fails. The random.hypergeometric method is optimized for the 0-1 urn and allows us to ask for 10,000 simulations in one call. For completeness, we repeat our simulation study and calculate the empirical proportions.

simulations_fast = np.random.hypergeometric(ngood=4, nbad=3, nsample=3, size=10000)
Note: we don't think that a pass is "bad"; it's just a naming convention to call the type you want to count "good" and the other "bad".

unique_els, counts_els = np.unique(simulations_fast, return_counts=True)
np.array((unique_els, counts_els/10000))
array([[0.  , 1.  , 2.  , 3.  ],
       [0.03, 0.34, 0.52, 0.11]])
You might have asked yourself already: since the hypergeometric is so popular, why not provide the exact distribution of the possible values? In fact, these probabilities are available, and we show how to calculate them below.

from scipy.stats import hypergeom

x = np.arange(0, 4)
hypergeom.pmf(x, 7, 4, 3)
array([0.03, 0.34, 0.51, 0.11])
Perhaps the two most common chance processes are those that arise from counting the number of 1s drawn from a 0-1 urn: drawing without replacement gives the hypergeometric, and drawing with replacement gives the binomial.
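To make the contrast concrete, here is a small sketch of our own (not part of the original analysis) of the with-replacement version of the same urn: each of the three draws independently has a 4/7 chance of being a fail, so the number of fails follows a binomial distribution.

from scipy.stats import binom

# simulate 10,000 draws of three marbles *with* replacement from the 0-1 urn
simulations_binom = np.random.binomial(n=3, p=4/7, size=10_000)

# exact binomial probabilities of 0, 1, 2, and 3 fails, for comparison
binom.pmf(np.arange(4), 3, 4/7)

With replacement, the extreme counts (zero or three fails) are somewhat more likely than without replacement, because the draws no longer deplete the urn.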

Note
Whenever possible, it's a good idea to use the functionality provided in a third-party package for simulating from a named distribution, such as the random number generators offered in numpy, rather than writing your own function. It's best to take advantage of efficient and accurate code that others have developed.

While this simulation was simple, so simple that we could have used hypergeom.pmf to compute our distribution exactly, we wanted to demonstrate the intuition that a simulation study can reveal. The approach we take in this book is to develop understanding about chance processes based on simulation studies. However, we do formalize the notion of the probability distribution of a statistic (like the proportion of fails in a sample) in a later section.

Now that we have simulation as a tool for understanding accuracy, we can revisit the election example from the previous chapter and carry out a post-election study of what might have gone wrong with the voter polls. This simulation study imitates drawing more than a thousand marbles (voters who participate in the poll) from an urn of six million. We can examine potential sources of bias and the variation in the polling results, and carry out a what-if analysis, where we examine how the predictions might have gone if an even larger number of draws from the urn were taken.

Simulating Election Polls: Bias, Variance, and Big Data

The President of the US is chosen by the Electoral College, and not solely by popular vote. Each state is allotted a certain number of votes to cast in the Electoral College, according to the size of its population. Typically, whoever wins the popular vote in a state receives all of the Electoral College votes for that state. With the aid of polls conducted in advance of the election, pundits identify "battleground" states where the election is expected to be close and where the Electoral College votes might swing the election.

In 2016, nearly every prediction of the overall outcome of the election was wrong, even though pollsters correctly predicted the election outcome in 46 of the 50 states. For those 46 states, Trump received 231 and Clinton received 232 Electoral College votes. The remaining 4 states, Florida, Michigan, Pennsylvania, and Wisconsin, were identified as battleground states and accounted for a total of 75 votes. The margins of the popular vote in these four states were narrow. For example, in Pennsylvania Trump received 48.18% and Clinton received 47.46% of the 6,165,478 votes cast in the state. Such narrow margins can make it hard to predict the outcome given the sample sizes that the polls used.

Many experts have studied the 2016 election results to dissect and identify what went wrong. According to the American Association for Public Opinion Research (AAPOR), one online, opt-in poll adjusted their polling results for the education of the respondents but used only three broad categories (high school or less, some college, and college graduate). They found that if they had separated out those with advanced degrees from those with college degrees, then they would have reduced Clinton's margin by 0.5 percentage points. In other words, after the fact they were able to identify an education bias where highly educated voters tended to be more willing to participate in polls. This bias matters because these voters also tended to prefer Clinton over Trump.

Now that we know how people actually voted, we can carry out a simulation study that imitates election polling under different scenarios to help develop intuition for accuracy, bias, and variance. We will simulate the polls for Pennsylvania under two scenarios:

  1. People surveyed didn't change their minds, didn't hide who they voted for, and were representative of those who voted on election day.
  2. People with a higher education were more likely to respond, which led to a 0.5 percentage point bias for Clinton.

Our ultimate goal is to understand the frequency that a poll incorrectly calls the election for Hillary Clinton when a sample is collected with absolutely no bias and also when there is a small amount of non-response bias. We begin by setting up the urn model for the first scenario.


The Urn Model

Our urn model for carrying out a poll of Pennsylvania voters is an after-the-fact situation where we use the outcome of the election. The urn would have 6,165,478 marbles in it, one for each voter. As with our tiny population, we would write on each marble the candidate that the voter chose, draw 1,500 marbles from the urn (1,500 is about the typical size of such polls), and tally up the votes for Trump, Clinton, and any other candidate. From the tally, we can calculate Trump's lead over Clinton.

To set up our simulation, we figure out what the urn looks like. We need to know the number of votes cast for each of the candidates. Since we care only about Trump's lead over Clinton, we can lump all votes for other candidates together. This way each marble has one of three possible votes: Trump, Clinton, and Other. (We can't ignore the "Other" category, because it impacts the size of the lead.)

proportions = np.array([0.4818, 0.4746, 1 - (0.4818 + 0.4746)])               
n = 1_500
N = 6_165_478
votes = np.trunc(N * proportions).astype(int)
votes

array([2970527, 2926135,  268814])

This version of the urn model has three types of marbles in it. It is a bit more complex than the hypergeometric, but still common enough to have a named distribution: the multivariate hypergeometric. In Python, the urn model with more than two types of marbles is implemented as the scipy.stats.multivariate_hypergeom.rvs method. The function returns the number of each type of marble drawn from the urn. We call the function as follows.

from scipy.stats import multivariate_hypergeom

multivariate_hypergeom.rvs(votes, n)

array([747, 681,  72])

As before, each time we call multivariate_hypergeom.rvs we get a different sample and counts, e.g.,

multivariate_hypergeom.rvs(votes, n)

array([749, 697,  54])

We need to compute Trump's lead for each sample: (n_T − n_C)/n, where n_T is the number of Trump votes in the sample and n_C the number for Clinton. If the lead is positive, then the sample shows a win for Trump.

We know the actual lead was 0.4818 − 0.4746 = 0.0072. To get a sense of the variation in the poll, we can simulate the chance process of drawing from the urn over and over and examine the values that we get in return. Below we simulate 100,000 polls of 1,500 voters from the votes cast in Pennsylvania.

def trump_advantage(votes, n):
    sample_votes = multivariate_hypergeom.rvs(votes, n)
    return (sample_votes[0] - sample_votes[1]) / n

simulations = [trump_advantage(votes, n) for _ in range(100000)]

On average, the polling results show Trump with close to a 0.7% lead, as expected given the composition of the six-plus million votes cast.

np.mean(simulations)

0.007210106666666665

However, many times the lead in the sample was negative, meaning Clinton was the winner for that sample of voters. The histogram below shows the sampling distribution of Trump's advantage in Pennsylvania for a sample of 1,500 voters. A vertical dashed line marks 0: more often than not the sample calls the election for Trump, but there are many times when the poll of 1,500 shows Clinton in the lead.

[Figure: histogram of the simulated values of Trump's lead in the sample.]


In the 100,000 simulated polls, we find Trump the victor about 60% of the time:

np.count_nonzero(np.array(simulations) > 0) / 100000

0.60797

In other words, a given sample will correctly predict Trump's victory about 60% of the time, even if the sample was collected with absolutely no bias. Put differently, an unbiased sample will be wrong about 40% of the time.

We have used the urn model to study the variation in a simple poll, and we found how a poll's prediction might look if there were no bias in our selection process (the marbles are indistinguishable and every possible collection of 1,500 of the six-plus million marbles is equally likely). Next, we will see what happens when a little bias enters into the mix.


An Urn Model with Bias

"In a perfect world, polls sample from the population of voters, who would state their political preference perfectly clearly and then vote accordingly". That's the simulation study that we just performed. In reality, it is often difficult to control for every source of bias.

We investigate here the effect of a small education bias on the polling results. Specifically, we examine the impact of the 0.5 percentage point bias in favor of Clinton that we described earlier. This bias essentially means that we see a distorted picture of voter preferences in our poll. Instead of 47.46 percent of votes for Clinton, we have 47.96 percent, and we have 48.18 − 0.5 = 47.68 percent for Trump. We adjust the proportions of marbles in the urn to reflect this bias.

proportions_bias = np.array([0.4818 - 0.005, 0.4746 + 0.005, 1 - (0.4818 + 0.4746)])
proportions_bias

array([0.48, 0.48, 0.04])

votes_bias = np.trunc(N * proportions_bias).astype(int)
votes_bias

array([2939699, 2956963,  268814])

Now we carry out the simulation study again, this time with the biased urn, and find how often Trump was in the lead in the 100,000 samples.

simulations_bias = [trump_advantage(votes_bias, n) for i in range(100000)]

[Figure: histogram of the simulated values of Trump's lead in the sample under the biased urn.]



np.count_nonzero(np.array(simulations_bias) > 0) / 100000

0.44742
Now, Trump would have a positive lead in about 45% of the polls. Notice that the histograms from the two simulations are similar in shape: they are symmetric with tails of reasonable length, so they appear to roughly follow the normal curve. The second histogram is shifted slightly to the left, which reflects the non-response bias we introduced. Would increasing the sample size have helped? This is the topic of the next section.


Conducting Larger Polls

With our simulation study, we can gain insight into the impact of a larger poll on the sample lead. For example, we can try a sample size of 12,000, eight times the size of the actual poll, and run 100,000 simulations for both scenarios: the unbiased and the biased.

simulations_big = [trump_advantage(votes, 12000) for i in range(100000)] 
simulations_bias_big = [trump_advantage(votes_bias, 12000) for i in range(100000)]

scenario_no_bias = np.count_nonzero(np.array(simulations_big) > 0) / 100000
scenario_bias = np.count_nonzero(np.array(simulations_bias_big) > 0) / 100000
print(scenario_no_bias, scenario_bias)

0.78664 0.37131

The simulation shows that in the biased scenario, Trump's lead is detected in only about one-third of the simulated polls, even with the much larger sample. The spread of the histogram (below) of these results is narrower than the spread when only 1,500 voters were polled. Unfortunately, it's narrowing in on the wrong value. We haven't overcome the bias; we just have a more precise picture of the biased situation. Big data has not come to the rescue. Additionally, larger polls have other problems. They are often harder to conduct because pollsters are working with limited resources, and effort that could go into improving the data scope gets redirected to expanding the poll.
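One way to see this narrowing numerically is to compare the spread of the simulated leads at the two sample sizes. This is a quick check of our own, and the exact values vary from run to run, but the standard deviations should shrink by roughly a factor of the square root of 8, or about 2.8, in both scenarios.

# spread of the simulated leads: polls of 1,500 versus polls of 12,000
print(np.std(simulations), np.std(simulations_big))
print(np.std(simulations_bias), np.std(simulations_bias_big))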
[Figure: histogram of Trump's lead in simulated polls of 12,000 voters.]



After the fact, with multiple polls for the same election, we can detect bias. In a post-election analysis of over 4,000 polls for 600 state-level gubernatorial, senatorial, and presidential elections, it was found that, on average, election polls exhibit a bias of about 1.5 percentage points.

When the margin of victory is relatively small, as it was in 2016, a larger sample size reduces the sampling error, but unfortunately, if there is bias, then the predictions remain close to the biased estimate. If the bias pushes the prediction from one candidate (Trump) to another (Clinton), then we have a "surprise" upset. Pollsters develop voter selection schemes that attempt to reduce bias, like separating voter preference by education level, but, as in this case, it can be difficult, even impossible, to account for new, unexpected sources of bias. Polls are still useful; we just need to do a better job.

The polls that we simulated in this example were the simplest kind; indeed, their formal name is the simple random sample. We connect the urn model with survey sampling in a later section. Before that, we show how the urn model can be used for data collected by other means, such as from a randomized controlled experiment. That is the topic of the next section.

Simulating a Randomized Trial: Vaccine Efficacy

In a drug trial, volunteers for the trial either receive the new treatment or a placebo (a fake treatment). In an A/B test of a new feature on a Web site, visitors to the site would either see the new feature or the usual Web page. In both examples, we control the assignment of volunteers and visitors to groups, and in a randomized controlled experiment, we use a chance process to make the assignment.

In drug trials, scientists often essentially use an urn model to select the subjects for the treatment, and those not selected receive the placebo. With A/B testing, we often use a systematic approach, where, for example, every other visitor to the page is shown the new feature (see the exercises to learn more about systematic sampling). We can simulate the chance mechanism of the urn to better understand variation in the outcome of an experiment and the meaning of efficacy in clinical trials.
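As a small illustration of this idea (our own sketch, not code from an actual trial), randomly assigning half of a group of volunteers to treatment amounts to drawing their IDs from an urn.

volunteers = np.arange(10)                                        # ten hypothetical volunteer IDs
treatment = np.random.choice(volunteers, size=5, replace=False)   # IDs drawn from the 'urn'
control = np.setdiff1d(volunteers, treatment)                     # the rest receive the placebo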

Detroit Mayor Mike Duggan made national news in March 2021 when he turned down a shipment of over 6,000 Johnson & Johnson vaccine doses, stating that the citizens of his city should "get the best". The mayor was referring to the efficacy rate of the vaccine, which was reported to be about 66%. In comparison, Moderna and Pfizer both reported efficacy rates of about 95% for their vaccines.

On the surface, Duggan's reasoning seems valid, but the scopes of the three clinical trials are not comparable, meaning direct comparisons of the experiments are problematic. Moreover, the Centers for Disease Control (CDC) considers a 66% efficacy rate quite good, which is why the vaccine was given emergency approval.

We consider these points in turn, beginning with scope and then efficacy.


Scope

Recall that when we evaluate the scope of the data, we consider the who, when, and where of the study. For the Johnson & Johnson clinical trial, the participants:

  • included adults 18 and over, where roughly 40% had conditions, called comorbidities, associated with an increased risk for getting severe COVID-19;
  • enrolled in the study from October to November, 2020;
  • came from 8 countries across 3 continents, including the US and South Africa.

The participants in the Moderna and Pfizer trials were primarily from the US; roughly 40% had comorbidities for severe COVID-19, and the trials took place earlier, over the summer of 2020. The timing and location of the trials make them difficult to compare. Cases of COVID-19 were at a low point in the US during the summer, but they rose rapidly in the late fall. Also, a more contagious variant of the virus was spreading rapidly in South Africa at the time of the J&J trial.

Each clinical trial was designed to test a vaccine against the situation of no vaccine under similar circumstances through the random assignment of subjects to treatment and control groups. While the scope from one trial to the next is quite different, the randomization within a trial keeps the scope of the treatment and control groups roughly the same, which enables meaningful comparisons between groups in the same trial. Still, the scope was different enough across the three vaccine trials to make direct comparisons problematic.

How was the trial carried out for the Johnson & Johnson vaccine? To begin, 43,738 people enrolled in the trial. These participants were split into two groups at random. Half received the new vaccine, and the other half received a placebo, such as a saline solution. Then, everyone was followed for 28 days to see whether they contracted COVID-19.

A lot of information was recorded on each patient, such as their age, race, and sex, and, in addition, whether they caught COVID-19 and, if so, the severity of the disease. At the end of 28 days, 468 cases of COVID-19 had been found: 117 of these were in the treatment group, and 351 in the control group.

The random assignment of patients to treatment and control gives the scientists a framework to assess the effectiveness of the vaccine. The typical reasoning goes as follows:

  • Begin with the assumption that the vaccine is ineffective.
  • So, the 468 who caught COVID-19 would have caught it whether or not they received the vaccine.
  • And the remaining 43,270 people in the trial who did not get sick would have remained healthy whether or not they received the vaccine.
  • The split of 117 sick people in treatment and 351 in control was solely due to the chance process in assigning participants to treatment or control.

We can set up an urn model that reflects this scenario and then study, via simulation, the behavior of the experimental results.


The Urn Model

Our urn has 43,738 marbles, one for each person in the clinical trial. Since there were 468 cases of COVID-19 among them, we label 468 marbles with a 1 and the remaining 43,270 with 0. We draw half the marbles (21,869) from the urn to receive the treatment, and the remaining half receive the placebo. The results of the experiment are simply the count of the number of marbles marked 1 that were randomly drawn from the urn.

We can simulate this process to get a sense of how likely it would be, under these assumptions, to draw only 117 marbles marked 1 from the urn. Since we draw half of the marbles from the urn, we would expect about half of the 468, or 234, to be drawn. The simulation study gives us a sense of the variation that might result from the random assignment process. That is, the simulation can tell us what proportion of the random assignments would result in so few cases of the virus in the treatment group.
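As a quick check of this expectation (our own calculation, not from the original code), the mean of the corresponding hypergeometric distribution confirms the count of 234.

from scipy.stats import hypergeom

# expected number of cases among the 21,869 people assigned to treatment,
# under the assumption that the vaccine is ineffective: 21,869 * 468 / 43,738
hypergeom.mean(43_738, 468, 21_869)

234.0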

Note

There are several key assumptions that enter into the urn model, such as the assumption that the vaccine is ineffective. The random assignment of patients to treatment is what enables us to carry out a simulation study. It's important to keep track of the reliance on these assumptions. Our simulation study gives us an approximation of the rarity of an outcome like the one observed only under these key assumptions.

As before, we can simulate the urn model using the hypergeometric probability distribution, rather than having to program the urn sampling from scratch.

simulations_fast = np.random.hypergeometric(ngood=468, nbad=43270, nsample=21869, size=500000)
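To see how the observed 117 cases compare with these simulated counts, one direct check (our own addition; the exact minimum varies from run to run) is to look at the smallest simulated count and the number of runs at or below 117.

# smallest simulated count, and how many of the 500,000 runs are at or below 117
simulations_fast.min(), np.count_nonzero(simulations_fast <= 117)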

[Figure: histogram of the simulated number of cases in the treatment group.]



In our simulation, we repeated the process of random assignment to the treatment group 500,000 times. Not one of the 500,000 simulations produced 117 or fewer cases in the treatment group. It would be an extremely rare event to see so few cases of COVID-19 if, in fact, the vaccine were not effective. A more useful computation is the vaccine's efficacy, which we describe next.


Vaccine Efficacy

Vaccine efficacy (VE) is computed as

\dfrac{\textrm{Risk among unvaccinated group} − \textrm{Risk among vaccinated group}}{\textrm{Risk among unvaccinated group}},

where the risk among the unvaccinated is the proportion of the unvaccinated who contracted COVID-19 and, similarly, the risk among the vaccinated is the proportion of the vaccinated who contracted COVID-19. Since the two groups are the same size, we can ignore the denominators of the risks and compute the efficacy as follows.

(351 - 117) / 351

0.6666666666666666
The Centers for Disease Control sets a standard of 50% for VE when deciding whether to adopt a new vaccine. How many cases in the treatment group would this be equivalent to? Since the two groups are the same size, a 50% VE means the treatment group has half as many cases as the control group; with 468 cases in total, that works out to 468/3 cases.

468/3

156.0
We can see from the histogram that none of the simulations yielded 156 or fewer cases in the treatment group.

Furthermore, many scientists argued that we should look at the efficacy for preventing severe cases of COVID-19. In this scenario, the J&J vaccine was over 80% effective. Additionally, no deaths were observed in the treatment group.

After the problems with comparing drug trials that have different scopes were explained, along with the vaccine's efficacy for preventing severe cases of COVID-19, the Mayor of Detroit retracted his original statement, saying, "I have full confidence that the Johnson & Johnson vaccine is both safe and effective".

This example has shown that:

  • Using a chance process in the assignment of subjects to treatments in clinical trials can help us answer what-if scenarios;
  • Considering data scope can help us determine whether it is reasonable to compare figures from different datasets.

Another example of the usefulness of the urn model is in modeling measurement error.

Measurement Error: Air Quality Variation

Simulating the draw of marbles from an urn is a useful abstraction for studying the possible outcomes from survey samples and controlled experiments. The simulation works because it imitates the chance mechanism used to select the sample or to assign the treatment. In many settings, measurement error also follows a similar chance process. As mentioned, instruments typically have an error associated with them, and by taking repeated measurements on the same object, we can quantify the variability associated with the instrument.

As an example, let's look at data from a PurpleAir sensor that measures air quality. PurpleAir provides a data download tool so anyone can access air quality measurements by interacting with their map. These data are available in 2-minute intervals for any sensor appearing on their map. To get a sense of the size of the variations in measurements for a sensor, we downloaded data for one sensor from a 24-hour period and selected five 60-minute periods throughout the day to examine, giving us thirty consecutive measurements in each period, for a total of 150 measurements. These are available in 'data/purpleAir2minsample.csv'.

import pandas as pd

pm = pd.read_csv('data/purpleAir2minsample.csv')
pm

     aq2.5                     time  hour  diffs  meds
0     6.14  2022-04-01 00:01:10 UTC     0   0.77  5.38
1     5.00  2022-04-01 00:03:10 UTC     0  -0.38  5.38
2     5.29  2022-04-01 00:05:10 UTC     0  -0.09  5.38
...    ...                      ...   ...    ...   ...
147   8.08  2022-04-01 19:57:20 UTC    19  -0.47  8.55
148   7.38  2022-04-01 19:59:20 UTC    19  -1.18  8.55
149   7.26  2022-04-01 20:01:20 UTC    19  -1.29  8.55

150 rows × 5 columns

The feature aq2.5 refers to the amount of particulate matter measured in the air that has a diameter smaller than 2.5 micrometers (the unit of measurement is micrograms per cubic meter, μg/m3). These measurements are 2-minute averages. The scatter plot below gives us a sense of the variation in the instrument: within each hour, we expect the measurements to be roughly the same, so the spread within an hour reflects the variability of the instrument. At 11 in the morning, the measurements clump around 7, and five hours later, they cluster around 10 or so. We see that the level of particulate matter changes over the course of a day, and in any one hour most measurements are within about +/-1 of the median.


To get a better sense of this variation, we examine the differences of each individual measurement from the median for the hour. Below is a histogram of these differences.
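The diffs and meds columns in the file appear to hold exactly these quantities: the median for the hour and each measurement's deviation from it. If we had to derive them ourselves, a sketch (assuming those column meanings) would be:

# median PM2.5 for each hour, attached to every row in that hour
hourly_median = pm.groupby('hour')['aq2.5'].transform('median')

# deviation of each 2-minute measurement from its hourly median
deviation = pm['aq2.5'] - hourly_median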
[Figure: histogram of the deviations of the 2-minute measurements from their hourly medians.]



The histogram shows us that the typical error for PM2.5 measurements from this instrument is about 1 μg/m3. Given that the measurements range from 4 to 13, we find the instrument is relatively accurate compared to the typical measurement. Unfortunately, what we don't know is whether the measurements are close to the true air quality at that time and place. To detect bias in the instrument, we need to make comparisons against a more accurate instrument or take measurements in a protected environment where the air has a known quantity of PM2.5. The remainder of this chapter is dedicated to a more formal treatment of these concepts using probability. We begin by tackling random sampling schemes.

Summary

In this chapter, we used the analogy of drawing marbles from an urn to model random sampling from populations, random assignment of subjects to treatments in experiments, and measurement error. This framework enables us to run simulation studies for a hypothetical survey, experiment, or other chance process in order to study their behavior. We found the expected outcome of a clinical trial for a vaccine under the assumption that the treatment was not effective, and we studied the support for Clinton and Trump with a sample based on the actual votes cast in the election. These simulation studies enabled us to quantify the typical deviations of the chance process and to approximate the distribution of summary statistics. That is, simulation studies can reveal the sampling distribution of a statistic and help us answer questions about the likelihood of observing results like ours under the urn model for variation.

The urn model reduces to a few basics: the number of marbles in the urn; what is written on each marble; the number of marbles to draw from the urn; and whether or not they are replaced between draws. From there, we can simulate more and more complex data designs. However, the crux of the urn's usefulness is the mapping from the data design to the urn. If samples are not randomly drawn and subjects are not randomly assigned to treatments, then this framework can't help us understand our data and make decisions. On the other hand, we also need to remember that the urn is a simplification of the actual data collection process. If, in reality, there is bias in data collection, then the randomness we observe in the simulation doesn't capture the complete picture. Too often, data scientists wave these annoyances away and address only the variability described by the urn model. That was one of the main issues with the surveys predicting the outcome of the 2016 US Presidential election.

Exercises

  1. In cluster sampling, the population is divided into non-overlapping subgroups, which tend to be smaller than strata. The sampling method is to take a simple random sample of the clusters and include all of the units in a cluster in the sample. Use our urn analogy to express cluster sampling. As a simple example, suppose our population of 7 starship prototypes is placed into 4 clusters as follows: (A,B), (C,D), (E,F), (G). Suppose we take a SRS of 2 clusters.
       1. List all of the possible samples that might result.
       2. What is the chance that A is in the sample?
       3. What is the chance that A, C, and E are all in the sample?

Cluster sampling has the distinct advantage of making sample collection easier. For example, it is much easier to poll 100 homes of 2-4 people each than to poll 300 individuals. But, since people in a cluster tend to be similar to each other, we need to keep the sampling procedure in mind as we generalize from sample to population.

  2. Systematic sampling is another popular technique. To start, the population is ordered, and the first unit is selected at random from the first k elements. Then, every k^{th} unit after that is placed in the sample. As a simple example, suppose our population of 7 prototypes is ordered alphabetically and we select one of the first two, A or B, at random, and then every second element after that.
       1. List all of the possible samples that might result.
       2. What is the chance that A is in the sample?
       3. What is the chance that A and B are in the sample? A and C?

Intercept surveys are when a popup window asks you to complete a brief questionnaire. If every k^{th} visitor to a website is asked to complete a brief survey, then we have a systematic sample. Here the population consists of visits to the site, and the ordering for systematic sampling is the order of the visits. It seems reasonable to imagine that this ordering wouldn't introduce a selection bias into the sampling process.