The Data Science Lifecycle

To bring this section to a close, we present the data science pipeline from a slightly different angle. No single approach fits every data science problem, so it is worth seeing another perspective on the process of solving one. In this way, you can round out your understanding of the field.

Simulation and Data Design

Measurement Error: Air Quality Variation

Simulating the draw of marbles from an urn is a useful abstraction for studying the possible outcomes from survey samples and controlled experiments. The simulation works because it imitates the chance mechanism used to select the sample or to assign the treatment. In many settings, measurement error follows a similar chance process. As mentioned, instruments typically have an error associated with them, and by taking repeated measurements on the same object, we can quantify the variability associated with the instrument.
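To make the urn analogy concrete, here is a minimal sketch of repeated measurements on one object, where each measurement is the true value plus an error drawn with replacement from an urn. The true value, the marbles in the urn, and the function name `simulate_measurements` are all hypothetical choices made for illustration; they are not taken from any real instrument.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical values chosen only for illustration.
true_value = 7.0                                     # assumed true quantity being measured
error_urn = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])    # assumed "marbles": possible errors

def simulate_measurements(n, true_value, error_urn, rng):
    """Return n measurements, each the true value plus one error drawn from the urn."""
    errors = rng.choice(error_urn, size=n, replace=True)
    return true_value + errors

# Thirty repeated measurements of the same object.
measurements = simulate_measurements(30, true_value, error_urn, rng)

# The spread of the repeated measurements quantifies the instrument's variability.
spread = np.std(measurements)
print(measurements)
print(spread)
```

Because the true value is held fixed, all of the variation in `measurements` comes from the chance draws, which is the sense in which repeated measurements reveal the variability of an instrument.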

As an example, let's look at data from a PurpleAir sensor that measures air quality. PurpleAir provides a data download tool so anyone can access air quality measurements by interacting with their map. These data are available in 2-minute intervals for any sensor appearing on their map. To get a sense of the size of the variations in measurements for a sensor, we downloaded data for one sensor from a 24-hour period and selected five 60-minute periods throughout the day to examine. Each period gives us thirty consecutive measurements, for a total of 150 measurements. These are available in 'data/purpleAir2minsample.csv'.

import pandas as pd

pm = pd.read_csv('data/purpleAir2minsample.csv')
pm

     aq2.5                     time  hour  diffs  meds
0     6.14  2022-04-01 00:01:10 UTC     0   0.77  5.38
1     5.00  2022-04-01 00:03:10 UTC     0  -0.38  5.38
2     5.29  2022-04-01 00:05:10 UTC     0  -0.09  5.38
...    ...                      ...   ...    ...   ...
147   8.08  2022-04-01 19:57:20 UTC    19  -0.47  8.55
148   7.38  2022-04-01 19:59:20 UTC    19  -1.18  8.55
149   7.26  2022-04-01 20:01:20 UTC    19  -1.29  8.55

150 rows × 5 columns

The feature aq2.5 refers to the amount of particulate matter in the air with a diameter smaller than 2.5 micrometers, measured in micrograms per cubic meter (μg/m³). These measurements are 2-minute averages. The scatter plot below gives us a sense of the variation in the instrument: within each hour, we expect the air quality to be roughly constant, so the spread of the measurements in that hour reflects the variability of the instrument. At 11 in the morning, the measurements clump around 7, and five hours later, they cluster around 10 or so. We see that the level of particulate matter changes over the course of a day, and in any one hour most measurements are within about ±1 of the median.
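The same comparison across hours can be made numerically. The sketch below summarizes measurements by hour, using only the six rows visible in the table above rather than the full file, so the medians it produces differ from the `meds` column, which is based on all thirty measurements per hour. The names `rows` and `by_hour` are hypothetical.

```python
import pandas as pd

# The six rows shown in the printed table (not the full 150-row file).
rows = pd.DataFrame({
    'aq2.5': [6.14, 5.00, 5.29, 8.08, 7.38, 7.26],
    'hour': [0, 0, 0, 19, 19, 19],
})

# One summary row per hour: the level (median) and the extremes within the hour.
by_hour = rows.groupby('hour')['aq2.5'].agg(['median', 'min', 'max'])
by_hour['range'] = by_hour['max'] - by_hour['min']
print(by_hour)
```

Applied to the full data frame `pm`, the same `groupby` call would give one row for each of the five hours examined.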


To get a better sense of this variation, we examine the differences of each individual measurement from the median for the hour. Below is a histogram of these differences.
[Figure: histogram of the deviations from the hourly median; x-axis: "Deviation from Hourly Median"]
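The deviations in the histogram can be computed with a grouped median. The following is a minimal sketch, assuming a data frame with the `aq2.5` and `hour` columns shown earlier; the function name and the output columns `hourly_median` and `deviation` are hypothetical, and the demonstration uses only the six rows visible in the printed table, so its numbers are not representative of the full file.

```python
import pandas as pd

def deviations_from_hourly_median(df):
    """Return a copy of df with each hour's median and each row's deviation from it."""
    out = df.copy()
    # transform('median') repeats the hour's median on every row of that hour.
    out['hourly_median'] = out.groupby('hour')['aq2.5'].transform('median')
    out['deviation'] = out['aq2.5'] - out['hourly_median']
    return out

# The six rows shown in the printed table (not the full 150-row file).
sample = pd.DataFrame({
    'aq2.5': [6.14, 5.00, 5.29, 8.08, 7.38, 7.26],
    'hour': [0, 0, 0, 19, 19, 19],
})

out = deviations_from_hourly_median(sample)

# One possible summary of the typical size of a deviation: the median absolute deviation.
typical = out['deviation'].abs().median()
print(out)
print(typical)
```

Calling `deviations_from_hourly_median(pm)` on the full data should reproduce, up to rounding, the `meds` and `diffs` columns already present in the file.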



The histogram shows us that the typical error for PM2.5 measurements from this instrument is about 1 μg/m³. Given that the measurements range from 4 to 13, the instrument's variability is small relative to the typical measurements; in other words, the instrument is relatively precise. Unfortunately, what we don't know is whether the measurements are close to the true air quality at that time and place. To detect bias in the instrument, we need to make comparisons against a more accurate instrument or take measurements in a protected environment where the air has a known quantity of PM2.5.

The remainder of this chapter is dedicated to a more formal treatment of these concepts using probability. We begin by tackling random sampling schemes.