The Data Science Lifecycle

To bring this section to a close, we present the data science pipeline from a slightly different perspective. There is no one-size-fits-all process for a data science problem, so seeing the process of solving one from another angle helps round out your understanding of the field.

Questions and Data Scope

Instruments and Protocols

When we consider the scope of the data, we also consider the instrument being used to take the measurements and the procedure for taking them, which we call the protocol. For a survey, the instrument is typically a questionnaire that an individual in the sample answers. The protocol for a survey includes how the sample is chosen, how nonrespondents are followed up, how interviewers are trained, how confidentiality is protected, and so on.
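As a small illustration of the sampling step in such a protocol, the sketch below draws a simple random sample from a hypothetical sampling frame and flags nonrespondents for follow-up. The frame, sample size, and response rate are all invented for the example.

```python
import random

random.seed(42)

# Hypothetical sampling frame: identifiers for everyone we are able to reach.
frame = [f"person_{i}" for i in range(10_000)]

# Protocol step 1: draw a simple random sample from the frame.
sample = random.sample(frame, k=500)

# Protocol step 2: record who responds; nonrespondents get a follow-up contact.
# (Responses are simulated here with an assumed 60% response rate.)
responded = {person: random.random() < 0.6 for person in sample}
nonrespondents = [p for p, answered in responded.items() if not answered]

print(f"sampled: {len(sample)}, follow-ups needed: {len(nonrespondents)}")
```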

Good instruments and protocols are important to all kinds of data collection. If we want to measure a natural phenomenon, such as the speed of light, we need to quantify the accuracy of the instrument. The protocol for calibrating the instrument and taking measurements is vital to obtaining accurate measurements. Instruments can fall out of calibration and measurements can drift over time, leading to highly inaccurate readings.
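To see how drift can degrade measurements, the following sketch simulates repeated readings of a fixed quantity from an instrument whose bias grows slowly over time. The true value, drift rate, and noise level are arbitrary choices for illustration; the point is that early and late readings disagree when the instrument is never recalibrated.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 100.0       # the quantity being measured (arbitrary units)
n_measurements = 1_000
drift_per_step = 0.002   # assumed slow growth in the instrument's bias
noise_sd = 0.5           # assumed random measurement error

# The bias accumulates over time because the instrument is never recalibrated.
bias = drift_per_step * np.arange(n_measurements)
readings = true_value + bias + rng.normal(0, noise_sd, n_measurements)

# Comparing early and late readings reveals the drift.
print(f"mean of first 100 readings: {readings[:100].mean():.2f}")
print(f"mean of last 100 readings:  {readings[-100:].mean():.2f}")
```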

Protocols are also critical in experiments. Ideally, any factor that can influence the outcome of the experiment is controlled. For example, temperature, time of day, confidentiality of a medical record, and even the order in which measurements are taken need to be kept consistent to rule out potential effects from these factors.
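One common way a protocol guards against such factors is to randomize the order of measurements. The sketch below shuffles a set of hypothetical experimental runs so that slowly changing conditions, such as temperature or time of day, are not confounded with the treatment; the treatment names and run counts are made up for the example.

```python
import random

random.seed(0)

# Hypothetical experiment: two treatments, each measured ten times.
runs = [("treatment_A", i) for i in range(10)] + \
       [("treatment_B", i) for i in range(10)]

# Randomize the run order so that conditions that change over the course of
# the day are spread across both treatments rather than piling up on one.
random.shuffle(runs)

for treatment, replicate in runs[:5]:
    print(f"run {treatment}, replicate {replicate}")
```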

With digital traces, the algorithms used to support online activity are dynamic and frequently re-engineered. For example, Google's search algorithms are continually tweaked to improve user service and advertising revenue. Changes to the search algorithms can affect the data generated from searches, which in turn affects systems built from these data, such as the Google Flu Trends tracking system. This changing environment can make it untenable to maintain data collection protocols and difficult to replicate findings.

Many data science projects involve linking data from multiple sources. Each source should be examined through this data scope construct, and any differences across sources considered. Additionally, the matching algorithms used to combine data from multiple sources need to be clearly understood so that the populations and frames of the sources can be compared.
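As a small illustration of how linkage interacts with data scope, the following sketch joins two hypothetical sources on a shared identifier. An inner join keeps only the records found in both sources, so the linked data cover a narrower population than either source alone; the tables and match key are invented for the example.

```python
import pandas as pd

# Two hypothetical sources that cover overlapping but different populations.
clinic = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, 51, 29, 62],
})
registry = pd.DataFrame({
    "patient_id": [2, 3, 5],
    "diagnosis": ["A", "B", "C"],
})

# An inner join keeps only patients present in both sources, so the linked
# data describe a smaller population than either source on its own.
linked = clinic.merge(registry, on="patient_id", how="inner")
print(linked)
print(f"clinic: {len(clinic)}, registry: {len(registry)}, linked: {len(linked)}")
```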

Measurements taken from an instrument to study a natural phenomenon can also be cast in the scope diagram of target, access frame, and sample. This framing is helpful in understanding their accuracy.