The Data Science Lifecycle

To close this section, we present the data science pipeline from a slightly different angle. No single approach fits every data science problem, so seeing another perspective on the problem-solving process can round out your understanding of the field.

Questions and Data Scope

Summary

No matter the kind of data you are working with, before diving into cleaning, exploration, and analysis, take a moment to look into the data's source. If you didn't collect the data, ask yourself:

  • Who collected the data?
  • Why were the data collected?

Answers to these questions can help determine whether these found data can be used to address the question of interest to you.

Consider the scope of the data. Questions about the temporal and spatial aspects of data collection can provide valuable insights:

  • When were the data collected?
  • Where were the data collected?

Answers to these questions help you determine whether your findings are relevant to the situation that interests you, or whether your situation may not be comparable to the place and time in which the data were collected.

Core to the notion of scope are answers to the following questions:

  • What is the target population (or unknown parameter value)?
  • How was the target accessed?
  • What methods were used to select samples/take measurements?
  • What instruments were used and how were they calibrated?

Answering as many of these questions as possible tells you how much trust you can place in your findings and how far you can generalize them.

This chapter has provided you with a terminology and framework for thinking about and answering these questions. The chapter has also outlined ways to identify possible sources of bias and variance that can impact the accuracy of your findings. To help you reason about bias and variance, we have introduced the following diagrams and notions:

  • Scope diagram to indicate the overlap between target population, access frame, and sample;
  • Dart board to describe an instrument's bias and variance; and
  • Urn model for situations when a chance mechanism has been used to select a sample from an access frame, divide a group into experimental treatment groups, or take measurements from a well-calibrated instrument.
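
To make the urn model concrete, here is a minimal sketch of one way to simulate it. It assumes an access frame of 1,000 units, 600 of them marked 1 and 400 marked 0, and draws a simple random sample of 50 without replacement many times over. The counts, the seed, and the function names (draw_from_urn, simulate_urn) are hypothetical choices made for illustration and do not come from the chapter. Repeating the draw gives a rough sense of the bias and variance of the sample proportion under this chance mechanism.

```python
import random
import statistics


def draw_from_urn(urn, sample_size, rng):
    """Draw sample_size marbles without replacement and return the proportion marked 1."""
    sample = rng.sample(urn, sample_size)
    return sum(sample) / sample_size


def simulate_urn(num_ones=600, num_zeros=400, sample_size=50,
                 repetitions=2000, seed=42):
    """Repeat the draw many times and summarize the sample proportions.

    All default values are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    urn = [1] * num_ones + [0] * num_zeros  # the access frame as an urn of marbles
    true_proportion = num_ones / len(urn)

    estimates = [draw_from_urn(urn, sample_size, rng) for _ in range(repetitions)]

    # Bias: how far the average estimate sits from the true proportion.
    bias = statistics.mean(estimates) - true_proportion
    # Variance: how much the estimates spread from one sample to the next.
    variance = statistics.pvariance(estimates)
    return true_proportion, bias, variance


true_proportion, bias, variance = simulate_urn()
print(f"true proportion:    {true_proportion:.3f}")
print(f"estimated bias:     {bias:+.4f}")
print(f"estimated variance: {variance:.5f}")
```

Because every marble is equally likely to be drawn, the estimated bias should be close to zero, while the variance reflects the chance variation from sample to sample. In dart board terms, the throws center on the bullseye but still scatter around it.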

These diagrams and models distill the key concepts you need to identify limitations and judge how useful your data are for answering your question. The next chapter continues to develop the urn model to quantify accuracy more formally and to design simulation studies.