The Data Science Lifecycle

To bring this section to a close, we present the data science pipeline from a slightly different angle. No single process fits every data science problem, so seeing another perspective on how such problems are solved will help round out your understanding of the field.


Data science is a rapidly evolving field. At the time of this writing, people are still trying to pin down exactly what data science is, what data scientists do, and what skills data scientists should have. What we do know, though, is that data science uses a combination of methods and principles from statistics and computer science to draw insights from data. We use these insights to make all sorts of important decisions. Data science helps assess whether a vaccine works, filter out spam from our email inboxes, and advise urban planners where to build new housing.

This book covers fundamental principles and skills that data scientists need to perform analyses. To help you remember the bigger picture, we've organized these topics around a workflow for analysis that we call the data science lifecycle. This chapter introduces the data science lifecycle. It also provides a map for the rest of the book by showing you where each chapter fits into the lifecycle. Unlike other books that focus on one part of the lifecycle, this book covers the entire cycle from start to finish. We explain theoretical concepts and show how they work in practical case studies. Throughout the book, we rely on real data from analyses by other data scientists, not made-up data, so you can learn how to perform your own data analyses and draw sound conclusions.

Fig. 1.1 This diagram of the data science lifecycle shows its four high-level steps. The arrows show how the steps lead into one another.


Figure 1.1 shows the data science lifecycle. It's split into four stages: asking a question, obtaining data, understanding the data, and understanding the world. We've made these stages very broad on purpose. In our experience, the mechanics of a data analysis change frequently. Computer scientists and statisticians continue to build new software packages and programming languages for analysis, and they develop new analysis techniques that are more accurate and specialized. Despite these changes, we've found that almost every data analysis follows the four stages in our lifecycle. The first is to ask a question.

Asking a Question

Asking questions lies at the heart of data science because different kinds of questions require different kinds of analyses. For example, "How have house prices changed over time?" is very different from "How will this new law affect house prices?" Understanding our research question tells us what data we need, the patterns to look for, and how we should interpret our results. In this book, we focus on three broad categories of questions: exploratory, inferential, and predictive.

Exploratory questions aim to find out information about the data that we have. For example, we can use environmental data to ask: have average global temperatures risen in the past 40 years? The key part of an exploratory question is that it aims to summarize and interpret trends in the data without quantifying whether these trends will hold in data that we don't have. "How many people voted in the last election?" is an exploratory question. "How many people will vote in the next election?" is not an exploratory question.

Inferential questions, on the other hand, do quantify whether trends found in our data will hold in unseen data. Let's say we have data from a sample of hospitals across the US. We can ask whether air pollution is correlated with lung disease for the individuals in our sample; this is an exploratory question. We can also ask whether air pollution is correlated with lung disease for the entire US; this is an inferential question, since we're using our sample to infer a correlation for the entire US.
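
As a minimal sketch of the exploratory half of this distinction, the snippet below computes the Pearson correlation between pollution and lung disease rates within a sample. The hospital values and the `pearson` helper are invented for illustration and are not real data. The result describes only these six hypothetical hospitals; answering the inferential question would additionally require quantifying how much the correlation could vary from one sample to another.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: one (pollution index, lung disease rate per 1,000)
# pair per hospital. These numbers are made up for illustration.
pollution = [12.0, 18.5, 25.1, 30.2, 41.7, 55.3]
disease_rate = [3.1, 3.8, 4.9, 5.2, 6.8, 8.1]

# Exploratory answer: the correlation among the hospitals we observed.
r_sample = pearson(pollution, disease_rate)
print(f"Correlation in our sample: {r_sample:.2f}")
```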

Note

Be careful not to confuse an inferential question with a question about causality. An inferential question asks whether a correlation exists. "Are people who are exposed to more air pollution more likely to develop lung disease?" is an inferential question. "Does air pollution cause lung disease?" is causal, not inferential. We typically cannot answer causal questions unless we have a randomized experiment (or assume one).

Predictive questions, like inferential questions, aim to quantify trends for unseen data. While inferential questions look for trends in the population, predictive questions aim to make predictions for individuals. An inferential question could ask: "What factors increase voter turnout in the US?" A predictive question could instead ask: "Given a person's income and education, how likely are they to vote?"
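
To make the predictive question concrete, here is a rough sketch that estimates how likely one person is to vote from the observed share of voters among people with the same traits. The survey records, category labels, and function names are all hypothetical. This lookup-table approach is only one simple way to make a prediction; a model such as logistic regression would instead share information across groups.

```python
from collections import defaultdict

# Hypothetical survey records: (education level, income bracket, voted?).
records = [
    ("college", "high", True),
    ("college", "high", True),
    ("college", "low", True),
    ("college", "low", False),
    ("high_school", "high", True),
    ("high_school", "high", False),
    ("high_school", "low", False),
    ("high_school", "low", False),
]

def fit_vote_rates(rows):
    """Estimate P(voted | education, income) as the observed share in each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number who voted, group size]
    for education, income, voted in rows:
        counts[(education, income)][0] += int(voted)
        counts[(education, income)][1] += 1
    return {group: n_voted / total for group, (n_voted, total) in counts.items()}

vote_rates = fit_vote_rates(records)

def predict_vote_probability(education, income):
    """Predicted chance that one person with these traits votes."""
    return vote_rates[(education, income)]

print(predict_vote_probability("college", "low"))
```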

As we do a data analysis, we often change and refine our research questions. Each time we do so, it's important to consider what kind of question we want to answer.

In the next section, we'll talk about how our question affects the data we want to obtain.

Obtaining Data

In this step of the data science lifecycle, we obtain our data and understand how the data were collected. One of our goals in this stage is to understand what kinds of research questions we can answer using the data that we have. In our lifecycle, data analyses can begin with asking a question (the previous stage) or with obtaining data (this stage). When data are expensive and hard to gather, we define a precise research question first and then collect the exact data we need to answer the question. Other times, data are cheap and easily accessed. This is especially true for online data sources. For example, the Twitter website lets people quickly download millions of data points. When data are plentiful, we can also start an analysis by obtaining data, exploring them, and then asking research questions.

When we obtain data, we write down how the data were collected and what information the data contain. This isn't just for bookkeeping: the types of research questions we can answer depend greatly on the way the data were collected.


Understanding the Data

After obtaining data, we want to understand the data we have. A key part of understanding the data is doing exploratory data analysis, where we often create plots to uncover interesting patterns and summarize the data visually. We also look for problems in the data. Most real-world datasets have missing values, implausible values, or other anomalies that we need to account for.
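
A check for such problems might look like the following sketch. The temperature readings, the `-999.0` sentinel, the plausible range, and the `find_problems` helper are all assumptions made up for this example; what counts as an anomaly depends on the dataset at hand.

```python
# Hypothetical records: daily temperature readings in degrees Celsius.
readings = [
    {"day": 1, "temp_c": 21.5},
    {"day": 2, "temp_c": None},     # missing value
    {"day": 3, "temp_c": 22.1},
    {"day": 4, "temp_c": -999.0},   # a sentinel code sometimes used for "no reading"
    {"day": 5, "temp_c": 20.8},
]

# Assumed plausible range for this made-up dataset.
PLAUSIBLE_RANGE = (-60.0, 60.0)

def find_problems(rows, field, plausible_range):
    """Return the days with a missing value and the days with an out-of-range value."""
    low, high = plausible_range
    missing, out_of_range = [], []
    for row in rows:
        value = row[field]
        if value is None:
            missing.append(row["day"])
        elif not (low <= value <= high):
            out_of_range.append(row["day"])
    return missing, out_of_range

missing_days, suspicious_days = find_problems(readings, "temp_c", PLAUSIBLE_RANGE)
print("Missing:", missing_days)
print("Out of range:", suspicious_days)
```

Flagging the problem rows rather than silently dropping them keeps the decision about how to handle them visible in the analysis.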

In our experience, this stage of the lifecycle is highly iterative. Understanding the data can lead to any of the other stages in the data science lifecycle. As we understand the data more, we often revise our research questions, or realize that we need to get data from a different source.

This stage incorporates both programming and statistical knowledge. To manipulate data, we write programs that clean data, transform data, and create plots. To find patterns and trends in the data, we use summary statistics and statistical models.

When our research questions are purely exploratory, we are only concerned about patterns in the data. In these cases, our analysis can end at this stage of the lifecycle. When our research questions are inferential or predictive, however, we proceed to the next stage of the lifecycle: understanding the world.


Understanding the World

In this stage of the lifecycle, we draw conclusions about our larger population, and sometimes even the world. The challenge is that our data are typically a relatively small sample compared to our population. To draw conclusions about the population, we use statistical techniques like A/B testing, confidence intervals, and simulation. We also use models like linear or logistic regression. This stage is relevant when our research questions are inferential or predictive. Our goal in this stage of the lifecycle is to quantify how well we think the trends we find in our sample can generalize.
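
One way to combine simulation and confidence intervals is the percentile bootstrap, sketched below using only the Python standard library. This is an illustrative choice rather than a method the text prescribes, and the sample values, the `bootstrap_ci` name, and its default parameters are assumptions. The idea is to resample the data with replacement many times and see how much the sample mean varies across resamples.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the population mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Mean of each resample, drawn with replacement from the original sample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample of 12 measurements drawn from a much larger population.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0, 4.9, 5.5, 4.6, 5.2]

low, high = bootstrap_ci(sample)
print(f"Sample mean: {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval for the population mean: ({low:.2f}, {high:.2f})")
```

The width of the interval is a rough measure of how far the sample mean might sit from the population mean, under the assumption that the sample is representative of the population.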


Source: Sam Lau, Joey Gonzalez, and Deb Nolan, https://www.textbook.ds100.org/intro.html
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.