The Data Science Lifecycle

To bring this section to a close, we present the data science lifecycle from a slightly different perspective. There is no one-size-fits-all process for a data science problem, so it helps to see the process of solving one from another angle. In this way, you can round out your understanding of the field.

Questions and Data Scope

Big Data and New Opportunities

The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like traditional beat reporters hunt for news stories. The data lifecycle for the data journalist begins with a search for existing data that might hold an interesting story, rather than with a research question and a plan for how to collect new data or use existing data to address it.

Citizen science projects are another example. They engage many people (and instruments) in data collection. These data are typically made available to the researchers who organize the project, and they are often deposited in repositories for the general public to investigate further.

The availability of administrative and organizational data creates other opportunities. Researchers can link data collected in scientific studies with, say, medical data that were collected for healthcare purposes; in other words, administrative data collected for reasons that don't directly stem from the question of interest can still be useful in other settings. Such linkages help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, your posts on social media, and your online network of friends and acquaintances, and these traces can be quite complex.
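As a minimal sketch of what such a linkage might look like in practice (the column names and values below are purely hypothetical), a join on a shared identifier also doubles as a quality check: records that fail to match point to gaps in one source or the other.

```python
import pandas as pd

# Hypothetical example: link study measurements to administrative
# medical records through a shared (de-identified) participant key.
study = pd.DataFrame({
    "participant_id": [101, 102, 103],
    "survey_score": [3.2, 4.1, 2.8],
})
admin = pd.DataFrame({
    "participant_id": [101, 103, 104],
    "num_hospital_visits": [0, 2, 1],
})

# A left join keeps every study participant; the indicator column flags
# participants with no matching administrative record, a simple quality check.
linked = study.merge(admin, on="participant_id", how="left", indicator=True)
print(linked)
```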

When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional, smaller research studies. We might even consider these large datasets a replacement for scientific studies or essentially a census. This overreach is referred to as "big data hubris." A large scope does not mean we can ignore foundational issues of how representative the data are, nor issues of measurement, dependency, and reliability. One well-known example is the Google Flu Trends tracking system.


Example: Google Flu Trends

Digital epidemiology, a new subfield of epidemiology, leverages data generated outside the public health system to study patterns of disease and health dynamics in populations. The Google Flu Trends (GFT) tracking system was one of the earliest examples of digital epidemiology. In 2007, researchers found that counting the searches people made for flu-related terms could accurately estimate the number of flu cases. The finding made headlines and fueled excitement among researchers about the possibilities of big data. However, GFT did not live up to expectations and was abandoned in 2015.

What went wrong with GFT? After all, it used millions of digital traces from online queries for terms related to influenza to predict flu activity. Despite initial success, in the 2011–2012 flu season Google's data scientists found that GFT was not a substitute for the more traditionally collected data in the Centers for Disease Control and Prevention (CDC) surveillance reports, which are compiled from laboratories across the United States. GFT overestimated the CDC numbers in 100 of 108 weeks (see Figure 2.1). Week after week, GFT came in too high for the cases of influenza, even though it was based on big data.

Fig. 2.1 Google Flu Trends (GFT) weekly estimates for influenza-like illness. In 100 of 108 weeks, GFT (solid line) overestimated the actual CDC reports (dashed line). Also plotted are predictions from a model based on 3-week-old CDC data and seasonal trends (dotted line).


Data scientists found that GFT was not a substitute for more traditionally collected data from the CDC. A simple model built from past CDC reports, using 3-week-old CDC data and seasonal trends, did a better job of predicting flu prevalence than GFT. That is, GFT overlooked considerable information that could be extracted with basic statistical methods. This does not mean that big data captured from online activity is useless. In fact, researchers have shown that combining GFT data with CDC data can substantially improve on both the GFT predictions and the CDC-based model [Lazer et al., 2014, Lazer, 2015]. It is often the case that combining different approaches leads to improvements over the individual methods.
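To make the comparison concrete, here is a rough sketch (not the actual model from Lazer et al.) of the kind of baseline described above: regress current flu activity on the 3-week-old CDC rate plus simple seasonal terms, then add the GFT signal as an extra predictor to combine the two sources. The weekly data below are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic weekly data: CDC influenza-like-illness (ILI) rates and a
# GFT-style signal that runs systematically high, mimicking the overestimates.
rng = np.random.default_rng(0)
weeks = np.arange(200)
cdc = 2 + np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.2, weeks.size)
gft = cdc + rng.normal(0.5, 0.4, weeks.size)

df = pd.DataFrame({"week": weeks, "cdc": cdc, "gft": gft})
df["cdc_lag3"] = df["cdc"].shift(3)                    # 3-week-old CDC report
df["sin52"] = np.sin(2 * np.pi * df["week"] / 52)      # simple seasonal terms
df["cos52"] = np.cos(2 * np.pi * df["week"] / 52)
df = df.dropna()

def fit_ols(X, y):
    """Ordinary least squares via numpy; returns the fitted coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Baseline: lagged CDC data plus seasonal trend.
baseline = fit_ols(df[["cdc_lag3", "sin52", "cos52"]].to_numpy(), df["cdc"].to_numpy())

# Combined: the same baseline with the GFT signal as one more predictor.
combined = fit_ols(df[["cdc_lag3", "sin52", "cos52", "gft"]].to_numpy(), df["cdc"].to_numpy())
print(baseline, combined)
```

In this toy setup, the combined fit simply gives the model access to both the lagged CDC report and the (biased) GFT signal; with real data, one would compare the out-of-sample errors of the two fits to judge whether the extra source helps.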

The GFT example shows us that even when we have tremendous amounts of information, the connections between the data, the topic of investigation, and the question being asked are paramount. Understanding this framework can help us avoid answering the wrong question, applying inappropriate methods to the data, and overstating our findings.

In the age of big data, we are tempted to collect more and more data. After all, a census gives us perfect information, so shouldn't big data be nearly perfect? A key factor to keep in mind is the scope of the data. What population do we want to study? How can we access information about that population? Who or what are we actually studying? Answers to these questions help us see potential gaps in our approach. This is the topic of the next section.