The Data Science Pipeline

Now that you have some terminology and methods under your belt, we can begin to put together an understanding of a typical data science pipeline from beginning to end. Data usually comes in a raw form and so it must be curated and prepared. This is the process of data engineering. At this point, data analysis techniques such as visualization and statistical analyses should give some sense of what relationships exist within the data. Hence, the next step is to derive a model for the data (by building statistical models or applying machine learning, for example). This process is repeated and refined until quantifiable measures of success are met.

Data is no longer scarce. In fact, businesses have an abundance of data and it's growing. This has given rise to the term Big Data. Data science enables businesses to discover valuable insights in data and apply them profitably. Data science is therefore complementary to Big Data.

Historically, statisticians had a mathematical focus. They evolved into data analysts who applied their expertise to solving business problems. They did this by visualizing data and searching for patterns. When dealing with vast amounts of data, there was a need to apply Machine Learning algorithms and programming. This is where a data scientist comes in.

A data scientist is really a first-class scientist who's curious, asks questions and makes hypotheses that can be tested with data.


Discussion

  • What exactly is the definition of the term "Data Science"?


What is a Data Scientist? Source: Gualtieri 2013.

One possible definition is that "data science is a multifaceted discipline, which encompasses machine learning and other analytic processes, statistics and related branches of mathematics, increasingly borrows from high performance scientific computing, all in order to ultimately extract insight from data and use this new-found information to tell stories".

At a high level, "data science is the study of the generalizable extraction of knowledge from data". It's a combination of multiple disciplines that have been around for decades.

Data Science Association's Professional Code of Conduct states that a "Data Scientist means a professional who uses scientific methods to liberate and create meaning from raw data".


What's a typical Data Science process?

The Data Science process. Created at Harvard by Joe Blitzstein and Hanspeter Pfister. Source: Venturi 2017.

In science, one starts with a hypothesis, conducts experiments and makes observations to either prove or disprove the hypothesis. In Data Science, the scientific process is similar except that the use of data and algorithms becomes central to the process.

The process starts with an interesting question, often aligned to business goals. Available data is then cleaned and filtered. This may also involve collecting new data relevant to the question. The data is analyzed to discover patterns and outliers. A model is built and validated, often using machine learning algorithms. The model is often refined in an iterative manner. The final step is to communicate the results. The results may inspire the data scientist to ask and investigate further questions.


What's a typical Data Science pipeline?


The Data Science pipeline. Source: Jones 2018.

The data science pipeline may be treated as that part of the data science process that deals specifically with data. It starts with the gathering of raw data, processing it, analyzing it via algorithms and finally visualizing the results. Thus, the pipeline basically transforms data into useful insights.

An important aspect of this pipeline is data engineering. It can be broken down into three steps (though the terminology isn't standardized):

  • Data Wrangling: Raw data is cast to a form suitable for analysis. This could involve combining multiple datasets, removing inconsistencies, converting datasets to a common format, etc.
  • Data Cleansing: Real-world data is messy, with missing values, bad delimiters or inconsistent records. Cleansing checks data for syntactic and semantic correctness and repairs it where possible. Data points that cannot be repaired may be dropped.
  • Data Preparation: This makes the data suitable as input to algorithms. This may involve range normalization, conversion of categorical data to numerical values, etc.
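As an illustrative sketch of the three steps above, consider this plain-Python toy example (the dataset, field names and plan codes are all invented for illustration):

```python
# Data Wrangling: combine two hypothetical datasets into a common format.
store_a = [{"name": "Ann", "age": "34", "plan": "gold"}]
store_b = [{"customer": "Bob", "age": "n/a", "plan": "silver"},
           {"customer": "Cho", "age": "41", "plan": "gold"}]
records = store_a + [{"name": r["customer"], "age": r["age"], "plan": r["plan"]}
                     for r in store_b]

# Data Cleansing: drop records whose age cannot be repaired.
clean = [r for r in records if r["age"].isdigit()]

# Data Preparation: range-normalize age and encode the categorical plan.
ages = [int(r["age"]) for r in clean]
lo, hi = min(ages), max(ages)
plan_codes = {"silver": 0, "gold": 1}
prepared = [
    {"age": (int(r["age"]) - lo) / (hi - lo), "plan": plan_codes[r["plan"]]}
    for r in clean
]
print(prepared)  # [{'age': 0.0, 'plan': 1}, {'age': 1.0, 'plan': 1}]
```

In practice these steps are done with libraries such as Pandas, but the logic is the same: unify formats, repair or drop bad records, then encode everything numerically.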

When building ML models, the typical approach is to partition available data into training and testing datasets. The former is used for learning and the latter is used for validation.
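A minimal sketch of such a partition, in plain Python (the 80/20 ratio here is a common convention, not a fixed rule):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Randomly partition data into (training, testing) sets."""
    items = list(data)
    random.Random(seed).shuffle(items)   # shuffle to avoid ordering bias
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Libraries such as scikit-learn provide ready-made equivalents, but the essential idea is that the model never sees the test set during learning.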


Could you give some examples of questions that data science answers?


Here are a few examples: Will this tire fail in the next 1000 miles? Is this bank transaction normal? What will be my sales for the next quarter? What sort of customers are not coming back to my store? Which printer models are failing the same way? As a self-driving car, should I now slow down or brake completely?

In the real world, data science has been successfully used by LinkedIn to increase growth in user connections. Google uses it in a number of products. GE uses it to optimize service contracts. Netflix uses it to improve movie recommendation. Kaplan uses it to uncover effective learning strategies. All these are a result of data scientists asking the right questions.


How is a data scientist different from a data analyst/architect/engineer?

Data Science is a mix of many disciplines. Source: Barber 2018.

A data scientist is multidisciplinary in terms of skills and expertise. She may embody some of the other related roles:

  • Data Analyst: Collects relevant data, visualizes data with various tools and tries to find patterns and insights. Knows basic statistics. Has business/domain knowledge. Probably doesn't deal with big data.
  • Data Architect: Architects a system to manage big data. Often this role is embodied within a data engineer since tools and technologies overlap. This role may become redundant with MLaaS (Machine Learning as a Service).
  • Data Engineer: Develops and manages infrastructure that deals with big data. Well versed with tools such as Hadoop, NoSQL and MapReduce. Sets up data pipelines.

Where a data scientist stands out is in her use of ML algorithms, which requires both statistics and computational skills. A data scientist augments these skills with her ability to deal with large datasets and domain knowledge.


Should a data scientist start with a problem statement or explore available data?

If you're new to data science, without much domain knowledge, defining a problem statement can be difficult. In such a case, you could start with exploratory analysis. This can then guide you towards asking the right questions.

Given enough data, exploration is likely to yield patterns and correlations. These could even occur due to measurement errors or data processing artifacts. But are these findings relevant? Asking the right questions and defining a problem statement will give better focus. Some even claim that lack of a problem definition could lead to disaster because you don't know what you're looking for.
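This risk can be demonstrated with a small simulation in plain Python (all numbers invented): every "feature" below is pure noise, yet exploring enough of them will surface one that correlates noticeably with the target just by chance.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
target = [rng.random() for _ in range(30)]                      # 30 random observations
features = [[rng.random() for _ in range(30)] for _ in range(200)]  # 200 random features

# The strongest correlation found is a "pattern" discovered in pure noise.
best = max(abs(pearson(f, target)) for f in features)
print(round(best, 2))
```

This is why a problem statement matters: it distinguishes a relationship worth modeling from one that exploration was bound to stumble upon anyway.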


What skills must a data scientist have?

Skills for today's data scientist. © Krzysztof Zawadzki. Source: CISELab 2018.

Data scientists are required to be multidisciplinary with knowledge and expertise in statistics, programming, application domain, analytics and communication. However, unicorns who have all of these are rare.

Specializing in a couple of areas, with exposure to others, is often more realistic. Companies don't rely on a single all-knowing data scientist; they form a data science team. A data science team may include the Chief Data Officer, business analyst, data analyst, data scientist, data architect, data engineer and application engineer. Some of these may be combined in a single person. For example, a single person may fulfil the roles of data architect and data engineer.

Anyone with strong data and computational abilities can do well as a data scientist. An essential skill is to turn unstructured data into a form suitable for analysis. This is not something that a traditional quantitative analyst can do.

Technical skills must be complemented with business acumen, creativity and reasoning. This will help a data scientist ask relevant questions, assess the suitability of available data and present results the right way.


Should a data scientist learn cloud computing?

Data science workflows typically happen on a local computer. There are, however, scenarios where cloud computing makes sense: the dataset could be too large for local memory; local computational capability may be insufficient for the problem; or the workflow's output may feed into a larger production environment.

When dealing with large datasets, a data scientist needs to get familiar with cloud technologies, platforms and tools. This may include storage, running database queries and managing Apache Spark clusters.


As a beginner, what should be my learning path to become a data scientist?

One approach is to be practical and hands-on from the outset. Pick a topic you're passionate and curious about. Research available datasets. Tweet and discuss so that you get clarity. Start coding. Explore. Analyze. Build data pipelines for large datasets. Communicate your results. Repeat this with other datasets and build a public portfolio. Along the way, pick up all the skills you need.
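The "start coding, explore, analyze" step can begin very simply. A hypothetical first exploration of a tiny invented dataset, using only Python's standard library:

```python
import statistics

# Invented daily sales figures for a small store, one week.
daily_sales = [120, 135, 110, 150, 145, 160, 90]

print("mean:  ", statistics.mean(daily_sales))              # mean:   130
print("median:", statistics.median(daily_sales))            # median: 135
print("stdev: ", round(statistics.stdev(daily_sales), 1))   # stdev:  24.7
```

Even summaries this basic prompt questions (why was one day so low?) that drive the next round of analysis.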

You may instead prefer a more formal approach. You can learn the basics of languages such as R and Python. Follow this with additional packages/libraries particular to data science: (R) dplyr, ggplot2; (Python) NumPy, Pandas, matplotlib. Get introduced to statistics. From this foundation, start your journey into Machine Learning. To relate these to business goals, some recommend the book Data Science for Business by Provost and Fawcett. But you should put all this knowledge into practice by taking up projects with datasets and problems that interest you.

At the University of Wisconsin, statistics is covered first before programming. To become inter-disciplinary, you may choose to learn aspects of data engineering (data warehousing, Big Data) and ethics.


Could you give some tips for a budding data scientist?

The following tips might help:

  • Data science is about answering scientific questions with the help of data. Don't focus just on the aspect of handling data, dataset size or the tools.
  • Understand the business, its products, customers and strategies. This will help you ask the right questions. Have constant interaction with business counterparts. Communicate with them in a language they can understand.
  • Consider alternative approaches before selecting one that suits the problem. Likewise, select a suitable metric. Sometimes derived metrics may yield better prediction compared to available metrics.
  • Understand the pros and cons of various ML algorithms before selecting one for your problem.
  • Craft machine learning models from scratch. Don't just rely on premade templates and libraries. Test them to their limits to understand what's going to work, and where.
  • Find a compromise between speed and perfection. On-time delivery should be preferred over extreme accuracy.
  • Useful data is more important than lots of data. Use multiple data sources to better understand data and its discrepancies.
  • Be connected with the data science community, be it via blogs, meetups, conferences or hackathons.
  • Practice with open datasets. Learn from the solutions of others.


Milestones

1962

John W. Tukey publishes "The Future of Data Analysis". He explains that statistics has mostly been about making inferences but his interest is in data analysis, which has more to do with science than mathematics. The availability of computers makes data analysis possible. His influential paper is sometimes today referred to as FoDA.

1974

The term data science is used for the first time, by Peter Naur in his "Concise Survey of Computer Methods". He defines it as the "science of dealing with data". His definition does not consider data semantics (domain knowledge). Thus, it's different from the modern definition of the term. An alternative term datalogy is also used.

1976

John Chambers at Bell Labs creates the programming language S. This lays the basis for statistical computing and quantitative programming environments (QPE) that use scripts and workflows. In the 1990s, S inspires the creation of an open source language called R, which is today the dominant QPE.

1977

The International Association for Statistical Computing (IASC) is formed. This underscores the increasing use of computing in statistical work "to convert data into information and knowledge". The same year John Tukey publishes "Exploratory Data Analysis", where he states that we should use data to form hypotheses to test. Exploratory Data Analysis and Confirmatory Data Analysis should both be used.

1989

The first Knowledge Discovery in Databases (KDD) conference is held. By the mid-1990s, this evolves into ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Researchers clarify that "KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data".

1997

Professor C. F. Jeff Wu calls for statistics to be renamed data science and statisticians to be renamed data scientists. The same year the journal Data Mining and Knowledge Discovery is launched.

2001

William S. Cleveland publishes "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". He proposes the term Data Science in its modern sense. To Cleveland, a data analyst is good at programming but has limited knowledge of statistics. A data scientist on the other hand comes from statistics background but has to work more closely with computer specialists.

2001

Leo Breiman publishes "Statistical Modeling: The Two Cultures". He compares Generative Modeling with Predictive Modeling. The former is dominant among statisticians. Breiman calls on them to adopt predictive modeling and algorithms, which have been developed in other fields.

2008

The term Data Science attracts wide interest thanks to the work of D. J. Patil (LinkedIn) and Jeff Hammerbacher (Facebook).

2009

Google's chief economist, Hal Varian, states that data is plenty but there's a scarcity of experts who can extract value from it. The sexy job for the next decade will be statisticians. He states his expectation of a data scientist,

The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that’s going to be a hugely important skill in the next decades.

2010

By the start of this decade, researchers and writers attempt to explain data science to the public. Data scientist is claimed to be the sexiest job of the 21st century. This decade also sees shifting terminology. Data Mining is now referred to as Machine Learning. The work of a data analyst is called Business Intelligence but when she uses big data it's called Big Data Analytics.

Sep 2015

The University of Michigan announces a $100 million Data Science Initiative (DSI) to hire 35 new faculty. Its press release says, "Data science has become a fourth approach to scientific discovery, in addition to experimentation, modeling, and computation".

2017

Data science hierarchy of needs. Source: Rogati 2017b.

Monica Rogati puts AI at the top of a pyramid that signifies data science hierarchy of needs. She makes the point that companies can't have a successful AI strategy without basic data literacy, collection and infrastructure. It's also not a good idea to over-engineer the infrastructure. Instead, build an MVP spanning all layers of the pyramid and then scale horizontally.


Source: Devopedia, https://devopedia.org/data-science
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

Last modified: Thursday, March 14, 2024, 3:39 PM