CS250 Study Guide

Unit 1: What is Data Science?

1a. Explain what data science is

  • What is data science?
  • What disciplines are associated with data science?
  • What types of data does data science deal with?
  • What is data engineering?

Data science is the field of collecting, handling, and analyzing data in order to extract knowledge from it. Although the term "data science" was only coined in 2001, the field's roots reach much further back. As the ability to store and operate on large databases increased, it became clear that a convergence of many different disciplines was required to draw conclusions from large, possibly distributed datasets. Hence, data science requires overlapping expertise in methods drawn from computer science, mathematics, and statistics. At its core, it is directly related to the scientific method, as it describes the process of formulating a hypothesis, acquiring the necessary data, performing data analysis, and, finally, drawing conclusions or making predictions. Applying the scientific method requires logical reasoning, which can be divided into two broad categories: deductive and inductive. For example, conclusions may be drawn from a hypothesis test that attempts to quantify the deviation from a null hypothesis, a proposition that reflects current understanding.
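The hypothesis-testing idea above can be sketched with a small simulation. This is a minimal illustration, not part of the course material: the function name and the fair-coin scenario are assumptions chosen for clarity. It estimates a two-sided p-value for the null hypothesis "the coin is fair" by simulating many fair-coin experiments and counting how often they deviate from the expected count at least as much as the observed data did.

```python
import random

def coin_flip_p_value(heads, flips, trials=10_000, seed=0):
    """Estimate a two-sided p-value for the null hypothesis that a
    coin is fair, given `heads` successes out of `flips` tosses.

    The null distribution is approximated by simulation: flip a fair
    coin `flips` times, repeat `trials` times, and count how often the
    simulated deviation from flips/2 matches or exceeds the observed one.
    """
    rng = random.Random(seed)
    observed_dev = abs(heads - flips / 2)
    extreme = 0
    for _ in range(trials):
        sim_heads = sum(rng.random() < 0.5 for _ in range(flips))
        if abs(sim_heads - flips / 2) >= observed_dev:
            extreme += 1
    return extreme / trials

# 60 heads in 100 flips: a small p-value would cast doubt on fairness
print(coin_flip_p_value(60, 100))
```

A small p-value quantifies how surprising the observed data would be if the null hypothesis were true, which is exactly the deviation the paragraph above describes.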

It is important to be aware of the many types of data, such as video, audio, images, text, numbers, and so on. When embarking upon a data science project, data may be converted or transformed from its raw form into a different form that lends itself to a specific analysis technique. Data engineering is the aspect of data science that deals specifically with collecting, curating, storing, and retrieving data. In some sense, data engineering is the initial point from which all other analyses follow.

To review, see A History of Data Science.

 

1b. Explain data analysis and data modeling methodologies

  • What are the essential aspects of the data science life cycle?
  • What is a data science model?
  • What basic modeling and analysis methodologies should every data scientist know about?

The data science life cycle emphasizes the reality that, during data analysis and modeling, there may not be a perfectly straight line between input data and output results. Since conclusions are not known a priori, it is often necessary to take initial results, refine them, and then reassess the analysis or modeling methodology. Along the data life cycle journey, it may become necessary to build a model; hence, there are basic approaches that every data scientist needs to be aware of. A model is a data science methodology for representing a system: it takes input data and generates outputs consistent with what is expected from the dataset(s) under consideration. Data models that are statistical in their approach are usually constructed from samples taken from a population. In this class of models, it is important to understand concepts such as target population, access frame, and sampling. A major reason for constructing this class of models is to reduce statistical bias, which is the systematic difference between what your model estimates and what is actually true of the population.

Linear models are another important class of models; they can be either statistical (such as regression) or deterministic (such as the method of least squares), but the main goal is to identify the best straight line that explains the data. Probabilistic models can be useful for generating and analyzing random datasets. A good example is the urn model, which analyzes the process of drawing indistinguishable marbles from an urn.

To review, see The Data Science Lifecycle.

 

1c. Explain techniques for approaching data science

  • How do various disciplines view data science problems?
  • Why is visualization important to data science?
  • How do techniques from optimization theory play a role in data science?

As data science involves the intersection of many disciplines, it is important to understand how individuals from these disciplines view data science problems. For example, mathematicians might view a problem in terms of theorems and equations; on the other hand, computer scientists might think of numerical methods and algorithms. Visualization and data rendering in data science are essential because a picture truly can be worth a thousand words. Rendering data in two or three dimensions (using, for example, a heatmap) can often reveal correlations within a dataset via immediate visual inspection.

Statisticians often think in terms of sampling, where statistical tests are a function of the data available. For example, probability sampling is a general term for selecting a sample from a population. Simple random sampling is a type of probability sampling where a subset of participants is chosen randomly from a population. For larger populations, cluster sampling is a type of probability sampling where a population is divided into smaller groups or "clusters". A sample is then formed by randomly selecting from among the clusters. Data scientists must know how to sample a population for proper experimental design.
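The two sampling schemes above can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs (a numbered population and hypothetical "street" clusters); it is not drawn from the course material itself.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Simple random sampling: every subset of size k from the
    population is equally likely to be chosen."""
    return random.Random(seed).sample(population, k)

def cluster_sample(clusters, n_clusters, seed=0):
    """Cluster sampling: randomly select whole clusters, then pool
    every member of each chosen cluster into the sample."""
    rng = random.Random(seed)
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

people = list(range(100))
print(simple_random_sample(people, 5))

# Hypothetical clusters, e.g. households grouped by street
streets = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(cluster_sample(streets, 2))
```

Note the design difference: simple random sampling selects individuals directly, while cluster sampling selects groups and then takes everyone inside them, which is often cheaper for large, geographically spread populations.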

Optimization theory is a field necessary for implementing data science techniques. Although this fact is not always pointed out at the introductory level, data science results often come as the output of some objective optimization criterion. For example, critical points of a function, such as maxima or minima, are found where its slope (derivative) equals zero. Additionally, objective measures such as mean squared error and loss functions are regularly applied in machine learning techniques to measure the performance of a model.
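Both ideas in the paragraph above, a loss function and finding the point where the slope is zero, can be sketched briefly. This is an illustrative example only; the quadratic f(x) = (x - 3)^2 and the function names are assumptions chosen for simplicity.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions,
    a standard loss function for measuring model performance."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def minimize_quadratic(lr=0.1, steps=100):
    """Gradient descent on f(x) = (x - 3)**2. The minimum sits where
    the slope f'(x) = 2*(x - 3) equals zero, i.e. at x = 3; each step
    moves x downhill against the slope."""
    x = 0.0
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return x

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
print(minimize_quadratic())  # converges toward 3.0
```

A perfect prediction gives zero loss, and gradient descent converges to exactly the critical point where the slope vanishes, tying the loss-function and slope-equals-zero ideas together.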

To review, see A History of Data Science, The Data Science Lifecycle, and Thinking about the World.

 

Unit 1 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam. 

  • cluster sampling
  • data engineering
  • data science
  • hypothesis testing
  • indistinguishable
  • logical reasoning
  • loss function
  • model
  • random sampling
  • sample
  • scientific method
  • slope
  • statistical bias
  • visualization