This unit will introduce you to the field of data science. Before delving into the programming aspects of the course, it is important to have a clear view of what data science is. There are many techniques and computational methodologies for dealing with data science problems. The goal of this unit is to help put the rest of the course in context and help you understand how to conceptually organize various facets of the field.
When attempting to solve a data science problem, the overarching goal is to derive inferences and draw conclusions based on existing data sets. Such inferences are made through statistical, computational, and visualization techniques. Furthermore, even before computations can be made, data sets must often be curated and refined. This unit will help you to order and categorize your thinking to understand the flow of data science processes.
Completing this unit should take you approximately 7 hours.
When learning any new field, context is everything; therefore, we will begin this course by introducing the history of data science. In this way, you will be able to understand how the field became an amalgamation of various areas of science dealing with data in many different forms. In this section, and as the course continues, pay close attention to the various ways data can be represented and analyzed.
The field of data science is quite diverse. Before getting into the technical details of the course, it is important to gain some perspective on how the pieces fit together. As you go through this section, remember that we are driving toward the nexus of coding implementations (in Python) for data analysis and modeling. As the course progresses, Python implementations will require a mixture of mathematical and visualization techniques. For now, use this introduction to order your understanding of the field. Watch the first 1 minute and 40 seconds of this video.
As you immerse yourself in this introductory phase of the course, you will transition from a qualitative understanding of concepts to a more quantitative one. This step involves seeing examples of what real data looks like, how it is formatted, and various approaches to analyzing it using mathematics and visualization. If this section is truly doing its job, you should ask yourself: "How might the formats and analyses presented be implemented using a programming language?" We will gradually answer this question as we go deeper into the course.
Now that you have some terminology and methods under your belt, we can begin to put together an understanding of a typical data science pipeline from beginning to end. Data usually arrives in raw form, so it must first be curated and prepared; this is the process of data engineering. Once the data is prepared, data analysis techniques such as visualization and statistical analysis should give some sense of what relationships exist within the data. The next step is to derive a model for the data (for example, by building statistical models or applying machine learning). This process is repeated and refined until quantifiable measures of success are met.
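The pipeline described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used later in the course: the records in `RAW_ROWS`, the field names `hours` and `score`, and the helper names `prepare`, `fit_line`, and `r_squared` are all hypothetical, and a straight-line least-squares fit stands in for whatever model a real problem would call for.

```python
from statistics import mean

# Hypothetical raw records, invented for illustration. Values arrive as
# strings, and two rows are unusable (one missing, one malformed).
RAW_ROWS = [
    {"hours": "1", "score": "52"},
    {"hours": "2", "score": "55"},
    {"hours": "", "score": "60"},     # missing value
    {"hours": "3", "score": "n/a"},   # malformed value
    {"hours": "4", "score": "57"},
    {"hours": "5", "score": "60"},
]


def prepare(rows):
    """Data engineering: keep only rows whose fields parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["hours"]), float(row["score"])))
        except ValueError:
            continue  # drop rows that cannot be converted
    return clean


def fit_line(points):
    """Modeling: least-squares fit of score = slope * hours + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def r_squared(points, slope, intercept):
    """Evaluation: fraction of the variance in score explained by the line."""
    y_bar = mean(y for _, y in points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - y_bar) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    data = prepare(RAW_ROWS)                 # 1. curate and prepare
    print("usable rows:", len(data))         # 2. a first look at the data
    slope, intercept = fit_line(data)        # 3. derive a model
    score = r_squared(data, slope, intercept)
    print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={score:.3f}")
    # 4. In practice, steps 1-3 would be repeated and refined until a
    #    chosen measure of success (here, R^2) is judged good enough.
```

Each function corresponds to one stage of the pipeline, which makes it easy to revisit a single stage (for example, a different cleaning rule or a different model) without rewriting the rest.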
To bring this section to a close, we will present the data science pipeline from a slightly different perspective. There is rarely a "one size fits all" approach to a given data science problem, so seeing the process of solving one described in more than one way will help you round out your understanding of the field.
With the materials introduced so far, you are now in a position to consider what interests you most about data science. In this course, you will have a chance to involve yourself in data analysis, modeling, engineering, and mechanics. Doing so will require you to work quantitatively, using Python as the tool for implementation.
As we approach the end of this introductory unit, it is important to tie up any loose ends. There are practical aspects to working within the field of data science. For instance, what conferences do data scientists attend, and where do data scientists "hang out" together? What is it like to present data science findings to colleagues in your organization? Use this video to get a better sense of what the field is about.
There are two major approaches to data science: analytical mathematics (including statistics) and visualization. These two categories are not mutually exclusive. However, mathematical analysis would be considered more of a "left-brain" approach, while visualization would reflect a more "right-brain" approach. Both are powerful approaches for analyzing data, and we should not choose one to the exclusion of the other. Visualization is a sensible vehicle for introducing the field because data relationships become immediately apparent to the naked eye. Use the materials in this section to compare and contrast analytic approaches versus visualization approaches. In this course, we will try to strike a healthy balance between the two.
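To make the contrast between the two approaches concrete, here is a small sketch that looks at the same data both ways. It is an illustration only: the two samples are invented for this example, the helper names `summarize` and `text_histogram` are hypothetical, and a plain-text histogram stands in for a real plotting library so that the code needs nothing beyond the Python standard library.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical samples, invented for illustration. Both have the same mean.
SYMMETRIC = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
SKEWED = [2, 2, 2, 2, 3, 3, 4, 8, 11, 13]


def summarize(values):
    """Analytical view: describe the sample with summary statistics."""
    return {"mean": mean(values), "median": median(values)}


def text_histogram(values):
    """Visual view: one row per integer value, one '#' per observation."""
    counts = Counter(values)
    rows = []
    for v in range(min(values), max(values) + 1):
        rows.append(f"{v:>2} | {'#' * counts[v]}")
    return "\n".join(rows)


if __name__ == "__main__":
    for name, sample in (("symmetric", SYMMETRIC), ("skewed", SKEWED)):
        print(name, summarize(sample))
        print(text_histogram(sample))
        print()
```

The means of the two samples are identical, so a reader who looked only at that one number would miss the difference in shape. The medians and the histograms both reveal it, which is one reason the two approaches are best used together.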
Take this assessment to see how well you understood this unit.