This course attempts to strike a balance between presenting the vast set of methods within the field of data science and Python programming techniques for implementing them. Problem-solving and programming implementation will be emphasized throughout the course. All techniques presented will be introduced using real-world programming examples. A major goal of the course is to ensure that when you finish the course, you will have the programming and conceptual expertise you need to join the field of data science.
Several Python modules such as pandas, scikit-learn, scipy.stats, and statsmodels will be introduced that are useful for data analysis, data visualization, and data mining. The course will gradually shift from introductory topics such as a review of Python, matrix operations, and statistics to applications and implementing programs involving data mining, visualization, statistical models, and time series analysis.
This unit will introduce you to the field of data science. Before delving into the programming aspects of the course, it is important to have a clear view of what data science is. There are many techniques and computational methodologies for dealing with data science problems. The goal of this unit is to help put the rest of the course in context and help you understand how to conceptually organize various facets of the field.
When attempting to solve a data science problem, the overarching goal is to derive inferences and draw conclusions based on existing data sets. Such inferences are made through statistical, computational, and visualization techniques. Furthermore, even before computations can be made, data sets must often be curated and refined. This unit will help you to order and categorize your thinking to understand the flow of data science processes.
Completing this unit should take you approximately 7 hours.
This unit will introduce the Python IDE we will use in this course. We will also introduce installing Python modules relevant to upcoming units. The primary goals of this unit are to ensure that all required software is ready to run and to review the Python programming language.
Developing expertise in data science requires understanding drawn from a breadth of different subjects such as numerical methods, matrix computations, statistics, data processing, visualization, data mining, and statistical modeling. Python is an excellent vehicle for developing this expertise because of the availability of various modules capable of addressing these topics. In this course, we will use the numpy module to introduce numerical methods and matrix computations. The pandas module (built upon the numpy module) will be used for data processing and visualization. Basic statistical calculations will be accomplished using scipy and pandas. Data rendering and visualization applications will be addressed using the matplotlib and seaborn modules. Finally, data mining and statistical modeling will be introduced using scikit-learn and statsmodels. Mastering data science in the context of these Python modules will position you to delve into deeper subjects such as machine learning and deep learning.
You should leave this unit being able to write Python programs that can perform basic computational and data processing tasks. We will discuss core concepts such as data types, operators, functions, conditional statements, loops, and file handling. In addition, we will also give examples of Python data structures, such as lists and dictionaries, as well as basic object-oriented syntax. Understanding these data structures will enable you to implement basic plotting and data rendering instructions using the matplotlib module. This abbreviated (yet thorough) set of topics will serve as the programming vocabulary you will need to complete the course.
Completing this unit should take you approximately 6 hours.
The vast majority of coding examples in this course will be phrased in the form of a Python notebook. Google Colaboratory (or, just Google Colab) is an online interface for creating Python notebooks. It is extremely convenient because a vast set of Python modules (and, in particular, the modules used in this course) are already installed and, once imported, are ready for use.
Assuming you have a Gmail account and are logged in, click the link provided to run Google Colab notebooks. In the menu bar, choose
After doing so, you should see an editable code cell appear in which you can enter a series of Python instructions. Your notebook is automatically linked to the Google Drive associated with your Gmail account. Your Google Colab notebooks are automatically created on your Google Drive in My Drive in a folder named Colab Notebooks. You can treat them as files within Google Colab and perform typical file operations such as renaming, opening, and saving.
print('Hello world')
a = 10
print('a =', a)
numpy: array operations
pandas: dataframe operations
scipy.stats: basic statistics
matplotlib, pandas, seaborn: visualization
scikit-learn: data mining
statsmodels: time series analysis

Upcoming units will explain and apply these modules. Before doing so, this unit will first review the basics of the Python programming language.
Fundamental to all programming languages are data types and operators. This lesson provides a practical overview of the associated Python syntax. A key takeaway from this lesson should be the ability to distinguish between different data types (int, float, str). Additionally, you should understand how to assign a value to a variable using the assignment operator. Finally, it is essential to understand the difference between arithmetic and comparison (or "relational") operators. Make sure to thoroughly practice the programming examples to solidify these concepts.
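As a minimal sketch of these ideas (the variable names and values below are made up for illustration), the following cell assigns values of the three basic types and contrasts arithmetic operators, which return numbers, with comparison operators, which return booleans:

```python
# Assignment: the = operator binds a name to a value
count = 7          # int
price = 2.5        # float
label = "apples"   # str

print(type(count), type(price), type(label))

# Arithmetic operators produce numbers
total = count * price      # multiplication of int and float gives a float
quotient = count // 2      # floor division
remainder = count % 2      # modulo (remainder after division)

# Comparison (relational) operators produce booleans
is_expensive = total > 20
is_odd = remainder == 1

print(total, quotient, remainder, is_expensive, is_odd)
```

Note that `=` assigns a value while `==` tests for equality; confusing the two is a common early mistake.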
The math module elevates Python to perform operations available on a basic calculator (logarithmic, trigonometric, exponential, etc.). At various points throughout the course, you will see this module invoked within programming examples using the import command:
import math
Commands from this module should be part of your repertoire as a Python programmer.
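For instance, a few calculator-style operations might look like the short sketch below. The particular inputs are arbitrary choices for illustration; note that the trigonometric functions expect angles in radians.

```python
import math

root = math.sqrt(16)            # square root
natural_log = math.log(math.e)  # natural logarithm (base e)
log_base_10 = math.log10(1000)  # base-10 logarithm
sine = math.sin(math.pi / 2)    # sine of 90 degrees, given in radians
exponential = math.exp(0)       # e raised to the power 0

print(root, natural_log, log_base_10, sine, exponential)
```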
We conclude this section by putting together concepts regarding datatypes, operators, and the math module. Upon completing this tutorial, you should be pretty comfortable with the basics of the Python programming language.
The core of Python programming is conditional if/elif/else control statements, loops, and functions. These are powerful tools that you must master to solve problems and implement algorithms. Use these materials to practice their syntax. If/elif/else blocks allow your program to make conditional decisions based upon values available at the time the block is executed. Loops allow you to repeat a set of operations, either a prescribed number of times or for as long as a condition holds. Python has two types of loops: for loops and while loops. For loops work by stepping through a sequence of items (such as a range of numbers) and terminating once every item has been visited. While loops work by testing a logical condition, and the loop terminates when the logical condition evaluates to False. Functions are designed to take in a specific set of input values, operate on those values, and then return a set of output results.
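The three constructs can be seen together in a small sketch. The function name `classify` and the test values are hypothetical, chosen only for illustration:

```python
def classify(n):
    """Return 'negative', 'zero', or 'positive' for a number n."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# for loop: runs once for each item in the list
labels = []
for value in [-2, 0, 5]:
    labels.append(classify(value))

# while loop: repeats until the condition becomes False
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1
    steps += 1

print(labels, steps)
```

Try predicting the contents of `labels` and the value of `steps` before running the cell.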
Use this video to reinforce the syntax and applications of if/else/elif statements, loops, and functions. Make sure you feel comfortable writing functions and testing values as they are input and returned from functions that you have written. Additionally, understanding program flow as you navigate through if-elif blocks implies that you can predict the outcome of the block before the code executes. This way, you will convince yourself that you understand how if/else/elif statements work. Finally, practice writing loops and predicting what values variables should take during each loop iteration.
Python has several built-in data structures. Lists allow you to create a collection of items that can be referenced using an index, and you can modify their values (they are "mutable"). Tuples are similar to lists except that, once created, you cannot modify their values (they are "immutable"). Dictionaries allow you to create a collection of items that can be referenced by a key (as opposed to an index). Use this section to learn about and practice programming examples involving lists, tuples, and dictionaries.
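A brief sketch of the differences, using made-up names and values: the list is modified in place, the attempt to modify the tuple raises a `TypeError`, and the dictionary is accessed by key rather than by position.

```python
# List: indexed and mutable
scores = [88, 92, 75]
scores[2] = 79             # replace the third item

# Tuple: indexed but immutable
point = (3, 4)
try:
    point[0] = 10          # not allowed for tuples
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Dictionary: items are looked up by key, not by index
ages = {"Ana": 31, "Ben": 27}
ages["Cho"] = 45           # add a new key/value pair

print(scores, point, tuple_changed, ages["Ana"])
```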
Here is more practice with tuples and dictionaries. In addition, the Python built-in data structure known as a set is also covered. Sets are not ordered, and their elements cannot be indexed (sets are not lists). To understand Python set operations, remind yourself of basic operations such as the union and intersection. Use this tutorial to compare and contrast the syntax and programming uses for lists, tuples, sets, and dictionaries.
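As a small illustration (the element values are arbitrary), union and intersection map directly onto Python's `|` and `&` operators, and converting a list to a set discards duplicates:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b           # elements in a or b (or both)
intersection = a & b    # elements in both a and b
difference = a - b      # elements in a but not in b

# Building a set from a list removes duplicate values
unique = set([1, 1, 2, 2, 3])

print(union, intersection, difference, unique)
```

Because sets are unordered, an expression such as `a[0]` raises a `TypeError`; use a list when position matters.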
Watch this video for more examples of tuples, sets, and dictionaries. Practice the examples and make sure you understand how to apply these data structures using functions. This video is designed to help you put together the concepts in this section.
Before diving into some aspects of random number generation using more advanced modules, review how to generate random numbers using the random module by practicing the code shown in these videos.
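A minimal sketch of a few commonly used functions from the random module follows. The seed value and the list of colors are arbitrary choices made here so that the run is repeatable; they are not taken from the videos.

```python
import random

random.seed(42)                                # fix the seed for repeatability

u = random.random()                            # float in the interval [0, 1)
die = random.randint(1, 6)                     # integer from 1 to 6, both ends included
pick = random.choice(["red", "green", "blue"]) # one item chosen at random
sample = random.sample(range(10), 3)           # three distinct values from 0..9

print(u, die, pick, sample)
```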
The matplotlib module is the first of several modules you will be introduced to for plotting, visualizing, and rendering data. Beginning at 20:40, follow this video to learn the basics of applying matplotlib. It is fundamental to understand how to annotate a graph with labels. We will revisit many of these concepts in upcoming units. Make sure to implement the programming examples to get used to the syntax.
Although matplotlib can work with Python lists, it is often applied within the context of the numpy module and the pandas module. This tutorial briefly references these modules, which will be introduced and elaborated upon in upcoming units. For now, follow along with the programming examples to learn some more basic matplotlib plotting commands.
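The sketch below plots two plain Python lists and annotates the graph with axis labels, a title, and a legend. The data and label text are invented for illustration; in Google Colab the figure should appear directly beneath the cell.

```python
import matplotlib.pyplot as plt

# Data held in ordinary Python lists
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")

# Annotate the graph
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple labeled plot")
ax.legend()

plt.show()
```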
Take this assessment to see how well you understood this unit.
Data science must deal with the various forms that data can take on when it is collected. While Python lists and dictionaries are powerful, data can often come in the form of numerical arrays or spreadsheets. This unit introduces the numpy module, which is useful for performing matrix operations on data contained within arrays. When you finish this unit, you will be able to implement a host of matrix operations relevant to data science.
Much of the syntax used for lists, such as indexing and slicing, naturally carries over to numpy arrays. Array operations in numpy are typical of what you would expect from basic linear algebra (such as matrix addition and multiplication, systems of linear equations, determinants, and matrix inverses). Additionally, methods within the matplotlib module can accept numpy arrays as input. Finally, we will also take a look at file handling using numpy, since having this skill is often useful for machine learning applications.
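To make this concrete, here is a minimal sketch using a small 2x2 system chosen purely for illustration. It shows list-style slicing on an array, followed by matrix multiplication, a determinant, an inverse, and the solution of a linear system.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

first_row = A[0, :]          # slicing works much like it does for lists
product = A @ A              # matrix multiplication
det = np.linalg.det(A)       # determinant
A_inv = np.linalg.inv(A)     # matrix inverse
x = np.linalg.solve(A, b)    # solve the linear system A x = b

print(first_row, det, x)
```

For solving `A x = b`, calling `np.linalg.solve` is generally preferable to computing the inverse explicitly and multiplying.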
Completing this unit should take you approximately 6 hours.
As data science can often involve making statistical inferences from data, many of the upcoming units will apply calculations rooted in probability and statistics. This unit is foundational in that it discusses various ways of generating random data, computing basic statistical measures, and performing statistical analyses in Python. When you finish this unit, you will be able to implement and apply Python methods from the scipy.stats module.
You have already seen that the random module can generate scalar random numbers and that numpy can generate arrays of random numbers. We will also find that many numpy methods extend quite naturally to the pandas module we will introduce in the next unit. Additionally, the scipy.stats module allows for statistical modeling and parameter calculations. These Python implementations will serve as a foundation for the more sophisticated methods we will use later in the course.
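As one possible illustration, the sketch below generates an array of normally distributed random numbers with numpy, computes basic statistical measures, and then uses scipy.stats to estimate the distribution's parameters. The mean of 10, standard deviation of 2, sample size, and seed are all assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Generate synthetic data: 1000 draws from a normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=1000)

# Basic statistical measures
mean = np.mean(data)
std = np.std(data, ddof=1)      # sample standard deviation

# Estimate the parameters of a normal model from the data
mu_hat, sigma_hat = stats.norm.fit(data)

# Use the fitted model: probability of observing a value below 8
p_below_8 = stats.norm.cdf(8.0, loc=mu_hat, scale=sigma_hat)

print(mean, std, mu_hat, sigma_hat, p_below_8)
```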
Completing this unit should take you approximately 13 hours.
This unit introduces the pandas module, which is necessary for generalizing numpy array operations to dataframes containing things like spreadsheet data. When you finish this unit, you will be able to process pandas dataframes and perform the suites of calculations we outlined in the earlier units.
The pandas module contains many methods that greatly simplify processing data files relevant to data science. Data collected from real-world situations is often messy, and can contain observations you might want to discard. The pandas module offers useful methods that can deal with such situations. We will also discuss methods for visualizing data using pandas.
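For example, a missing observation can either be discarded or filled in. The sketch below uses a tiny, made-up dataframe (the column names `city` and `temp_c` are hypothetical) to show both options:

```python
import numpy as np
import pandas as pd

# A small dataframe with one missing temperature reading
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp_c": [21.5, np.nan, 19.0, 23.5],
})

# Option 1: discard rows that contain missing values
clean = df.dropna()

# Option 2: replace the missing value with the column mean
filled = df.fillna({"temp_c": df["temp_c"].mean()})

print(clean)
print(filled)
```

Which option is appropriate depends on the data set; dropping rows loses observations, while filling introduces values that were never measured.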
Completing this unit should take you approximately 4 hours.
Now that you've mastered the basic statistical, array, and spreadsheet data processing techniques, it is natural to want to plot, render, and visualize that data. In this unit, we will discuss visualization techniques beyond those introduced using matplotlib and pandas. When you finish this unit, you will be able to implement and visualize data plots applied within the field of data science.
The matplotlib module is convenient for visualizing data formatted using numpy arrays. The pandas module is similarly equipped for visualizing its own dataframes. The seaborn module is also designed to work with pandas data. It is extremely powerful for rendering data using methods professional data scientists would find useful. While matplotlib provides basic plotting capabilities, seaborn can be used to construct bar charts, violin plots, heat maps, and more. These more advanced forms of visualization for statistical data sets can enable one to immediately draw inferences based on patterns discerned within the plots.
Completing this unit should take you approximately 4 hours.
Data mining attempts to find patterns and relationships within and between given data sets. The field of data mining is vast, so we have broken down its introduction into two units: supervised and unsupervised learning. We will then move on to statistical model-building. When you finish this unit, you will be able to implement learning systems fundamental to the field of data mining.
This unit discusses the basics of supervised learning, feature extraction, dimensionality reduction, and training and testing of supervised learning models. We will focus on benchmark models fundamental to data mining, such as Bayes' decision and K-nearest neighbor. We will implement them using the scikit-learn module. Understanding these methods will prepare you for future excursions into machine learning and deep learning.
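A rough sketch of the train/test workflow with a K-nearest neighbor classifier is shown below. The choice of the iris data set bundled with scikit-learn, the 75/25 split, and `n_neighbors=5` are assumptions made for illustration, not settings prescribed by this course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Feature vectors X and known class labels y
X, y = load_iris(return_X_y=True)

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the training set only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Evaluate on data the model has not seen
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Keeping the test set separate from training is what makes the reported accuracy a fair estimate of performance on new data.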
Completing this unit should take you approximately 11 hours.
This unit extends the material in the previous unit to clustering techniques, which are useful for creating pattern classification models where the input classes are unknown (which we call unsupervised learning). When you finish this unit, you will be able to create programs capable of training and testing unsupervised learning models. As in the previous unit, we will implement these techniques using the scikit-learn module.
The clustering of input feature vectors can be accomplished in several different ways. This unit focuses on two techniques: K-means, which requires some knowledge of the number of classes, and hierarchical clustering, which allows the input data to gradually define the number of classes. Both methodologies have their place within the field of data science.
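The contrast between the two techniques can be sketched on synthetic data. Everything below is an illustrative assumption: two well-separated blobs of points, `n_clusters=2` for K-means, and a `distance_threshold` of 10.0 for hierarchical (agglomerative) clustering, which lets the number of clusters emerge from the data instead of being specified.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two well-separated groups of 2-D points (no class labels are used)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

# K-means: the number of clusters must be supplied up front
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_labels = kmeans.labels_

# Hierarchical clustering: merge points until clusters are farther
# apart than the threshold, then report how many clusters remain
hier = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0).fit(X)

print(km_labels)
print(hier.n_clusters_)
```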
Completing this unit should take you approximately 5 hours.
There is more to data science than simply analyzing data. The ability to build a model from a data set implies that some deeper relationship amongst data observations has been captured. Now that we have covered the basics of data mining, it makes sense to consider statistical models that allow for inference and also have some predictive power.
This unit demonstrates how to apply the scikit-learn module to building regression models. We will also show how to interpret model parameters once a model has been constructed. This unit will teach you how to apply Python for creating regression models, drawing inferences, and making predictions using computed model parameters.
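As a minimal sketch, the example below fits a linear regression to noise-free synthetic data generated from the line y = 3x + 2 (an assumption made so the recovered parameters are easy to check), then reads off the model parameters and makes a prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 3x + 2
x = np.arange(10, dtype=float).reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = 3.0 * x.ravel() + 2.0

model = LinearRegression().fit(x, y)

# Interpret the fitted parameters
slope = model.coef_[0]
intercept = model.intercept_

# Predict the response at a new input value
pred = model.predict(np.array([[12.0]]))

print(slope, intercept, pred[0])
```

With real data the points will not fall exactly on a line, so the fitted slope and intercept become estimates whose quality must be assessed.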
Completing this unit should take you approximately 5 hours.
The last step in this introduction to data science requires us to deal with data derived from time series, such as stock prices as a function of time. All the tools from the earlier units will play a role in performing these analyses. As in the last unit, our goal is to build statistical models that allow for inference and prediction.
This unit introduces models for analyzing time-series data. The statsmodels module contains various analysis tools, including methods for handling stationary and nonstationary data. This unit will focus on constructing autoregressive, moving average, and autoregressive integrated moving average models. This unit will teach you how to implement Python programs capable of statistical inference and forecasting from time-series data.
Completing this unit should take you approximately 6 hours.
This study guide will help you get ready for the final exam. It discusses the key topics in each unit, walks through the learning outcomes, and lists important vocabulary. It is not meant to replace the course materials!
Please take a few minutes to give us feedback about this course. We appreciate your feedback, whether you completed the whole course or even just a few resources. Your feedback will help us make our courses better, and we use your feedback each time we make updates to our courses.
If you come across any urgent problems, email contact@saylor.org.
Take this exam if you want to earn a free Course Completion Certificate.
To receive a free Course Completion Certificate, you will need to earn a grade of 70% or higher on this final exam. Your grade for the exam will be calculated as soon as you complete it. If you do not pass the exam on your first try, you can take it again as many times as you want, with a 7-day waiting period between each attempt.
Once you pass this final exam, you will be awarded a free Course Completion Certificate.