Course Introduction

Time: 67 hours

Free Certificate
This course introduces several Python modules, such as pandas, scikit-learn, scipy.stats, and statsmodels, that are useful for data analysis, data visualization, and data mining. The course will gradually shift from introductory topics such as a review of Python, matrix operations, and statistics to applications and implementing programs involving data mining, visualization, statistical models, and time series analysis.
First, read the course syllabus. Then, enroll in the course by clicking "Enroll me in this course". Click Unit 1 to read its introduction and learning outcomes. You will then see the learning materials and instructions on how to use them.

Unit 1: What is Data Science?
This unit will introduce you to the field of data science. Before delving into the programming aspects of the course, it is important to have a clear view of what data science is. There are many techniques and computational methodologies for dealing with data science problems. The goal of this unit is to help put the rest of the course in context and help you understand how to conceptually organize various facets of the field.
When attempting to solve a data science problem, the overarching goal is to derive inferences and draw conclusions based on existing data sets. Such inferences are made through statistical, computational, and visualization techniques. Furthermore, even before computations can be made, data sets must often be curated and refined. This unit will help you to order and categorize your thinking to understand the flow of data science processes.
Completing this unit should take you approximately 7 hours.
Unit 2: Python for Data Science
This unit will introduce the Python IDE we will use in this course. We will also introduce installing Python modules relevant to upcoming units. The primary goals of this unit are to ensure that all required software is ready to run and to review the Python programming language.
Developing expertise in data science requires understanding drawn from a breadth of different subjects such as numerical methods, matrix computations, statistics, data processing, visualization, data mining, and statistical modeling. Python is an excellent vehicle for developing this expertise because of the availability of various modules capable of addressing these topics. In this course, we will use the numpy module to introduce numerical methods and matrix computations. The pandas module (built upon the numpy module) will be used for data processing and visualization. Basic statistical calculations will be accomplished using scipy and pandas. Data rendering and visualization applications will be addressed using matplotlib and seaborn. Finally, data mining and statistical modeling will be introduced using scikit-learn and statsmodels. Mastering data science in the context of these Python modules will position you to delve into deeper subjects such as machine learning and deep learning.
You should leave this unit being able to write Python programs that can perform basic computational and data processing tasks. We will discuss core concepts such as data types, operators, functions, conditional statements, loops, and file handling. We will also give examples of Python data structures, such as lists and dictionaries, as well as basic object-oriented syntax. Understanding these data structures will enable you to implement basic plotting and data rendering instructions using the matplotlib module. This abbreviated (yet thorough) set of topics will serve as the programming vocabulary you will need to complete the course.
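As a rough illustration of the vocabulary listed above, the following sketch combines a dictionary, lists, a function, a conditional, a loop, and a small class. The names and scores are made up for illustration and are not taken from the course materials.

```python
# Hypothetical example data: names and scores are invented for illustration.
scores = {"ana": [88, 92, 79], "ben": [65, 70, 58]}


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


class Student:
    """Minimal class that bundles a name with a list of marks."""

    def __init__(self, name, marks):
        self.name = name
        self.marks = marks

    def passed(self, threshold=70):
        # Conditional logic: compare the mean mark to a threshold.
        return mean(self.marks) >= threshold


# Build a list of objects from the dictionary, then loop over it.
students = [Student(name, marks) for name, marks in scores.items()]
results = {}
for student in students:
    results[student.name] = "pass" if student.passed() else "fail"

print(results)
```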
Completing this unit should take you approximately 6 hours.
Unit 3: The numpy Module
Data science must deal with the various forms that data can take on when it is collected. While Python lists and dictionaries are powerful, data can often come in the form of numerical arrays or spreadsheets. This unit introduces the numpy module, which is useful for performing matrix operations on data contained within arrays. When you finish this unit, you will be able to implement a host of matrix operations relevant to data science.
Much of the syntax used for lists, such as indexing and slicing, naturally carries over to numpy arrays. Array operations in numpy are typical of what you would expect from basic linear algebra (such as matrix addition and multiplication, systems of linear equations, determinants, and matrix inverses). Additionally, methods within the matplotlib module can accept numpy arrays as input. Finally, we will also take a look at file handling using numpy, since having this skill is often useful for machine learning applications.
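The operations named above can be sketched in a few lines of numpy. The matrix and vector values here are arbitrary choices for illustration.

```python
import numpy as np

# A small, arbitrary 2x2 system chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# List-style indexing and slicing carry over to arrays.
first_row = A[0, :]
corner = A[1, 1]

# Basic linear algebra.
total = A + np.eye(2)        # matrix addition
product = A @ A              # matrix multiplication
det_A = np.linalg.det(A)     # determinant: 2*3 - 1*1 = 5
A_inv = np.linalg.inv(A)     # matrix inverse
x = np.linalg.solve(A, b)    # solves the linear system A x = b

print(x)
```

For solving a linear system, `np.linalg.solve` is generally preferable to multiplying by an explicitly computed inverse, since it is typically more accurate and efficient.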
Completing this unit should take you approximately 6 hours.
Unit 4: Applied Statistics in Python
As data science can often involve making statistical inferences from data, many of the upcoming units will apply calculations rooted in probability and statistics. This unit is foundational in that it discusses various ways of generating random data, computing basic statistical measures, and performing statistical analyses in Python. When you finish this unit, you will be able to implement and apply Python methods from the scipy.stats module.
You have already seen that the random module can generate scalar random numbers and that numpy can generate arrays of random numbers. We will also find that many numpy methods extend quite naturally to the pandas module we will introduce in the next unit. Additionally, the scipy.stats module allows for statistical modeling and parameter calculations. These Python implementations will serve as a foundation for the more sophisticated methods we will use later in the course.
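A minimal sketch of this workflow, assuming a normally distributed sample: generate random data with numpy, compute basic statistics, and use scipy.stats to estimate distribution parameters and run a simple test. The seed, sample size, and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Generate an array of random numbers (seed fixed for repeatability).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)

# Basic statistical measures.
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)   # ddof=1 gives the sample standard deviation

# Parameter estimation: maximum-likelihood fit of a normal distribution.
mu_hat, sigma_hat = stats.norm.fit(sample)

# A simple statistical analysis: one-sample t-test against a mean of 10.
t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)

print(mu_hat, sigma_hat, p_value)
```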
Completing this unit should take you approximately 13 hours.
Unit 5: The pandas Module
This unit introduces the pandas module, which is necessary for generalizing numpy array operations to dataframes containing things like spreadsheet data. When you finish this unit, you will be able to process pandas dataframes and perform the suites of calculations we outlined in the earlier units.
The pandas module contains many methods that greatly simplify processing data files relevant to data science. Data collected from real-world situations is often messy, and can contain observations you might want to discard. The pandas module offers useful methods that can deal with such situations. We will also discuss methods for visualizing data using pandas.
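As one example of handling messy observations, the sketch below drops rows with a missing measurement and then summarizes what remains. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature reading.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
    "temp_c": [31.0, np.nan, 29.5, 18.0, 21.0],
})

# Discard observations where the measurement is missing.
clean = df.dropna(subset=["temp_c"])

# Summarize the remaining data by group.
mean_by_city = clean.groupby("city")["temp_c"].mean()

print(mean_by_city)
```

Dropping rows is only one option; depending on the data, filling missing values (for example with `fillna`) may be more appropriate.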
Completing this unit should take you approximately 4 hours.
Unit 6: Visualization
Now that you've mastered the basic statistical, array, and spreadsheet data processing techniques, it is natural to want to plot, render, and visualize that data. In this unit, we will discuss visualization techniques beyond those introduced using matplotlib and pandas. When you finish this unit, you will be able to implement and visualize data plots applied within the field of data science.
The matplotlib module is convenient for visualizing data formatted using numpy arrays. pandas is similarly equipped for pandas dataframes. The seaborn module is also designed to work with pandas data. It is extremely powerful for rendering data using methods professional data scientists would find useful. While matplotlib provides basic plotting capabilities, seaborn can be used to construct bar charts, violin plots, heat maps, and more. These more advanced forms of visualization for statistical data sets can enable one to immediately draw inferences based on patterns discerned within the plots.
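The following sketch shows two of the plot types mentioned above, a violin plot and a heat map, drawn from a pandas dataframe. The data is synthetic, the output filename is an arbitrary choice, and the non-interactive "Agg" backend is selected only so the script can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two groups and two correlated numeric columns.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "x": rng.normal(size=100),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=100)

fig, (ax_violin, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Violin plot: distribution of y within each group.
sns.violinplot(data=df, x="group", y="y", ax=ax_violin)

# Heat map: correlation matrix of the numeric columns.
corr = df[["x", "y"]].corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, ax=ax_heat)

fig.tight_layout()
fig.savefig("seaborn_sketch.png")  # hypothetical output filename
```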
Completing this unit should take you approximately 4 hours.
Unit 7: Data Mining I – Supervised Learning
Data mining attempts to find patterns and relationships within and between given data sets. The field of data mining is vast, so we have broken down its introduction into two units: supervised and unsupervised learning. We will then move on to statistical model building. When you finish this unit, you will be able to implement learning systems fundamental to the field of data mining.
This unit discusses the basics of supervised learning, feature extraction, dimensionality reduction, and training and testing of supervised learning models. We will focus on benchmark models fundamental to data mining, such as Bayes' decision and K-nearest neighbor. We will implement them using the scikit-learn module. Understanding these methods will prepare you for future excursions into machine learning and deep learning.
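A minimal sketch of training and testing two supervised classifiers with scikit-learn is shown below. It assumes the iris data set bundled with scikit-learn, and it uses Gaussian naive Bayes as one concrete Bayes-type classifier; the course materials may use different data or a different Bayes formulation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled data set (features X, class labels y).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train a K-nearest neighbor classifier (K = 5 is an arbitrary choice).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Train a Gaussian naive Bayes classifier (an assumed stand-in for
# a Bayes decision rule).
nb = GaussianNB().fit(X_train, y_train)

# Evaluate accuracy on the held-out test set.
knn_acc = knn.score(X_test, y_test)
nb_acc = nb.score(X_test, y_test)

print(knn_acc, nb_acc)
```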
Completing this unit should take you approximately 11 hours.
Unit 8: Data Mining II – Clustering Techniques
This unit extends the material in the previous unit to clustering techniques, which are useful for creating pattern classification models where the input classes are unknown (which we call unsupervised learning). When you finish this unit, you will be able to create programs capable of training and testing unsupervised learning models. As in the previous unit, we will implement these techniques using the scikit-learn module.
The clustering of input feature vectors can be accomplished in several different ways. This unit focuses on two techniques: K-means, which requires some knowledge of the number of classes, and hierarchical clustering, which allows the input data to gradually define the number of classes. Both methodologies have their place within the field of data science.
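The contrast between the two techniques can be sketched as follows: K-means is told the number of clusters up front, while agglomerative (hierarchical) clustering is given a distance threshold and reports how many clusters it found. The synthetic blobs, the average linkage, and the threshold value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Synthetic data: two well-separated blobs of 30 points each.
rng = np.random.default_rng(seed=0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
X = np.vstack([blob_a, blob_b])

# K-means: the number of clusters must be supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
k_labels = kmeans.labels_

# Hierarchical clustering: no cluster count is given. Merging stops once
# clusters are farther apart than the distance threshold.
hier = AgglomerativeClustering(
    n_clusters=None, distance_threshold=3.0, linkage="average"
).fit(X)
h_labels = hier.labels_
n_found = hier.n_clusters_

print(n_found)
```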
Completing this unit should take you approximately 5 hours.
Unit 9: Data Mining III – Statistical Modeling
There is more to data science than simply analyzing data. The ability to build a model from a data set implies that some deeper relationship amongst data observations has been captured. Now that we have covered the basics of data mining, it makes sense to consider statistical models that allow for inference and also have some predictive power.
This unit demonstrates how to apply the scikit-learn module to building regression models. We will also show how to interpret model parameters once a model has been constructed. This unit will teach you how to apply Python for creating regression models, drawing inferences, and making predictions using computed model parameters.
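A minimal sketch of this process, assuming a simple linear relationship: fit a regression model with scikit-learn, read off the fitted parameters, and make a prediction. The synthetic data is generated from a known line (slope 3, intercept 2) so the recovered parameters can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from y = 3x + 2 plus noise (values chosen for illustration).
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=200)

# scikit-learn expects a 2-D feature array: one row per observation.
model = LinearRegression().fit(x.reshape(-1, 1), y)

# Interpret the model parameters.
slope = model.coef_[0]        # estimated change in y per unit change in x
intercept = model.intercept_  # estimated value of y at x = 0

# Use the fitted parameters to predict y at a new input.
prediction = model.predict(np.array([[4.0]]))[0]

print(slope, intercept, prediction)
```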
Completing this unit should take you approximately 5 hours.
Unit 10: Time Series Analysis
The last step in this introduction to data science requires us to deal with data derived from time series, such as stock prices as a function of time. All the tools from the earlier units will play a role in performing these analyses. As in the last unit, our goal is to build statistical models that allow for inference and prediction.
This unit introduces models for analyzing time-series data. The statsmodels module contains various analysis tools, including methods for handling stationary and nonstationary data. We will focus on constructing autoregressive, moving average, and autoregressive integrated moving average models. By the end of this unit, you will be able to implement Python programs capable of statistical inference and forecasting from time-series data.
Completing this unit should take you approximately 6 hours.
Study Guide
This study guide will help you get ready for the final exam. It discusses the key topics in each unit, walks through the learning outcomes, and lists important vocabulary. It is not meant to replace the course materials!
Course Feedback Survey
Please take a few minutes to give us feedback about this course. We appreciate your feedback, whether you completed the whole course or even just a few resources. Your feedback will help us make our courses better, and we use your feedback each time we make updates to our courses.
If you come across any urgent problems, email contact@saylor.org or post in our discussion forum.
Certificate Final Exam
Take this exam if you want to earn a free Course Completion Certificate.
To receive a free Course Completion Certificate, you will need to earn a grade of 70% or higher on this final exam. Your grade for the exam will be calculated as soon as you complete it. If you do not pass the exam on your first try, you can take it again as many times as you want, with a 7-day waiting period between each attempt.
Once you pass this final exam, you will be awarded a free Course Completion Certificate.