Topic | Name | Description |
---|---|---|
Course Syllabus | Course Syllabus | |
1.1: Introduction to Data Science | A History of Data Science | When learning any new field, context is everything; therefore, we will begin this course by introducing the history of data science. In this way, you will be able to understand how the field became an amalgamation of various areas of science dealing with data in many different forms. In this section, and as the course continues, pay close attention to the various ways data can be represented and analyzed. |
Understanding Data Science | The field of data science is quite diverse. Before getting into the technical details of the course, it is important to gain some perspective on how the pieces fit together. As you go through this section, remember that we are driving toward the nexus of coding implementations (in Python) for data analysis and modeling. As the course progresses, Python implementations will require a mixture of mathematical and visualization techniques. For now, use this introduction to order your understanding of the field. Watch the first 1 minute and 40 seconds of this video. |
|
1.2: How Data Science Works | How Data Science Works | As you immerse yourself in this introductory phase of the course, you will transition from a qualitative understanding of concepts to a more quantitative understanding. This present step involves seeing examples of what real data looks like, how it is formatted, and various approaches for dealing with analyses using mathematics and visualization. If this section is truly doing its job, you should ask yourself: "how might the formats and analyses presented be implemented using a programming language?" We will gradually answer this question as we go deeper into the course. |
The Data Science Pipeline | Now that you have some terminology and methods under your belt, we can begin to put together an understanding of a typical data science pipeline from beginning to end. Data usually comes in a raw form and so it must be curated and prepared. This is the process of data engineering. At this point, data analysis techniques such as visualization and statistical analyses should lead to some sense of what relationships exist within the data. Hence, the next step is to derive a model for the data (either by building statistical models or applying machine learning, for example). This process is repeated and refined until quantifiable measures of success have been deemed to be met. |
|
The Data Science Lifecycle | To bring this section to a close, we will present the data science pipeline slightly differently. This is because there is not always a "one size fits all" for a given data science problem. Therefore, it is important to see a slightly different perspective on the process of solving a data science problem. In this way, you can round out your understanding of the field. |
|
1.3: Important Facets of Data Science | Data Scientist Archetypes | With the materials introduced so far, you are now in a position to consider what interests you most about data science. In this course, you will have a chance to involve yourself in data analysis, modeling, engineering, and mechanics. This involvement will entail the ability to quantify using Python as the tool for implementation. |
What is the Field of Data Science? | As we approach the end of this introductory unit, it is important to tie up any loose ends. There are practical aspects to working within the field of data science. For instance, what conferences do data scientists attend, and where do data scientists "hang out" together? What is it like when presenting data science findings to colleagues in your organization? Use this video to get a better sense of what the field is about. |
|
Thinking about the World | There are two major approaches to data science: analytical mathematics (including statistics) and visualization. These two categories are not mutually exclusive. However, mathematical analysis would be considered more of a "left-brain" approach, while visualization would reflect a more "right-brain" approach. Both are powerful approaches for analyzing data, and we should not choose one or exclude the other. Visualization is a sensible vehicle for introducing the field because data relationships become immediately apparent to the naked eye. Use the materials in this section to compare and contrast analytic approaches versus visualization approaches. In this course, we will try to strike a healthy balance between the two. |
|
2.1: Google Colaboratory | Introduction to Google Colab | The vast majority of coding examples in this course will be phrased in the form of a Python notebook. Google Colaboratory (or, just Google Colab) is an online interface for creating Python notebooks. It is extremely convenient because a vast set of Python modules (and, in particular, the modules used in this course) are already installed and, once imported, are ready for use. Assuming you have a Gmail account and are logged in, click the link provided to run Google Colab notebooks. In the menu bar, choose File --> New notebook
Upon entering some basic lines of Python code in the code cell such as:
print('Hello world') you can practice executing your code cell in one of several ways:
In this course, you will be introduced to several Python modules applicable for data science computations and program implementations:
Upcoming units will explain and apply these modules. Before doing so, this unit will first review the basics of the Python programming language. |
2.2: Datatypes, Operators, and the math Module | Data Types in Python | Fundamental to all programming languages are datatypes and operators. This lesson provides a practical overview of the associated Python syntax. Key takeaways from this unit should be the ability to distinguish between different data types (int, float, str). Additionally, you should understand how to assign a variable with a value using the assignment operator. Finally, it is essential to understand the difference between arithmetic and comparison (or "relational") operators. Make sure to thoroughly practice the programming examples to solidify these concepts. |
Operators and the math Module | The math module extends Python with the operations available on a basic scientific calculator (logarithmic, trigonometric, exponential, etc.). At various points throughout the course, you will see this module invoked within programming examples using the import command: import math. Commands from this module should be part of your repertoire as a Python programmer. We conclude this section by putting together concepts regarding datatypes, operators, and the math module. Upon completing this tutorial, you should be comfortable with the basics of the Python programming language. |
|
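As a quick illustration of the ideas in this section, here is a minimal sketch that assigns values of the three basic data types, applies arithmetic and comparison operators, and calls a few functions from the math module. The variable names and values are made up for the example.

```python
import math

# Assign values of three basic types with the assignment operator (=).
count = 7          # int
ratio = 2.5        # float
label = "sample"   # str

# Arithmetic operators produce numbers.
total = count + ratio      # int + float gives a float
quotient = count // 2      # floor division
remainder = count % 2      # modulus

# Comparison (relational) operators produce booleans.
is_larger = count > ratio
is_equal = count == 7

# The math module supplies calculator-style functions.
root = math.sqrt(16)
log_value = math.log(math.e)   # natural logarithm
cosine = math.cos(0)

print(type(count).__name__, type(ratio).__name__, type(label).__name__)
print(total, quotient, remainder, is_larger, is_equal)
print(root, log_value, cosine)
```

Note that = assigns a value while == tests for equality; mixing them up is a common early mistake.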
2.3: Control Statements, Loops, and Functions | Functions, Loops, and Logic | The core of Python programming is conditional if/else/elif control statements, loops, and functions. These are powerful tools that you must master to solve problems and implement algorithms. Use these materials to practice their syntax. If/else/elif blocks allow your program to make conditional decisions based upon values available at the time the block is executed. Loops allow you to perform a set of operations a set number of times based on a set of conditions. Python has two types of loops: for loops and while loops. For loops work by counting and terminating when the loop has executed the prescribed number of times. While loops work by testing a logical condition, and the loop terminates when the logical condition evaluates to a value of false. Functions are designed to take in a specific set of input values, operate on those values, and then return a set of output results. |
Functions and Control Structures | Use this video to reinforce the syntax and applications of if/else/elif statements, loops, and functions. Make sure you feel comfortable writing functions and testing values as they are input and returned from functions that you have written. Additionally, understanding program flow as you navigate through if-elif blocks implies that you can predict the outcome of the block before the code executes. This way, you will convince yourself that you understand how if/else/elif statements work. Finally, practice writing loops and predicting what values variables should take during each loop iteration. |
|
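The three building blocks described above can be sketched together in a few lines. This is only an illustrative example; the function name classify and the numbers used are hypothetical.

```python
def classify(value):
    """Return a label describing the sign of value."""
    if value > 0:
        return "positive"
    elif value < 0:
        return "negative"
    else:
        return "zero"

# A for loop visits each item of a sequence in turn.
labels = []
for number in [-2, 0, 3]:
    labels.append(classify(number))

# A while loop repeats until its condition evaluates to False.
halvings = 0
remaining = 40
while remaining > 5:
    remaining = remaining / 2
    halvings += 1

print(labels)
print(halvings, remaining)
```

Try predicting the printed values before running the cell, then compare them with the actual output.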
2.4: Lists, Tuples, Sets, and Dictionaries | Data Structures in Python | Python has several built-in data structures. Lists allow you to create a collection of items that can be referenced using an index, and you can modify their values (they are "mutable"). Tuples are similar to lists except that, once created, you cannot modify their values (they are "immutable"). Dictionaries allow you to create a collection of items that can be referenced by a key (as opposed to an index). Use this section to learn about and practice programming examples involving lists, tuples, and dictionaries. |
Sets, Tuples, and Dictionaries | Here is more practice with tuples and dictionaries. In addition, the Python built-in data structure known as a set is also covered. Sets are not ordered, and their elements cannot be indexed (sets are not lists). To understand Python set operations, remind yourself of basic operations such as the union and intersection. Use this tutorial to compare and contrast the syntax and programming uses for lists, tuples, sets, and dictionaries. |
|
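To compare the four built-in data structures side by side, consider the following short sketch. All names and values are invented for illustration.

```python
# Lists are ordered and mutable: items can be replaced by index.
measurements = [3, 1, 4]
measurements[0] = 9

# Tuples are ordered but immutable: assignment by index raises TypeError.
point = (2.0, 5.0)
tuple_is_immutable = False
try:
    point[0] = 1.0
except TypeError:
    tuple_is_immutable = True

# Dictionaries are referenced by key rather than by position.
ages = {"ada": 36, "alan": 41}
ages["grace"] = 45

# Sets are unordered collections of unique items that support set algebra.
first = {1, 2, 3}
second = {3, 4}
union = first | second
intersection = first & second

print(measurements, tuple_is_immutable, ages["grace"])
print(union, intersection)
```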
Examples of Sets, Tuples, and Dictionaries | Watch this video for more examples of tuples, sets, and dictionaries. Practice the examples and make sure you understand how to apply these data structures using functions. This video is designed to help you put together the concepts in this section. |
|
2.5: The random Module | Python's random Module | Before diving into some aspects of random number generation using more advanced modules, review how to generate random numbers using the random module by practicing the code shown in these videos. |
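As a reminder of what the random module offers, here is a minimal sketch of a few commonly used calls. The seed value and the ranges are arbitrary choices for the example.

```python
import random

# Seeding makes the pseudo-random sequence repeatable between runs.
random.seed(42)

unit_sample = random.random()             # float in [0.0, 1.0)
ranged_sample = random.uniform(1.0, 5.0)  # float between 1.0 and 5.0
die_roll = random.randint(1, 6)           # integer from 1 to 6, both ends included
coin = random.choice(["heads", "tails"])  # one item picked from a sequence

print(unit_sample, ranged_sample, die_roll, coin)
```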
2.6: The matplotlib Module | Visualization and matplotlib | The matplotlib module is the first of several modules you will be introduced to for plotting, visualizing, and rendering data. Beginning at 20:40, follow this video to learn the basics of applying matplotlib. It is fundamental to understand how to annotate a graph with labels. We will revisit many of these concepts in upcoming units. Make sure to implement the programming examples to get used to the syntax. |
Precision Data Plotting with matplotlib | Although matplotlib can work with Python lists, it is often applied within the context of the numpy module and the pandas module. This tutorial briefly references these modules, which will be introduced and elaborated upon in upcoming units. For now, follow along with the programming examples to learn some more basic matplotlib plotting commands. |
|
3.1: Constructing Arrays | Using Matrices | Technically speaking, vectors and matrices are mathematical objects usually applied within the context of linear algebra. An array, on the other hand, is the computer science data structure used to represent a matrix. So, for example, a vector can be stored in a one-dimensional array. A matrix can be stored in a two-dimensional (or "double subscript") array. Much of data science depends upon using arrays (often multi-dimensional) containing numerical data. Before delving into the numpy module, use this playlist to review basic vector and matrix operations. |
Creating numpy Arrays | In contrast to many other programming languages, the basic Python language does not have an explicit array data structure. Instead, arrays can be constructed using nested lists; however, methods for basic matrix operations would then have to be formulated. The numpy module relieves the programmer of this responsibility so that they can readily perform array computations. The first step in applying this module is to learn the syntax for constructing an array. |
|
numpy Fundamentals | The numpy module contains a multitude of methods for array processing. Use the materials in this section to gain fluency in numpy syntax for manipulating arrays. Pay close attention to how you can use relational operators to extract specific elements from an array. |
|
numpy for Numerical and Scientific Computing | As you approach the end of this introduction to constructing arrays in numpy, read this section to familiarize yourself with various numpy features used throughout the course. Specifically, the need for attributes and functions such as shape, size, linspace, reshape, eye, and zeros often arises when manipulating arrays. |
|
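The array construction tools named in this section can be tried out with a short sketch like the one below. The array contents are arbitrary, and the last line shows one way a relational operator can extract specific elements.

```python
import numpy as np

# Build a 2-by-3 array from a nested list.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)   # number of rows and columns
print(matrix.size)    # total number of elements

# Five evenly spaced values from 0 to 1, endpoints included.
grid = np.linspace(0.0, 1.0, 5)

# Rearrange six consecutive integers into 3 rows and 2 columns.
reshaped = np.arange(6).reshape(3, 2)

# Identity matrix and an array filled with zeros.
identity = np.eye(3)
blank = np.zeros((2, 2))

# A relational operator yields a boolean mask that selects matching elements.
above_three = matrix[matrix > 3]
print(above_three)
```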
3.2: Indexing | numpy Arrays and Vectorized Programming | One major theme of numpy is that your intuition for indexing ordered containers such as Python lists should naturally carry over to arrays. Therefore, the syntax for techniques such as slicing remains consistent. Watch this video to practice the gymnastics of indexing and slicing when applied to numpy arrays. |
Advanced Indexing with numpy | Before continuing, use this video to summarize and reinforce everything discussed in this unit. Additionally, pay close attention to various indexing tricks and shortcuts that can help write compact and efficient code. |
|
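Here is a small sketch of how list-style indexing and slicing carry over to numpy arrays, including the two-dimensional case. The arrays are arbitrary examples.

```python
import numpy as np

values = np.arange(10)          # the integers 0 through 9

first_three = values[:3]        # same slice syntax as a Python list
every_other = values[::2]       # start:stop:step with a step of 2
last = values[-1]               # negative indices count from the end

table = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

second_row = table[1, :]        # row index 1, all columns
third_column = table[:, 2]      # all rows, column index 2
block = table[:2, 1:3]          # first two rows, columns 1 and 2

# Indexing with a list of positions picks out those elements.
picked = values[[1, 4, 7]]

print(first_three, every_other, last)
print(second_row, third_column)
print(block)
print(picked)
```

One difference from lists worth keeping in mind: a basic numpy slice is a view of the original array, so modifying the slice also modifies the original.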
3.3: Array Operations | A Visual Intro to numpy and Data Representation | Now that you have mastered some basic numpy syntax, it is time to see how array representations can correspond to actual data. In other words, numbers in an array, by intentional construction of the data scientist, are meant to represent the characteristics of the data. Data can come in many different forms depending upon the application. One-dimensional data might be derived from an audio signal. Two-dimensional data might be derived from an image. Three-dimensional data might be derived from a video signal. A given array representation is clearly problem dependent, so it is important to see some examples of how this is accomplished in practice. |
Mathematical Operations with numpy | No discussion of numpy would be complete without introducing how matrix computations are accomplished. For instance, it is important to understand how analogs of methods within the math module (such as cos, log, and exp) can be applied in the numpy module. Additionally, you should give the concept of broadcasting your undivided attention because it defines the rules for how arithmetic operations are applied to arrays of different shapes. |
|
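The following sketch illustrates the two ideas emphasized above: numpy analogs of math functions that act element by element, and broadcasting between arrays of different shapes. The arrays are arbitrary examples.

```python
import numpy as np

# np.cos is applied to every element, unlike math.cos which takes one number.
angles = np.array([0.0, np.pi / 2, np.pi])
cosines = np.cos(angles)

grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])        # shape (3,)
column = np.array([[100.0], [200.0]])     # shape (2, 1)

# Broadcasting stretches the smaller array to match the larger one.
shifted = grid + row       # row is added to each row of grid
offset = grid + column     # column is added to each column of grid

# The @ operator performs a true matrix product, not an elementwise one.
product = grid @ grid.T    # shape (2, 2)

print(cosines)
print(shifted)
print(offset)
print(product)
```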
numpy with matplotlib | The numpy module is often used in conjunction with the matplotlib module to visualize array data. Use this video to solidify your understanding of visualization of matrix computations. |
|
3.4: Saving and Loading Data | Storing Data in Files | It is important to have some knowledge of file handling using numpy. Data science applications often require the capacity to save and load array data. This section will help to extend your understanding of file handling constructs beyond that provided within the basic Python language. |
Load Compressed Data using numpy.load | As data arrays get larger and larger, it becomes necessary to compress them if they are to be saved to a file. Data science and machine learning applications using numpy often deal with saving and loading compressed data in the form of files with .npz extensions. This material introduces how to load compressed numpy files. |
|
Saving a Compressed File with numpy | This page gives an example of saving a compressed file using numpy. As with any file format, a programmer should feel comfortable reading and writing files. The reason for compressing files is that the file sizes in machine learning applications can be enormous. Anything you can do to reduce the payload can result in more efficient file processing. |
|
".npy" versus ".npz" Files | Here is a brief comparison between .npy and .npz files. A .npy file is reserved for storing a single array in the form of binary data. A .npz file can store multiple arrays in a single archive and can additionally compress the data. |
|
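To make the difference between the two file types concrete, here is a minimal sketch that saves and reloads arrays in both formats. The file names and array contents are hypothetical, and a temporary folder is used so nothing is left behind.

```python
import os
import tempfile

import numpy as np

features = np.arange(6).reshape(2, 3)
targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as folder:
    # A .npy file holds exactly one array.
    single_path = os.path.join(folder, "features.npy")
    np.save(single_path, features)
    restored = np.load(single_path)

    # A .npz file holds several named arrays; savez_compressed also compresses them.
    bundle_path = os.path.join(folder, "dataset.npz")
    np.savez_compressed(bundle_path, features=features, targets=targets)

    # Loading a .npz file returns a dictionary-like object keyed by array name.
    with np.load(bundle_path) as bundle:
        names = sorted(bundle.files)
        restored_targets = bundle["targets"]

print(restored)
print(names, restored_targets)
```

In Google Colab you would typically write to a regular path in the notebook's working directory instead of a temporary folder.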
4.1: Basic Statistical Measures and Distributions | Applying Statistics | At the core of data science are statistical methods that enable verifying hypotheses, drawing conclusions, making inferences, and forecasting. To understand how the numpy, matplotlib, and scipy modules are to be applied, the fundamentals of statistics must first be introduced (or reviewed). Watch from 4:36:30 to get an overview of how to apply statistics in this unit and throughout the course. |
Key Statistical Terms | Before delving into deeper topics, it is important to be clear about fundamental terms such as probability, statistics, data, and sampling. Additionally, you should master quantities derivable from empirical data, such as frequency, relative frequency, and cumulative frequency. |
|
Descriptive Statistics | Once data has been collected and categorized, visualizations and fundamental calculations help describe the data. The visualization approaches (such as bar charts, histograms, and box plots) and calculations (such as mean, median, and standard deviation) introduced here will be revisited and implemented using Python. |
|
Basic Probability | A random experiment is one where the set of possible outcomes is known, but the outcome is unknown for each experiment. Under these circumstances, given enough data, we can assign a probability to each possible outcome. A host of concepts can be developed from these basic notions, such as independence, mutual exclusivity, and conditional probability. Furthermore, rules governing calculations, such as adding probabilities and multiplying probabilities, naturally follow from these basic concepts. |
|
Distribution and Standard Deviation | Terms and definitions from descriptive statistics readily carry over to situations where values are countable. The outcome of a coin flip, the roll of a die, or a card drawn from a deck are all examples of discrete random variables. We can then put concepts such as mean and standard deviation on a firmer mathematical footing by defining the expected value and the variance of a discrete random variable. |
|
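As a worked example of these definitions, the sketch below computes the expected value and variance for the roll of a fair six-sided die, where each outcome has probability 1/6. Exact fractions are used so the results can be compared with a hand calculation.

```python
from fractions import Fraction
import math

outcomes = [1, 2, 3, 4, 5, 6]
probability = Fraction(1, 6)   # a fair die: every face is equally likely

# Expected value: sum of each outcome weighted by its probability.
expected = sum(x * probability for x in outcomes)

# Variance: expected squared distance from the expected value.
variance = sum((x - expected) ** 2 * probability for x in outcomes)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(expected, variance, std_dev)
```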
Continuous Probability Functions and the Uniform Distribution | Once you have grasped the notion of a discrete random variable, it should be clear that all random variables need not be discrete. For example, consider measuring the atmospheric temperature at some prescribed location. The measured temperature would be random and could take on a continuum of values (theoretically speaking). Under these circumstances, we say that the random variable is continuous. All the machinery developed for discrete random variables (such as expected value and variance) must be extended to continuous random variables to handle this situation. The uniform distribution (which you have programmed using the random module) is an example of a continuous probability distribution. |
|
The Normal Distribution | The normal distribution is an example of a continuous distribution. Because it arises so often when considering empirical measurements, it is fundamental to probability and statistics, and we must devote special attention to it. The normal distribution is used as the basis for many statistical tests. Hence, it is essential to understand its mathematical form, graph, z-score, and the area under the curve. |
|
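The z-score and the area under the normal curve can be explored with Python's built-in statistics module, as in the sketch below. The mean of 170 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution, e.g. heights in centimeters.
heights = NormalDist(mu=170.0, sigma=10.0)

# The z-score measures how many standard deviations x lies from the mean.
x = 185.0
z = (x - heights.mean) / heights.stdev

# The cumulative distribution function gives the area under the curve to the left of x.
area_below = heights.cdf(x)

# Area within one standard deviation of the mean (roughly 68 percent).
within_one_sigma = heights.cdf(180.0) - heights.cdf(160.0)

print(z, area_below, within_one_sigma)
```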
Confidence Intervals | Calculating confidence intervals is fundamental to statistical analyses and statistical inference. This is because statistical calculations such as a mean or a probability are a function of the sample size with respect to the (possibly unknown) size of a larger population. Therefore, you must also include techniques for estimating your confidence in a given value along with the value. As you go deeper into the upcoming units, you will need to understand confidence intervals developed for the normal distribution and the Student's t-distribution. |
|
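As a rough sketch of how a confidence interval for a mean might be computed, the example below builds a 95% interval based on the Student's t-distribution using scipy.stats. The sample values are made up, and the approach assumes the data are approximately normally distributed.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample of measurements.
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])

n = sample.size
mean = sample.mean()

# Standard error of the mean, using the sample standard deviation (ddof=1).
standard_error = sample.std(ddof=1) / np.sqrt(n)

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom.
t_critical = stats.t.ppf(0.975, n - 1)

lower = mean - t_critical * standard_error
upper = mean + t_critical * standard_error

print(mean, lower, upper)
```

For large samples the t critical value approaches the familiar normal value of about 1.96.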
Hypothesis Testing | In addition to calculating confidence intervals, hypothesis testing is another way to make statistical inferences. This process involves considering two opposing hypotheses regarding a given data set (referred to as the null hypothesis and the alternative hypothesis). Hypothesis testing determines whether the data provide enough evidence to reject the null hypothesis in favor of the alternative; otherwise, we fail to reject the null hypothesis. |
|
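One common instance of this procedure is the one-sample t-test, sketched below with scipy.stats. The sample, the hypothesized mean of 4.5, and the 0.05 significance level are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements.
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])

# Null hypothesis: the population mean equals 4.5.
# Alternative hypothesis: the population mean differs from 4.5.
hypothesized_mean = 4.5
result = stats.ttest_1samp(sample, hypothesized_mean)

# A small p-value means the data would be unlikely if the null hypothesis were true.
significance_level = 0.05
reject_null = result.pvalue < significance_level

print(result.statistic, result.pvalue, reject_null)
```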
Linear Regression | At various points throughout the course, it will be necessary to build a statistical model that can be used to estimate or forecast a value based on a set of given values. When one has data where a dependent variable is assumed to depend upon a set of independent variables, linear regression is often applied as the first step in data analysis. This is because parameters for the linear model are very tractable and easily interpreted. You will be implementing this model in various ways using Python. |
|
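As a first taste of implementing a linear model, the sketch below fits a straight line to a handful of points using numpy's polyfit. The data values are invented, and polyfit is only one of several ways to perform this fit.

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent variable.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted line to estimate y at a new x value.
new_x = 5.0
predicted = slope * new_x + intercept

print(slope, intercept, predicted)
```

The slope is read as the estimated change in y for a one-unit change in x, which is why the parameters of this model are easy to interpret.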
4.2: Random Numbers in numpy | Using numpy | Data science requires the ability to process data often presented in the form of arrays. Furthermore, to test models designed for data mining or forecasting, it is often necessary to generate random arrays with desired statistical properties. As a first step toward understanding such implementations, you must learn how to use numpy to create arrays of random numbers and compute basic quantities such as the mean, the median, and the standard deviation. |
Random Number Generation | Going beyond the basics of random numbers in the numpy module, it is important to see examples of how to compute using various distributions introduced at the beginning of this unit. The code introduced in these materials should be viewed as the array extension of scalar capabilities available within the random module. |
|
Using np.random.normal | Since the normal distribution is fundamental and arises so often in the field of statistical modeling, it is sensible to devote some attention to this subject in the context of numpy computations. This overview provides a simple example of how you can combine computation and visualization for statistical analysis. |
|
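The sketch below generates random arrays and computes the basic quantities mentioned in this section. It uses numpy's Generator interface (np.random.default_rng) rather than calling np.random.normal directly; the two are closely related, and the seed, shapes, and distribution parameters are arbitrary choices for the example.

```python
import numpy as np

# A seeded generator makes the random arrays repeatable.
rng = np.random.default_rng(seed=0)

# A 3-by-4 array of uniform samples between 0 and 1.
uniform_samples = rng.uniform(0.0, 1.0, size=(3, 4))

# 10,000 samples from a normal distribution with mean 10 and standard deviation 2.
normal_samples = rng.normal(loc=10.0, scale=2.0, size=10_000)

sample_mean = normal_samples.mean()
sample_median = np.median(normal_samples)
sample_std = normal_samples.std()

print(uniform_samples.shape)
print(sample_mean, sample_median, sample_std)
```

With this many samples, the computed mean and standard deviation should land close to the values requested from the distribution.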
A Data Science Example | By itself, numpy can make various statistical calculations (in the next section, you will see how scipy builds upon this foundation). Try running and studying the code in this project to experience a data science application that analyzes empirical speed of light measurements. As a lead-in to the next unit, you should know three instructions from the pandas module (read_csv, rename, and head). The read_csv method is used to load the data from a file into what is called a pandas "data frame" (analogous to a numpy array, but more general). The rename method is used to rename a column within the data frame. The head method prints out and inspects the first few rows of a data frame containing many rows. These methods will be discussed in more detail in the next unit. For now, try and focus on the data science application and the statistical and plotting methods used to analyze the data. |
|
4.3: The scipy.stats Module | Descriptive Statistics in Python | The scipy module was constructed with numpy as its underlying foundation. The numpy module handles arrays efficiently, and scipy can be applied using a vast set of methods for scientific computation. In this unit, we are primarily concerned with applying the statistical power of the scipy.stats module, which, as you will see in this video, goes beyond the capabilities of numpy. |
Statistical Modeling with scipy | This video will enhance your understanding of how scipy.stats can be used. Use this tutorial to increase your vocabulary for statistical data processing. With this overview, we have enough Python syntax to get some applications up and running. It is best to begin implementations with some exercises in the next section. |
|
Probability Distributions and their Stories | Given any module that deals with statistics, one basic skill you must have is to be able to program and create plots of probability distributions typically encountered in the field of data science. This tutorial should remind you of various distributions introduced in this section, but now they are phrased using the scipy.stats module. |
|
4.4: Data Science Applications | Statistics and Random Numbers | It is time to exercise, reinforce, and apply various topics introduced throughout this unit. Study and practice section 1.6.6 to help review the basics of mixing numpy and scipy.stats for plotting data and running statistical tests. Be sure to import matplotlib so the plotting methods will run without exceptions when they are called. |
Statistics in Python | Watch these videos to refine your Python coding skills with concepts such as modeling distributions from sampled data and confidence intervals. If any of the terms in this section are unfamiliar, go back to the first section of this unit to review. |
|
Probabilistic and Statistical Risk Modeling | Study these slides. In this project, you will apply techniques from this unit to analyze data sets using descriptive statistics and graphical tools. You will also write code to fit (that is, estimate distribution parameters) a probability distribution to the data. Finally, you will learn to code various risk measures based on statistical tests. Upon completing this project, you should have a clearer picture of how you can use Python to perform statistical analyses within the field of data science. |
|
5.1: Dataframes | pandas Dataframes | The next step in our data science journey deals with elevating data sets from arrays to other formats typically encountered in many practical applications. For example, it is very common for data to be housed in the form of a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical (it may hold text, currency, boolean values, etc.). Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated data. CSV files are simply text files whose row elements are separated by commas and can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize the use of dataframes since, in practical applications, data will consist of several columns having different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module. Once import pandas as pd has been invoked, you can download the data from the textbook URL as follows: url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv' followed by df = pd.read_csv(url), which will create a dataframe named df. You can double-check that the correct data has been loaded by executing df.head(10), which will print out the first 10 rows of the dataframe. However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used: #read the data from local drive. This extra care must be taken for local files because Google Colab is a web-based engine and does not know where your local drive is located. This set of commands will generate a basic file interface from which you can select your file named "filename.csv".
You will need to edit the filename to match the name of the file you wish to upload. |
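One possible way to read an uploaded local file in Google Colab is sketched below. It relies on files.upload() from the google.colab package, which returns a mapping from each uploaded filename to its contents as bytes. The helper name dataframe_from_upload is hypothetical, and a stand-in mapping with made-up contents replaces the real upload so the sketch runs outside Colab as well.

```python
import io

import pandas as pd

def dataframe_from_upload(uploaded, filename):
    """Build a dataframe from a {filename: bytes} mapping of uploaded files."""
    return pd.read_csv(io.BytesIO(uploaded[filename]))

# In Colab, the mapping would come from the file-selection dialog:
#   from google.colab import files
#   uploaded = files.upload()
# Here a stand-in mapping with made-up contents is used instead.
uploaded = {
    "filename.csv": b"make,mpg,cyl\nMazda RX4,21.0,6\nDatsun 710,22.8,4\n"
}

df = dataframe_from_upload(uploaded, "filename.csv")

# Inspect the first rows to confirm the data loaded as expected.
print(df.head(10))
```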
How pandas Dataframes Work | There is no substitute for practice when learning a new programming tool. Use this video to exercise, reinforce and fill in any gaps in your understanding of indexing and manipulating dataframes. |
|
5.2: Data Cleaning | Data Cleaning | Data cleaning is one of the initial steps in the data science pipeline. In practical applications, data is not always collected in a pristine form, and the associated dataframe can therefore contain anomalies. There can be missing cells, cells that have nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios. |
More on Data Cleaning | These videos give a few more key examples of applying data cleaning methods. They are meant to serve as a summary and review of all pandas concepts we have discussed in this unit. |
|
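The sketch below shows a few of the pandas methods typically used for cleaning: counting missing cells, marking a nonsensical value as missing, filling gaps, and dropping incomplete rows. The dataframe contents and the choice to fill with the column mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing age, a missing city, and a nonsensical age.
raw = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [34, np.nan, 29, -1],
    "city": ["Oslo", "Lima", None, "Rome"],
})

# Count the missing cells in each column.
missing_per_column = raw.isna().sum()

cleaned = raw.copy()

# Treat negative ages as missing values.
cleaned.loc[cleaned["age"] < 0, "age"] = np.nan

# Fill missing ages with the mean of the remaining valid ages.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())

# Drop any row that still lacks a city.
cleaned = cleaned.dropna(subset=["city"])

print(missing_per_column)
print(cleaned)
```

Whether to fill or drop missing values depends on the application; both choices change the data and should be made deliberately.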
5.3: pandas Operations: Merge, Join, and Concatenate | pandas Data Structures | With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If data is numerical, it is acceptable to think of a dataframe or series as an array, and arithmetic operations obey a set of rules similar to that of numpy. In addition, pandas has been designed with database programming features such as the merge, join, and concatenate operations. Study these sections to practice and get an introduction to some basic pandas operations. |
Pandas Dataframe Operations | Use these materials to practice slicing, indexing, and applying syntax for merging and filtering dataframes. You should recognize a measure of consistency with syntax you already know, such as referencing dictionary keys or array indices. At this point, it should also be clear that the pandas module offers a much larger set of capabilities. |
|
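To see how the database-style operations differ, here is a minimal sketch of merge, concatenate, and join applied to two small tables. The table and column names are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4], "amount": [20.0, 35.0, 15.0, 50.0]})

# Merge matches rows on a shared column; "inner" keeps only ids found in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# "left" keeps every customer, filling in NaN where no order exists.
left = customers.merge(orders, on="customer_id", how="left")

# Concatenate stacks dataframes with the same columns on top of each other.
more_customers = pd.DataFrame({"customer_id": [5], "name": ["Di"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

# Join aligns on the index; here total order amounts are attached to each customer.
totals = orders.groupby("customer_id")["amount"].sum()
joined = customers.set_index("customer_id").join(totals)

print(inner)
print(left)
print(stacked)
print(joined)
```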
5.4: Data Input and Output | Importing and Exporting | We have already shown some initial examples of how to read a CSV file into pandas. This video gives a bit more depth regarding how to import and export dataframes by interacting with your local drive. |
Loading Data into pandas Dataframes | When using Google Colab, some important differences must be considered when handling files (as discussed in the Python review unit). This video discusses various ways to import and export files when dealing with a web-based Python engine. |
|
5.5: Visualization Using the pandas Module | Using pandas to Plot Data | In the next unit, we will discuss the seaborn module for advanced visualization techniques. pandas comes with its own set of visualization and plotting methods (which are mainly derived from matplotlib). A good rule of thumb is that if your data is confined to lists, dictionaries, and numpy arrays, then matplotlib is a good way to go for basic plotting applications. Similarly, pandas offers plotting capabilities for series and dataframes. |
Plotting with pandas | Practice this code to see more examples of plotting using pandas methods. With these fundamentals in place, you will be well positioned for the next unit dealing with advanced visualization techniques. |
|
6.1: The seaborn Module | Visualization with seaborn | seaborn is an advanced visualization module designed to work with pandas dataframes. Follow along with the programming examples for an introduction to seaborn's capabilities. Pay close attention to how it is applied in tandem with pandas. For instance, notice how the fillna method is used for data cleaning. Additionally, observe how powerful seaborn can be, for example, as scatter plots are created for all numeric variables within a dataframe using a single command. |
matplotlib and seaborn | In many Python visualization presentations, you will see an almost "stream of consciousness" movement between matplotlib, pandas, and seaborn. When presented in this way, it can get a little confusing when tutorials jump around from one module to another (as you gain expertise, you most likely will end up doing the same). In this course, we have made extra effort to decouple these modules for you to understand how they work individually. At this advanced stage, however, be prepared for some overlap between various modules when it comes to visualization techniques. Watch this tutorial to practice examples of how matplotlib and seaborn are applied for visualization. |
|
Easy Data Visualization | Watch this tutorial to practice more examples of how pandas and seaborn are applied for visualization. |
|
6.2: Advanced Data Visualization Techniques | Data Visualization in Python | At this point in the course, it is time to begin connecting the dots and applying visualization to your knowledge of statistics. Work through these programming examples to round out your knowledge of seaborn as it is applied to univariate and bivariate plots. |
How to Create a seaborn Boxplot | A tool very often used for plotting the results of statistical experiments is the box plot. It provides a quick visual summary of the maximum, minimum, median, and quartiles. Practice these programming examples to apply various quantities previously introduced in the statistics unit. |
|
Practicing Data Visualization | There is no substitute for plenty of programming practice when connecting statistics and visualization. Follow along with this tutorial to refine your programming skills and review scatter plots, bar plots, pairwise plots, histograms, and box plots. |
|
6.3: Data Science Applications | Visualization Examples | With your knowledge of Python visualization, this video offers some food for thought. You should gauge your confidence level for developing and implementing code to analyze data science problems by watching the examples. |
Using Jupyter | Here is more Python practice with a specific application that applies a suite of programming techniques and commands. At this point in the course, your goal is to assimilate the knowledge presented to begin making higher-level connections between the materials presented in the course units. |
|
Visualizing with seaborn | Here is an example that combines much of what has been introduced within the course using a very practical application. You should view this step as a culminating project for the first six units of this course. You should master the material in this project before moving on to the units on data mining. |
|
7.1: Data Mining Overview | Introduction to Data Mining | Data mining involves various algorithms and techniques for database searching, inferring data relationships, pattern recognition, and pattern classification. Pattern recognition is the process of comparing a sample observation against a fixed set of patterns (like those stored in a database) to search for an optimal match. Face recognition, voice recognition, character recognition, fingerprint recognition, and text string matching are all examples of pattern searching and pattern recognition. |
Introduction to Machine Learning | Machine learning is the aspect of data mining that applies algorithms for learning and inferring relationships within empirical data sets. Since machine learning often involves pattern searching and classification, it is a broad subject that encompasses several approaches for constructing data learning and inference models. |
|
Bayes' Theorem | Pattern search and classification problems often involve observing data subject to some set of conditions. Study the relationship between conditional probability and Bayes' Theorem, as it is the foundational material for data mining. |
|
Bayes' Theorem and Conditional Probability | Here are more examples of applying Bayes' Theorem and conditional probability to data mining. |
|
Methods for Pattern Classification | At the heart of all pattern search or classification problems (either explicitly or implicitly) lies Bayes' Decision Theory. Bayes' decision simply says, given an input observation of unknown classification, make the decision that will minimize the probability of a classification error. For example, in this unit, you will be introduced to the k-nearest neighbor algorithm. It can be demonstrated that this algorithm can make Bayes' decision. Read this chapter to familiarize yourself with Bayes' decision. |
|
7.2: Supervised Learning | Supervised Learning | A set, collection, or database of either pattern or class data is generically referred to as "training data". This is because data mining requires a collection of known or learned examples against which input observations can be compared. For pattern classification, as mentioned in the previous section, there are two broad categories of learned examples: supervised and unsupervised. This unit deals specifically with supervised learning techniques, while the next unit deals with unsupervised learning techniques. Read these basic steps of solving a supervised learning problem. Assuming data has been collected, as this unit progresses, you will understand and be able to implement the process: Training set → Feature selection → Training algorithm → Evaluate model. Our tool for implementing these steps will be the scikit-learn module. |
Feature Selection | Feature selection (or "feature extraction") is the process of taking raw training data and defining data features that represent important characteristics of the data. For example, consider an image recognition application. An image can contain millions of pixels. Yet, our eyes key into specific features that allow our brains to recognize objects within an image. Object edges within an image are key features that can define the shape of an object. The original image consisting of millions of pixels can therefore be reduced to a much smaller set of edges. |
|
Model Inspection and Feature Selection | Once a set of features is chosen, a model must be trained and evaluated. Based on these materials, you should now understand how data mining works. The rest of this unit will introduce some practical techniques and their implementations using scikit-learn. |
|
scikit-learn | The scikit-learn module contains a broad set of methods for statistical analyses and basic machine learning. During the remainder of this unit and the next on unsupervised learning, we will introduce scikit-learn in the context of data mining applications. Use this section as an introduction to see how modules such as pandas can be used in conjunction with the scikit-learn module. Make sure to follow along with the programming examples. There is no substitute for learning by doing. As this course progresses, you will understand more deeply how to apply the methods used in this video. |
|
7.3: Principal Component Analysis | Dimensionality Reduction | As part of the feature optimization process, when faced with a large set of features, a major goal is to determine a combination or mixture of features that leads to optimal model evaluations. You have already seen the subset selection approach, which you can use to reduce the number of features used to describe training set observations. Using the language of vectors and matrices, we can say that we can reduce the dimension of a feature vector if a subset of features is found to give optimal results. Reducing the feature vector dimension is preferable because it directly translates into reduced time to train a given model. Additionally, higher dimensional spaces impede the ability to define distances, as all points in the space begin to appear as if they are all equally close together. |
Principal Component Analysis | Many approaches exist for reducing the dimension of feature vectors while still optimizing model evaluations. The subset selection approach is very useful and regularly applied. On the other hand, this approach may not reveal underlying relationships between the features or describe why certain features work well together while others do not. To do this, it is necessary to develop algorithms and compute recipes for mixing the most relevant features. Principal Component Analysis (PCA) is arguably one of the most popular methodologies for achieving this goal. |
|
PCA in Python | In this section, you will learn how to implement and apply PCA for feature optimization and dimensionality reduction using scikit-learn. |
|
7.4: k-Nearest Neighbors | The k-Nearest Neighbors Algorithm | The k-nearest neighbor (k-NN) algorithm attempts to classify an input feature vector by finding the k closest neighbors in a set of predefined classes. Using the word "closest" automatically means that you must choose some measure of distance to decide the class membership. |
Using the k-NN Algorithm | With your current understanding, it is time to implement the k-NN algorithm using scikit-learn. Follow along with this example to gain programming experience. |
|
Nearest Neighbors | Study this example in depth. Notice it uses the same dataset as the previous example; however, the approach to building the data sets differs. It is important to see different perspectives on solving the same problem. |
|
7.5: Decision Trees | Dealing with Uncertainty | A decision tree is a model of decisions and their outcomes. It has found widespread application because of its ease of implementation. Additionally, its compact tree representation is often useful for visualizing the breadth of possible outcomes. |
Classification, Decision Trees, and k-Nearest-Neighbors | Follow this tutorial to learn how to implement the decision tree. These materials also conveniently review k-NN and discuss the pros and cons of each algorithm. |
|
Decision Trees | It is always best to see different programming examples when learning a new topic. There is no substitute for practice. |
|
7.6: Logistic Regression | Logistic Regression | Logistic regression is a nonlinear modification to linear regression. The logistic function often arises in machine learning applications. |
More on Logistic Regression | Here is an introductory example of how to apply scikit-learn to implement logistic regression. As you follow this programming example, make sure you understand how the variable definitions relate to the algorithm. |
|
Implementing Logistic Regression | This video gives an example of implementing logistic regression. Given the k-NN, decision tree, and logistic regression classifiers, you should begin to see a theme arising based on the supervised learning pipeline. In the next section, we will complete the pipeline by exercising model evaluations using the techniques we've discussed. |
|
7.7: Training and Testing | Supervised Learning and Model Validation | A final step in the supervised learning process is to evaluate a trained model using data not contained within the training set. Use this video to practice programming examples involving training and testing. |
Training and Tuning a Model | Use this video to practice the concepts presented in this unit. This material is crucial as it combines all the steps outlined in the supervised learning section so that they can be implemented using scikit-learn. We will cover implementing linear regression in an upcoming unit. For now, use these examples to learn how to implement and evaluate the different machine learning approaches covered in this unit. |
|
8.1: Unsupervised Learning | Unsupervised Learning | Now that you have had a chance to understand and implement supervised data mining techniques, we can move on to unsupervised techniques. Unsupervised learning assumes no labels for training observations. We let the data tell us what its classification should be. This can be done using many approaches, but we will focus on clustering techniques in this course. |
More on Unsupervised Learning | We will continue to use scikit-learn for implementations. As you can see, there are several methods contained within the module. This unit will focus on K-means and agglomerative clustering. Follow along with the code for implementing these methods and begin to get used to the new syntax. As the next sections unfold, the meaning of the instructions related to clustering will become clearer. |
|
8.2: K-means Clustering | K-means Clustering | The K-means algorithm attempts to optimally partition a data set into K clusters. The number of clusters, K, must be input to the algorithm (although many variations exist that attempt to estimate K). The main concept to grasp is the centroid of a set of training vectors. Assuming each training vector contains d features (that is, d-dimensional training vectors), a mean vector (or "centroid") for a set of vectors can be formed by computing the empirical mean of each component separately. This is how computing the mean generalizes from scalar data to vector data. |
More on K-means Clustering | You are now in a position to draw a direct line between the algorithm and its associated Python implementation. This particular example creates a training set using random data so that it becomes obvious how the algorithm works. |
|
Implementing K-means Clustering | This tutorial is an excellent exercise for your Python coding skills because it shows how to implement the K-means algorithm from scratch and then implement it using scikit-learn. You will also learn how to evaluate clustering performance as a function of the parameter K. This is an important new step because the number of clusters is the biggest unknown behind this algorithm. |
|
Interpreting the Results of Clustering | Here is an example of applying K-means to cluster customer data. Study the code in depth to learn how to use visualization for interpreting the clustering results. |
|
PCA and Clustering | This tutorial introduces examples, including the analysis of handwritten digits, and then applies PCA to reduce the dimensionality of the data set. Observe how it connects with programming concepts introduced in the previous unit dealing with PCA. |
|
8.3: Hierarchical Clustering | Hierarchical Clustering | In this section, you will learn about hierarchical clustering and, in particular, agglomerative clustering. In contrast to K-means, this methodology does not require you to know the number of clusters in advance. This information is generated from a dendrogram created by the algorithm. As clusters of points are created, notions of the distance between two sets (that is, the "linkage") must be understood when applying this algorithm. You should already know how to compute the Euclidean distance between two points. This article also points out that there are many ways to compute the distance between points (Manhattan, maximum, Mahalanobis, etc.). We can also use these functions for point distances to compute the distance between two sets of points. For example, single linkage computes set distances by choosing the two closest points. Complete linkage chooses the two most distant points. Average linkage computes the average of all distances between all points from both sets, and so on. Read through this article to get an overview of hierarchical clustering. |
Hierarchical Clustering Using Trees | Here is a visual introduction to hierarchical clustering that walks you through a practical example. |
|
Agglomerative Clustering | Work through this example to draw a line from the agglomerative clustering algorithm to its equivalent Python implementation using scikit-learn. Pay attention to how the data sets are created and how they relate to each iteration as the clusters gradually form. Use the dendrogram to determine the best number of clusters and compare your result to the distribution of the original data. Try to take the extra step of generating a scatter plot using your visualization knowledge. |
|
Applying Clustering | This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also includes applications of various accuracy measures useful for analyzing clustering performance. |
|
Comparing Agglomerative and K-means Clustering | This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also includes applications of various accuracy measures useful for analyzing clustering performance. |
|
8.4: Training and Testing | Clustering with scikit-learn | It is time to put together concepts from this and the previous unit. This tutorial uses k-NN as the classifier, given clustering results from the K-means algorithm. In essence, an unsupervised method is used as the input to a method that requires supervised data. |
Putting It All Together | This tutorial is a culminating project that combines the concepts (clustering, dimensionality reduction, cluster evaluation, and visualization) presented in this unit. Work through the programming examples to ensure you have a complete understanding. |
|
9.1: Linear Regression | Simple Linear Regression | Now that data mining algorithms and, in particular, supervised learning concepts have been covered, it is time to address the construction of statistical models. The subject of linear regression has been mentioned in a perfunctory way at several points throughout the course. In this unit, we will delve more deeply into this technique. In its simplest form, the goal is to optimally identify the slope and intercept for empirical data assumed to depend linearly upon some independent variable. Linear regression is a statistical supervised learning technique because training data for the independent variable is mapped to data associated with the dependent variable. Once the linear model is created, obtaining estimates for data not contained within the training set becomes possible. Ensure you understand the examples and associated calculations in the video, such as residuals, the correlation coefficient, and the coefficient of determination. Additionally, if necessary, you may want to review hypothesis testing and tests for significance introduced in the statistics unit. After this video, you will learn how to implement this technique using scikit-learn. However, as a programming exercise, you should feel confident in writing code to implement the regression equations. |
Implementing Simple Linear Regression with scikit-learn | This tutorial runs through the basics of scikit-learn syntax for linear regression. Pay close attention to how the data is generated for this example. Notice how list comprehension is used to create a list for the dependent and independent variables. Furthermore, the dependent variable list is formed by adding random Gaussian noise to each value in the independent variable list. These lists are then converted to numpy arrays to train the linear regression model. By construction, there exists a linear relationship between the independent and dependent variables. Linear regression is then used to identify the slope and intercept, which should match the empirical data. Finally, predictions of the dependent variable are made using independent variable data not contained in the training set. |
|
Practicing Linear Regression | Follow this example for more practice with linear regression implementation. Like the previous example, by construction, data is generated having a linear relationship. However, notice that the data generation technique is quite different from the previous example, as numpy methods are used directly to generate random arrays. In addition, this example multiplies the slope by a small amount of random noise (rather than adding noise to the linear model as is usually assumed in the linear regression derivation). |
|
Multiple Linear Regression | Up to this point, we have discussed linear regression for a single independent variable. Watch this video to see how to extend these ideas to multiple linear regression, which constructs linear models using multiple independent variables. |
|
Multiple Regression in scikit-learn | The LinearRegression class in scikit-learn can handle multiple independent variables to perform multiple linear regression. Follow this tutorial, which combines your knowledge of pandas with scikit-learn. |
|
9.2: Residuals | The Assumptions of Simple Linear Regression | The key to creating any statistical model is to verify if the model actually explains the data. In addition to simple visual inspection, residuals provide a pathway for making a rigorous estimate of model accuracy when applying any form of regression. Read this overview of how residuals are applied. |
Residual Plots and Regression | Use this video to tie up any conceptual loose ends and see more examples of how residuals can help evaluate model accuracy. |
|
Simple Linear Regression Project | Work through this project to put together the concepts introduced so far. To navigate to the notebook portion of the project, click on the SLRProject.ipynb link near the top of the page to run it in Google Colab. Assuming the command import pandas as pd has been invoked, you can run data = pd.read_csv("http://www.econometrics.com/intro/SALES.txt") followed by data.head() to access the dataset and see the first few lines printed to the screen. |
|
9.3: Overfitting | Overfitting | Always be suspicious of a perfect fit for your data in machine learning problems. A model that fits a training set well but gives poor testing results is said to overfit the training data. This caution applies to any learning model. We introduce it here as a means of connecting concepts together with the data mining units. Read the following article for an overview of overfitting. |
Overfitting in a Learning Model | Follow this practice example to see how overfitting can occur within a learning model. |
|
9.4: Cross-Validation | What is Cross-Validation? | Read through this article for a brief visual summary of cross-validation. |
More on Cross-Validation | Cross-validation is a technique for validating learning models. Up until this point in the course, model evaluations have only been applied using a single test (usually by splitting up a data set into a training set and a test set). In practice, a statistical distribution of test results must be constructed. Only then can confidence intervals be applied to the resulting distribution. Read through this article to understand cross-validation. |
|
Cross-Validation in Machine Learning | Work through this programming example in order to implement a cross-validation scheme on a scikit-learn data set you have seen in the previous units. |
|
Statistical Modeling Project | Use this project as a culminating exercise to implement the concepts presented in this unit. This exercise will show you how to obtain a data set, create the model, examine residuals, visualize results, validate the model and apply the model. |
|
10.1: The statsmodels Module | Introduction to statsmodels | Many Python modules that have statistical capabilities are not completely disjoint. You probably have noticed that there is some measure of overlap between scipy.stats, numpy, pandas, and scikit-learn (for example, scipy.stats can perform linear regression using the linregress method). This is to simplify the import process when making basic statistical calculations on arrays and dataframe data. On the other hand, there comes a point where major differences become obvious. This motivating example compares the functionality of the linregress method against the ols method from statsmodels. Follow this tutorial to see how the statsmodels module improves upon a module such as scipy.stats when building statistical models. |
Regression Using statsmodels | This example is similar to the previous but constructs a simple data set to easily digest the report results generated by statsmodels. |
|
Using scikit-learn with statsmodels | This tutorial is designed to help you jump from the scikit-learn module to statsmodels. Practice the code examples in order to thoroughly grasp the differences. The housing dataset USA_Housing.csv in this tutorial is available here or on the Kaggle website, as mentioned in the video. You can download this file to your local drive. If you are using Google Colab, you can use the instructions outlined in subunit 5.1 of this course for loading a local file. |
|
10.2: Autoregressive (AR) Models | Time Series Basics | A time series is a set of points that are ordered in time. Each time point is usually assigned an integer index to indicate its position within the series. For example, you can construct a time series by measuring and computing an average daily temperature. When the outcome of the next point in a time series is unknown, the time series is said to be random or "stochastic" in nature. A simple example would be creating a time series from a sequential set of coin flips with outcomes of either heads or tails. A more practical example is the time series of prices of a given stock. |
Autoregressive Models | This article delves a bit deeper into the mathematics behind AR models. You may notice a common theme developing where, as with linear regression, the least squares approach is used (in this case, to identify the model coefficients from empirical time series data). |
|
Time Series and Forecasting | This tutorial introduces time series analysis and concludes with coding the AR model using statsmodels. Follow along with the programming example for practice. Note that statsmodels.tsa.ar_model.AR has been deprecated in favor of statsmodels.tsa.ar_model.AutoReg due to processing improvements within statsmodels. |
|
10.3: Moving Average (MA) Models | Moving-Average Models | Since AR models only look back over a finite number of samples, they need time to adjust to unexpected shocks in a time series. You must model past instances of the input noise to handle unforeseen shocks. Moving average (MA) models can be used for this purpose. Read this article to learn the general structure of the MA model. |
MA Model Examples | This tutorial provides several examples of MA models of various orders. In addition, the partial autocorrelation function (PACF) is introduced. The autocorrelation function (ACF) and the PACF are important tools for estimating the order of a model based on empirical data. |
|
AR and MA Models | This video summarizes the key points regarding AR and MA models. In general, stationary time series modeling requires a balance between these two approaches. In the next section, you will learn how to combine them and apply them in time series analysis. |
|
10.4: Autoregressive Integrated Moving Average (ARIMA) Models | ARIMA Models | The autoregressive integrated moving average (ARIMA) model is an approach for nonstationary time series. It applies a combination of AR and MA modeling to balance out time series variances that can occur within a stochastic process. Additionally, it is often possible to convert a nonstationary time series to a stationary series by taking successive differences. The "I" in ARIMA stands for "integrated" and refers to this differencing step; the parameter d is the number of differences needed to eliminate nonstationary behavior. Read this article to get an overview of the mathematical form of the ARIMA(p,d,q) approach to model building. Take note of how you can use various choices of the p, d, and q parameters to form AR, MA, ARMA, or ARIMA models. |
ARIMA in Python | Use this tutorial to implement an ARIMA model and make forecasts. General reference is made to a data set, but you must obtain your own CSV file for actual data. A great source for data scientists is Kaggle. With your current expertise, you should be able to search for and download a .csv file with stock price data that is not too large (<50MB). Additionally, as illustrated in the tutorial, you can apply pandas to extract a column of data. |
|
ARIMA and Seasonal ARIMA Models | This tutorial delves a bit deeper into statistical models. Study it to better understand the ARIMA and seasonal ARIMA models. Consider closely the discussion of how to apply the ACF and PACF to estimate the order parameters for a given model. In practical circumstances, this is an important question as it is often the case that such parameters would initially be unknown. |
|
ARIMA(p,d,q) | Here is a practical application of the ARIMA model. Although this tutorial makes brief references to the R language, you should use it to tie together the concepts (AR, MA, ACF, and PACF) presented in this unit. |
|
Time Series Forecasting with ARIMA | This tutorial demonstrates how to implement the models and forecasting discussed in this unit. Since we are using Google Colab, you can jump to Step 2 to begin this programming example. Upon completing this tutorial, you should be able to construct models, make forecasts and validate forecasts given a time series data set. |
|
Study Guide | CS250 Study Guide | |
Course Feedback Survey | Course Feedback Survey |
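
As a small illustration of two ideas from Unit 8, the centroid described in subunit 8.2 (the empirical mean of each component) and the single, complete, and average linkage distances described in subunit 8.3 can be sketched in a few lines. This is a minimal sketch, not the code used in the course tutorials: the function names `centroid` and `linkage_distance` and the two toy clusters are hypothetical, Euclidean distance is assumed, and only the Python standard library is used rather than scikit-learn.

```python
from math import dist          # Euclidean distance between two points (Python 3.8+)
from statistics import mean


def centroid(vectors):
    """Return the mean vector: the mean of each component taken separately."""
    # zip(*vectors) groups the first components together, then the second, etc.
    return [mean(component) for component in zip(*vectors)]


def linkage_distance(set_a, set_b, method="single"):
    """Distance between two sets of points under a given linkage (hypothetical helper)."""
    # All pairwise Euclidean distances between a point in set_a and a point in set_b.
    pairwise = [dist(a, b) for a in set_a for b in set_b]
    if method == "single":      # the two closest points
        return min(pairwise)
    if method == "complete":    # the two most distant points
        return max(pairwise)
    if method == "average":     # average over all pairs
        return mean(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Toy data (made up for illustration): two small 2-dimensional clusters.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

print("centroid of cluster_a:", centroid(cluster_a))
for method in ("single", "complete", "average"):
    print(method, "linkage:", linkage_distance(cluster_a, cluster_b, method))
```

In scikit-learn, the analogous choice is made through the `linkage` parameter of `AgglomerativeClustering`; the sketch above only shows what each choice measures.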