CS250: More on K-means Clustering | Saylor Academy

K-means is an unsupervised learning algorithm that partitions data points into k clusters, with each observation assigned to the cluster whose mean (centroid) is nearest.

Mathematical Model

The K-means algorithm clusters data by separating samples into k groups so as to minimize a criterion known as the inertia, or within-cluster sum of squares:

$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_{i}} \left\| x - \mu_{i} \right\|^{2}$

where:

$S = \{S_{1}, \dots, S_{k}\}$ partition of the observations into $k$ sets (clusters)

$k$ number of clusters

$x$ observation (data point)

$\mu_{i}$ mean of the points in $S_{i}$
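
To make the criterion concrete, the following is a minimal sketch that evaluates the within-cluster sum of squares for a given assignment of points to clusters. The function name `inertia` and the four sample points are assumptions chosen for illustration; they are not part of the original example.

```python
import numpy as np


def inertia(points, labels, k):
    """Within-cluster sum of squares for a given cluster assignment."""
    total = 0.0
    for i in range(k):
        # Points currently assigned to cluster i (the set S_i).
        members = points[labels == i]
        if len(members) == 0:
            continue
        # mu_i: mean of the points in S_i.
        mu_i = members.mean(axis=0)
        # Add the squared Euclidean distances to the cluster mean.
        total += np.sum((members - mu_i) ** 2)
    return total


# Illustrative data: two tight pairs of points, far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Grouping nearby points together gives a small criterion value.
inertia_good = inertia(points, np.array([0, 0, 1, 1]), k=2)

# Grouping distant points together gives a much larger value.
inertia_bad = inertia(points, np.array([0, 1, 0, 1]), k=2)

print(inertia_good, inertia_bad)
```

The assignment that keeps nearby points together yields the smaller value, which is the assignment the minimization favors.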

Model Training Process

Roughly, the model training process is as follows:

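1. Select initial cluster centroids.
2. Assign each data point to the nearest centroid as measured by Euclidean distance.
3. Compute new cluster centroids as the mean of the points assigned to each cluster.
4. Repeat steps 2-3 until the cluster assignments stop changing or a maximum number of iterations is reached. This yields a local, not necessarily global, minimum of the within-cluster variance.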
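
The steps above can be sketched directly in NumPy. This is a simplified illustration, not the scikit-learn implementation: the function name `k_means`, the random choice of initial centroids from the data, the handling of empty clusters, and the sample points are all assumptions made for the example.

```python
import numpy as np


def k_means(points, k, max_iterations=100, seed=0):
    """Simplified k-means following the four steps of the training process."""
    rng = np.random.default_rng(seed)

    # Step 1: pick k distinct data points as the initial centroids (an assumption;
    # other initialization schemes exist).
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i)
            else centroids[i]
            for i in range(k)
        ])

        # Step 4: stop once the centroids no longer move.
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break

    return labels, centroids


# Illustrative data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the result depends on the initial centroids, practical implementations commonly run the procedure several times from different starting points and keep the run with the lowest within-cluster variance.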

Python Example


"""
k_means_clustering_with_scikit_learn.py
clusters data points and displays the results
for more information on the k-means algorithm, see:
https://scikit-learn.org/stable/modules/clustering.html#k-means
"""

# Import needed libraries.
import matplotlib.pyplot as plotlib
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Set number of samples and clusters.
number_of_samples = 2000
number_of_clusters = 5

# Set the random seed used for both data generation and
# cluster center initialization.
random_state = 150

# Set standard deviations for data point creation.
# make_blobs generates 3 blobs by default, so one value is given per blob.
standard_deviation = [1.0, 2.5, 0.5]

# Create data points for clustering.
# For details on the make_blobs function, see:
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
X_varied, y_varied = make_blobs(n_samples=number_of_samples,
                                cluster_std=standard_deviation,
                                random_state=random_state)

# Create clusters.
# For details on the KMeans function, see:
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
# n_init is set explicitly because its default differs across
# scikit-learn versions.
y_pred = KMeans(n_clusters=number_of_clusters,
                n_init=10,
                random_state=random_state).fit_predict(X_varied)

# Plot the clustered data points.
plotlib.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plotlib.title("Data in " + str(number_of_clusters) + " Clusters")

# Display the plot.
plotlib.show()

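Results are below:

[Figure: scatter plot of the generated data points, colored by assigned cluster, titled "Data in 5 Clusters"]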

Source: Don Cowan, https://www.ml-science.com/k-means-clustering
This work is licensed under a Creative Commons Attribution 4.0 License.

Last modified: Tuesday, 27 September 2022, 6:41 PM
