Clustering with scikit-learn

It is time to put together concepts from this and the previous unit. This tutorial first groups data with the K-means clustering algorithm and then uses k-NN as the classifier on top of the clustering results. In essence, the output of an unsupervised method serves as input to a method that requires supervised (labeled) data; a short sketch combining the two steps appears at the end of this tutorial.

K-means Clustering

We will use array objects from the Python module numpy:

import numpy

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

To use the K-means clustering algorithm from Scikit-learn, we import it and specify the number of clusters (that is, the k) as well as a random state, which makes the initialization of the cluster centroids reproducible. We assume that the data can be grouped into two clusters:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

We can now apply the clustering algorithm to the data points in X:

kmeans.fit(X)

print(kmeans.labels_)
[0 0 0 1 1 1]

The output above shows the assignment of data points to clusters.
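
As a small illustration, we can pair each data point with its assigned cluster label:

for point, label in zip(X, kmeans.labels_):
    print(point, "-> cluster", label)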

We can now use the model to predict the cluster assignment of new data points:

print(kmeans.predict([[0, 0], [4, 4]]))
[0 1]

We can also output the centroids of the two clusters:

print(kmeans.cluster_centers_)
[[ 1.  2.]
 [ 4.  2.]]
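
A centroid is simply the mean of the points assigned to its cluster, and predict assigns a point to the cluster with the closest centroid. Here is a minimal sketch that verifies both claims on our toy data:

for label in range(2):
    members = X[kmeans.labels_ == label]
    # each centroid equals the mean of its cluster's members
    print(label, members.mean(axis=0))

point = numpy.array([0., 0.])
distances = numpy.linalg.norm(kmeans.cluster_centers_ - point, axis=1)
# index of the closest centroid, matching kmeans.predict([[0, 0]]) above
print(distances.argmin())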

K-Nearest Neighbors Classification

If we want to use the k-Nearest Neighbors algorithm in Scikit-learn, we need to import the KNeighborsClassifier from the neighbors submodule:

from sklearn.neighbors import KNeighborsClassifier

We instantiate a k-NN classifier with k = 3:

KNNClassifier = KNeighborsClassifier(n_neighbors=3)

We use the following dataset X and class vector y:

X = [[0, 1], [1, 1], [2, 4], [3, 4]]
y = [0, 0, 1, 1]

We train the classifier:

KNNClassifier.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

We ask the classifier to suggest a class for an unseen vector:

print(KNNClassifier.predict([[1.1, 0.9]]))
[0]
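
To see where this suggestion comes from, here is a minimal sketch that recomputes the three nearest neighbors by hand, using the Euclidean distance (the classifier's default Minkowski metric with p=2):

query = numpy.array([1.1, 0.9])
distances = [numpy.linalg.norm(numpy.array(x) - query) for x in X]
# the three closest training points and their classes: two of the three
# nearest neighbors are in class 0, so the majority vote is class 0
print(sorted(zip(distances, y))[:3])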

It can also give us the estimated probability of a data point belonging to each of the classes. With n_neighbors=3, these probabilities are simply the vote fractions among the three nearest neighbors:

print(KNNClassifier.predict_proba([[2.9, 3.1]]))
[[ 0.33333333  0.66666667]]

We might not have class assignments for a sample set at all. If we just want to find the closest data point to a query, we can use the unsupervised NearestNeighbors model from the same submodule. Here is a sample set:

samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]

We can now fit a nearest neighbor model with k = 1:

from sklearn.neighbors import NearestNeighbors

NNModel = NearestNeighbors(n_neighbors=1)
NNModel.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=1, p=2, radius=1.0)

We could ask for the nearest neighbor of a concrete data-point:

print(NNModel.kneighbors([[1., 1., 1.]]))
(array([[ 0.5]]), array([[2]]))

The returned result, [[0.5]] and [[2]], means that the nearest neighbor of the query is the third sample in samples (the one at index 2) and that the distance between the two is 0.5. One can also query for the nearest neighbors of several data points at once; in the following call, the distances are suppressed by passing return_distance=False:

X = [[0., 1., 0.], [1., 0., 1.]]

NNModel.kneighbors(X, return_distance=False)
array([[1],
       [2]])
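
Finally, to close the loop on the idea from the introduction, here is a minimal sketch that feeds K-means cluster labels into a k-NN classifier as if they were supervised class labels, using the same toy data as above:

import numpy
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = numpy.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Step 1: unsupervised clustering produces pseudo-labels for the data
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)

# Step 2: the pseudo-labels serve as the class vector for a supervised classifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X, labels)

# new points are classified according to the learned cluster structure,
# e.g. [0 1] if the clusters were numbered as above
print(classifier.predict([[0, 0], [4, 4]]))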


Source: Damir Cavar, http://damir.cavar.me/pynotebooks/Python_Clustering_with_Scikit-learn.html
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
