Implementing Logistic Regression

This video gives an example of implementing logistic regression. Given the K-NN, decision tree, and logistic regression classifiers, you should begin to see a theme emerging based on the supervised learning pipeline. In the next section, we will complete the pipeline by performing model evaluation using the techniques we've discussed.


Data set

We are going to build a logistic regression model for the iris data set. Its features are sepal length, sepal width, petal length, and petal width, and its target classes are setosa, versicolor, and virginica. Because the target has three classes, we would need to build three different binary classification models with logistic regression. To keep things simple, I will drop the virginica class from the data set and make it a binary data set.

Iris data set

Pre-processing

Let's remember the logistic regression equation first.

z = w0 + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4)

y = 1 / (1 + e^(-z))

x1 stands for sepal length; x2 stands for sepal width; x3 stands for petal length; x4 stands for petal width.

The output y is the probability of a class. If it gets closer to 1, the instance is classified as versicolor, whereas it is classified as setosa when the probability gets closer to 0.
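To make that concrete, here is a tiny sketch (my own illustration, using made-up z values rather than learned weights) of how the sigmoid squeezes z into a probability between 0 and 1:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# illustrative z values only; the real z comes from the weighted sum above
for z in [-4, -1, 0, 1, 4]:
    print(z, round(sigmoid(z), 3))
# -4 -> 0.018 (setosa side), 0 -> 0.5, 4 -> 0.982 (versicolor side)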

The output is unitless. If the left side of the equation has no unit, then its right side must not have a unit either. Let's focus on the z equation. The x1 term stands for sepal length, and its unit is centimeters. To make z unitless, the product of w1 and x1 has to be unitless as well.

We can divide the x1 term by its standard deviation to get rid of the unit, because the standard deviation has the same unit as its feature. Alternatively, we can feed x1 as is and find w1 first; its unit becomes 1/centimeters in this case, so multiplying w1 by the standard deviation of x1 afterwards works as well. I prefer to apply the first approach in this study.
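To see why the two approaches are equivalent, write s1 for the standard deviation of x1 (s1 is a symbol I'm introducing here, not from the original equations). Since s1 also carries centimeters, the term can be regrouped without changing its value:

w1 * x1 = (w1 * s1) * (x1 / s1)

Either the feature is divided by s1 up front, or the learned coefficient is multiplied by s1 afterwards; both leave a unitless product.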


Loading the data set

Luckily, sklearn offers the iris data set as an out-of-the-box function.

from sklearn.datasets import load_iris
import pandas as pd
 
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
 
x, y = load_iris(return_X_y=True)
df = pd.DataFrame(x, columns = feature_names)
df['target'] = y
print(df.head())

As I mentioned before, I'm going to drop the virginica class from the data set to make it a binary classification problem.

 
#0: setosa, 1: versicolor, 2: virginica
df = df[df['target'] != 2]
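As a quick sanity check (this snippet is my addition, not part of the original walkthrough), the filtered frame should now hold 100 rows split evenly between the two remaining classes:

print(df.shape)                     # expected: (100, 5) - 4 features plus the target
print(df['target'].value_counts())  # expected: 50 instances of class 0 and 50 of class 1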


Normalize inputs

I'm going to iterate over the columns and divide each value by the standard deviation of its column. In this way, the features become unitless.

for feature_name in feature_names:
    df[feature_name] = df[feature_name] / df[feature_name].std()

Some researchers subtract the mean of the column from each value first, then divide it by the standard deviation. Both approaches will work.
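For reference, that mean-and-standard-deviation variant corresponds to scikit-learn's StandardScaler. The sketch below is my addition and would replace the division loop above rather than run on top of it (note that StandardScaler uses the population standard deviation, so the numbers differ slightly from pandas' default):

from sklearn.preprocessing import StandardScaler

# z-score scaling: subtract each column's mean, then divide by its standard deviation
scaler = StandardScaler()
df[feature_names] = scaler.fit_transform(df[feature_names])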


Modelling

We've finished pre-processing the data set. We have the unitless features and binary class values in the target. We can build a logistic regression model now.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0).fit(df[feature_names].values, df['target'].values)
 
score = model.score(df[feature_names].values, df['target'].values)
print(score)

I got 100% accuracy for 100 instances. Of course, that's the training set accuracy; I should split the data set into train, test, and validation sets, but this is an experimental study, so I skip those stages. I am not concerned with overfitting in this study.
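For completeness, here is a minimal sketch of how such a split could be done with scikit-learn's train_test_split. This is my addition, and eval_model is a throwaway name so that the full-data model above stays untouched:

from sklearn.model_selection import train_test_split

# hold out 30% of the rows for testing; stratify keeps the two classes balanced
x_train, x_test, y_train, y_test = train_test_split(
    df[feature_names].values, df['target'].values,
    test_size=0.3, random_state=0, stratify=df['target'].values)

eval_model = LogisticRegression(random_state=0).fit(x_train, y_train)
print(eval_model.score(x_test, y_test))  # accuracy on the held-out test set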


Prediction

The built model already stores the intercept and coefficients. Let's focus on those parameters to understand the algorithm well.

# the intercept (w0) and the learned coefficients (w1..w4)
w0 = model.intercept_[0]
w = model.coef_[0]
print("z = %f + (%f * x1) + (%f * x2) + (%f * x3) + (%f * x4)"
      % (w0, w[0], w[1], w[2], w[3]))

The logistic regression model has the following equation:

z = -0.102763 + (0.444753 * x1) + (-1.371312 * x2) + (1.544792 * x3) + (1.590001 * x4)

Let's predict an instance based on the built model.

idx = 99
x = df.iloc[idx][feature_names].values
y = model.predict_proba(x.reshape(1, -1))[0]
print(y[1])

The prediction for the 100th instance (notice that the index starts at 0) is 0.9782192589879745 based on the predict_proba function. We can find the same value based on the equation.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# z = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4, then squeeze it with the sigmoid
result = w0
for i in range(0, 4):
    result += x[i] * w[i]

result = sigmoid(result)
print(result)

This calculates the result 0.9782192589879745 as well.
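As a final check (my addition), you can confirm programmatically that the manual computation matches the library's output:

import numpy as np

# result comes from the manual sigmoid above, y[1] from predict_proba
print(np.isclose(result, y[1]))  # expected: True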


Source: Sefik Ilkin Serengil, https://sefiks.com/2021/01/06/feature-importance-in-logistic-regression/
This work is licensed under a Creative Commons Attribution 4.0 License.
