Evaluating Regression Models: Metrics and Loss Functions

Site: Saylor University
Course: CS207: Fundamentals of Machine Learning
Book: Evaluating Regression Models: Metrics and Loss Functions
Printed by: Guest user
Date: Wednesday, April 15, 2026, 7:12 PM

Description

Introduction

Performance metrics are vital for supervised machine learning models. To be sure that your model is doing well in its predictions, you need to evaluate the model. Our goal is to identify how well the model performs on new data.

There are some evaluation metrics that can help you determine whether the model's predictions are accurate to a certain level of performance.


Source: Marcin Rutecki, https://www.kaggle.com/code/marcinrutecki/regression-models-evaluation-metrics/notebook#2.-Regression-Evaluation-Metrics
Licensed under the Apache License, Version 2.0 (the "License").

Regression Evaluation Metrics

All of these are loss functions, because we want to minimize them.

 

Mean Absolute Error (MAE)

is the mean of the absolute value of the errors:

\(\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|\)

The Mean absolute error represents the average of the absolute difference between the actual and predicted values in the dataset. It measures the average of the residuals in the dataset.

 

Mean Squared Error (MSE)

is the mean of the squared errors:

\(\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2\)

Mean Squared Error represents the average of the squared difference between the original and predicted values in the data set. It measures the variance of the residuals.

 

Root Mean Squared Error (RMSE)

is the square root of the mean of the squared errors:

\(\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}\)

RMSE measures the standard deviation of residuals.

 

R-squared (Coefficient of determination)

  1. SST (or TSS)

The Sum of Squares Total/Total Sum of Squares (SST or TSS) is the squared differences between the observed dependent variable and its mean.

  1. SSR (or RSS)

Sum of Squares Regression (SSR or RSS) is the sum of the differences between the predicted value and the mean of the dependent variable.

  1. SSE (or ESS)

Sum of Squares Error (SSE or ESS) is the difference between the observed value and the predicted value.

Coefficient of Determination scatter plot. Vertical lines and labels show SST, SSE, and SSR, representing variance measures.

The coefficient of determination or R-squared represents the proportion of the variance in the dependent variable which is explained by the linear regression model. When R² is high, it represents that the regression can capture much of variation in observed dependent variables. That’s why we can say the regression model performs well when R² is high.

\(R^2 = 1- \frac {SSR}{SST}\)

It is a scale-free score i.e. irrespective of the values being small or large, the value of R square will be less than one. One misconception about regression analysis is that a low R-squared value is always a bad thing. For example, some data sets or fields of study have an inherently greater amount of unexplained variation. In this case, R-squared values are naturally going to be lower. Investigators can make useful conclusions about the data even with a low R-squared value.

\(R^2 = 1\)

All the variation in the y values is accounted for by the x values.

R2=1.jpeg

\(R^2=0.83\)

83 % of the variation in the y values is accounted for by the x values.

R2=0.8.jpeg

\(R^2=0\)

None of the variation in the y values is accounted for by the x values.

R2=0.jpeg

 

Adjusted R squared

\(R^2_{adj.} = 1 - (1-R^2)*\frac{n-1}{n-p-1}\)

Adjusted R squared is a modified version of R square, and it is adjusted for the number of independent variables in the model, and it will always be less than or equal to R².In the formula below n is the number of observations in the data and k is the number of the independent variables in the data.

 

Cross-validated R2

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validated R2 is a median value or R2 taken from cross-validation procedure.

CV.png

 

Regression Evaluation Metrics - Conclusion

Both RMSE and R-Squared quantifies how well a linear regression model fits a dataset. When assessing how well a model fits a dataset, it’s useful to calculate both the RMSE and the R2 value because each metric tells us something different.

  • RMSE tells us the typical distance between the predicted value made by the regression model and the actual value.

  • R2 tells us how well the predictor variables can explain the variation in the response variable.

Adding more independent variables or predictors to a regression model tends to increase the R2 value, which tempts makers of the model to add even more variables. Adjusted R2 is used to determine how reliable the correlation is and how much it is determined by the addition of independent variables. It is always lower than the R2.

Set-up

Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import cross_val_score

from sklearn import metrics
from collections import Counter

 

Defining function for regression metrics

def Reg_Models_Evaluation_Metrics (model,X_train,y_train,X_test,y_test,y_pred):
    cv_score = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 10)
    
    # Calculating Adjusted R-squared
    r2 = model.score(X_test, y_test)
    # Number of observations is the shape along axis 0
    n = X_test.shape[0]
    # Number of features (predictors, p) is the shape along axis 1
    p = X_test.shape[1]
    # Adjusted R-squared formula
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    R2 = model.score(X_test, y_test)
    CV_R2 = cv_score.mean()

    return R2, adjusted_r2, CV_R2, RMSE
    
    print('RMSE:', round(RMSE,4))
    print('R2:', round(R2,4))
    print('Adjusted R2:', round(adjusted_r2, 4) )
    print("Cross Validated R2: ", round(cv_score.mean(),4) )

 

Data Sets Characteristics

Avocado Prices

https://www.kaggle.com/datasets/neuromusic/avocado-prices

Some relevant columns in the dataset:

  • Date - The date of the observation

  • AveragePrice - the average price of a single avocado

  • type - conventional or organic

  • year - the year

  • Region - the city or region of the observation

  • Total Volume - Total number of avocados sold

  • 4046 - Total number of avocados with PLU 4046 sold

  • 4225 - Total number of avocados with PLU 4225 sold

  • 4770 - Total number of avocados with PLU 4770 sold

Missing values: None

Duplicate entries: None

Boston House Prices

https://www.kaggle.com/datasets/vikrishnan/boston-house-prices

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository1): CRIM: per capita crime rate by town

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centres
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per 10 000 USD
  • PTRATIO pupil-teacher ratio by town
  • B 1000 (Bk - 0.63)^2 where Bk is the proportion of black people by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000's

Missing values: None

Duplicate entries: None

This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

 

Import Data

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

try:
    raw_df1 = pd.read_csv('../input/avocado-prices/avocado.csv')
    raw_df2 = pd.read_csv('../input/boston-house-prices/housing.csv', header = None, delimiter = r"\s+", names = column_names)
except:
    raw_df1 = pd.read_csv('avocado.csv')
    raw_df2 = pd.read_csv('housing.csv', header = None, delimiter = r"\s+", names = column_names)
# Deleting column
raw_df1 = raw_df1.drop('Unnamed: 0', axis = 1)
numeric_columns = ['AveragePrice', 'Total Volume','4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
categorical_columns = ['Region', 'Type']
time_columns = ['Data', 'Year']
numeric_columns_boston = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

Some visualisations

Avocado Prices

# Checking for distributions
def dist_custom(dataset, columns_list, rows, cols, suptitle):
    fig, axs = plt.subplots(rows, cols,figsize=(16,16))
    fig.suptitle(suptitle,y=1, size=25)
    axs = axs.flatten()
    for i, data in enumerate(columns_list):
        sns.kdeplot(dataset[data], ax=axs[i], fill=True,  alpha=.5, linewidth=0)
        axs[i].set_title(data + ', skewness is '+str(round(dataset[data].skew(axis = 0, skipna = True),2)))

dist_custom(dataset=raw_df1, columns_list=numeric_columns, rows=3, cols=3, suptitle='Avocado Prices: distibution for each numeric variable')
plt.tight_layout()
Nine density plots visualizing avocado price and volume data. Each plot shows a skewed distribution with text labels.

 

Boston House Prices

dist_custom(dataset=raw_df2, columns_list=numeric_columns_boston, rows=4, cols=3, suptitle='Boston House Prices: distibution for each numeric variable')
plt.tight_layout()
Twelve density plots visualize Boston housing prices for each numeric value.

Data pre-processing

Some transformations

# Changing data types
for i in raw_df1.columns:
    if i == 'Date':
        raw_df1[i] = raw_df1[i].astype('datetime64[ns]')
    elif raw_df1[i].dtype == 'object':
        raw_df1[i] = raw_df1[i].astype('category')
df1 = raw_df1.copy()

df1['Date'] = pd.to_datetime(df1['Date'])
df1['month'] = df1['Date'].dt.month

df1['Spring'] = df1['month'].between(3,5,inclusive='both')
df1['Summer'] = df1['month'].between(6,8,inclusive='both')
df1['Fall'] = df1['month'].between(9,11,inclusive='both')
# df1['Winter'] = df1['month'].between(12,2,inclusive='both')

df1.Spring = df1.Spring.replace({True: 1, False: 0})
df1.Summer = df1.Summer.replace({True: 1, False: 0})
df1.Fall = df1.Fall.replace({True: 1, False: 0})
# Encoding labels for 'type'
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1['type'] = le.fit_transform(df1['type'])

# Encoding 'region' (One Hot Encoding)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', handle_unknown='ignore')
ohe = pd.get_dummies(data=df1, columns=['region'])

df1 = ohe.drop(['Date','4046','4225','4770','Small Bags','Large Bags','XLarge Bags'], axis=1)

 

Outlier detection and removal

We have a significant problems with outliers in both data sets:

  • most of the distributions are not normal;

  • huge outliers;

  • higly right-skeved data in Avocado Prices data set;

  • a lot of outliers.

Tukey’s (1977) technique is used to detect outliers in skewed or non bell-shaped data since it makes no distributional assumptions. However, Tukey’s method may not be appropriate for a small sample size. The general rule is that anything not in the range of (Q1 - 1.5 IQR) and (Q3 + 1.5 IQR) is an outlier, and can be removed.

Inter Quartile Range (IQR) is one of the most extensively used procedure for outlier detection and removal.

Procedure:

  1. Find the first quartile, Q1.
  2. Find the third quartile, Q3.
  3. Calculate the IQR. IQR = Q3-Q1.
  4. Define the normal data range with lower limit as Q1–1.5 IQR and upper limit as Q3+1.5 IQR.

For oulier detection methods look here: https://www.kaggle.com/code/marcinrutecki/outlier-detection-methods

def IQR_method (df,n,features):
    """
    Takes a dataframe and returns an index list corresponding to the observations 
    containing more than n outliers according to the Tukey IQR method.
    """
    outlier_list = []
    
    for column in features:
                
        # 1st quartile (25%)
        Q1 = np.percentile(df[column], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[column],75)
        
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determining a list of indices of outliers
        outlier_list_column = df[(df[column] < Q1 - outlier_step) | (df[column] > Q3 + outlier_step )].index
        
        # appending the list of outliers 
        outlier_list.extend(outlier_list_column)
        
    # selecting observations containing more than x outliers
    outlier_list = Counter(outlier_list)        
    multiple_outliers = list( k for k, v in outlier_list.items() if v > n )
    
    # Calculate the number of records below and above lower and above bound value respectively
    df1 = df[df[column] < Q1 - outlier_step]
    df2 = df[df[column] > Q3 + outlier_step]
    
    print('Total number of deleted outliers:', df1.shape[0]+df2.shape[0])
    
    return multiple_outliers
numeric_columns2 = ['Total Volume', 'Total Bags']

Outliers_IQR = IQR_method(df1,1,numeric_columns2)
# dropping outliers
df1 = df1.drop(Outliers_IQR, axis = 0).reset_index(drop=True)
Total number of deleted outliers: 2533
numeric_columns2 = ['CRIM', 'ZN', 'NOX', 'RM', 'AGE', 'DIS', 'PTRATIO', 'B', 'LSTAT']

Outliers_IQR = IQR_method(raw_df2,1,numeric_columns2)
# dropping outliers
df2 = raw_df2.drop(Outliers_IQR, axis = 0).reset_index(drop=True)
Total number of deleted outliers: 7

 

Train test split

X = df1.drop('AveragePrice', axis=1)
y = df1['AveragePrice']

X2 = raw_df2.iloc[:, :-1]
y2 = raw_df2.iloc[:, -1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size = 0.3, random_state = 42)

 

Feature scaling

from sklearn.preprocessing import StandardScaler

# Creating function for scaling
def Standard_Scaler (df, col_names):
    features = df[col_names]
    scaler = StandardScaler().fit(features.values)
    features = scaler.transform(features.values)
    df[col_names] = features
    
    return df
col_names = ['Total Volume', 'Total Bags']
X_train = Standard_Scaler (X_train, col_names)
X_test = Standard_Scaler (X_test, col_names)

col_names = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X_train2 = Standard_Scaler (X_train2, col_names)
X_test2 = Standard_Scaler (X_test2, col_names)

Comparing different models

Linear Regression

from sklearn.linear_model import LinearRegression

# Creating and training model
lm = LinearRegression()
lm.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(X_test)

 

Linear Regression performance for Avocado dataset

ndf = [Reg_Models_Evaluation_Metrics(lm,X_train,y_train,X_test,y_test,y_pred)]

lm_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
lm_score.insert(0, 'Model', 'Linear Regression')
lm_score
Linear Regression Model Performance Metrics
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Linear Regression 0.598793 0.593598 0.604281 0.255931


plt.figure(figsize = (10,5))
sns.regplot(x=y_test,y=y_pred)
plt.title('Linear regression for Avocado dataset', fontsize = 20)
Text(0.5, 1.0, 'Linear regression for Avocado dataset')

Linear Regression performance for Boston dataset

lm.fit(X_train2, y_train2)
y_pred = lm.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(lm,X_train2,y_train2,X_test2,y_test2,y_pred)]

lm_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
lm_score2.insert(0, 'Model', 'Linear Regression')
lm_score2
Comparison of different regression models based on R2, adjusted R2, cross-validated R2, and RMSE scores.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Linear Regression 0.679168 0.648945 0.687535 4.889394


Random Forest

from sklearn.ensemble import RandomForestRegressor

# Creating and training model
RandomForest_reg = RandomForestRegressor(n_estimators = 10, random_state = 0)

Random Forest performance for Avocado dataset

RandomForest_reg.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = RandomForest_reg.predict(X_test)
ndf = [Reg_Models_Evaluation_Metrics(RandomForest_reg,X_train,y_train,X_test,y_test,y_pred)]

rf_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rf_score.insert(0, 'Model', 'Random Forest')
rf_score
Performance metrics for Random Forest model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest 0.78712 0.784363 0.876525 0.186426

 

Random Forest performance for Boston dataset

RandomForest_reg.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = RandomForest_reg.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(RandomForest_reg,X_train2,y_train2,X_test2,y_test2,y_pred)]

rf_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rf_score2.insert(0, 'Model', 'Random Forest')
rf_score2
Performance metrics for Random Forest model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest 0.838576 0.823369 0.817514 3.468169

 

Ridge Regression

from sklearn.linear_model import Ridge

# Creating and training model
ridge_reg = Ridge(alpha=3, solver="cholesky")

 

Ridge Regression performance for Avocado dataset

ridge_reg.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = ridge_reg.predict(X_test)
ndf = [Reg_Models_Evaluation_Metrics(ridge_reg,X_train,y_train,X_test,y_test,y_pred)]

rr_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rr_score.insert(0, 'Model', 'Ridge Regression')
rr_score
Performance metrics for Ridge Regression model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Ridge Regression 0.598733 0.593537 0.604317 0.25595

 

Ridge Regression performance for Boston dataset

ridge_reg.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = ridge_reg.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(ridge_reg,X_train2,y_train2,X_test2,y_test2,y_pred)]

rr_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rr_score2.insert(0, 'Model', 'Ridge Regression')
rr_score2
Performance metrics for Ridge Regression model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Ridge Regression 0.678696 0.648428 0.689293 4.892991

 

XGBoost

from xgboost import XGBRegressor
# create an xgboost regression model
XGBR = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.8, colsample_bytree=0.8)

 

XGBoost performance for Avocado dataset

XGBR.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = XGBR.predict(X_test)
ndf = [Reg_Models_Evaluation_Metrics(XGBR,X_train,y_train,X_test,y_test,y_pred)]

XGBR_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
XGBR_score.insert(0, 'Model', 'XGBoost')
XGBR_score
Performance metrics for XGBoost model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 XGBoost 0.798641 0.796034 0.911125 0.181311

 

XGBoost performance for Boston dataset

XGBR.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = XGBR.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(XGBR,X_train2,y_train2,X_test2,y_test2,y_pred)]

XGBR_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
XGBR_score2.insert(0, 'Model', 'XGBoost')
XGBR_score2
Performance metrics for XGBoost model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 XGBoost 0.901889 0.892646 0.845593 2.70381

 

Recursive Feature Elimination (RFE)

RFE is a wrapper-type feature selection algorithm. This means that a different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features.

Random Forest has usually good performance combining with RFE

from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

# create pipeline
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=60)
model = RandomForestRegressor()
rf_pipeline = Pipeline(steps=[('s',rfe),('m',model)])

 

Random Forest RFE performance for Avocado dataset

rf_pipeline.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = rf_pipeline.predict(X_test)
ndf = [Reg_Models_Evaluation_Metrics(rf_pipeline,X_train,y_train,X_test,y_test,y_pred)]

rfe_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rfe_score.insert(0, 'Model', 'Random Forest with RFE')
rfe_score
Performance metrics for Random Forest with RFE model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest with RFE 0.800169 0.797581 0.889159 0.180622

 

Random Forest RFE performance for Boston dataset

# create pipeline
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=8)
model = RandomForestRegressor()
rf_pipeline = Pipeline(steps=[('s',rfe),('m',model)])

rf_pipeline.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = rf_pipeline.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(rf_pipeline,X_train2,y_train2,X_test2,y_test2,y_pred)]

rfe_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
rfe_score2.insert(0, 'Model', 'Random Forest with RFE')
rfe_score2
Performance metrics for Random Forest with RFE model including R2, adjusted R2, cross-validated R2, and RMSE.
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest with RFE 0.839377 0.824246 0.82114 3.45955

Final Model Evaluation

Avocado dataset

predictions = pd.concat([rfe_score, XGBR_score, rr_score, rf_score, lm_score], ignore_index=True, sort=False)
predictions
Comparative Performance of Regression Models
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest with RFE 0.800169 0.797581 0.889159 0.180622
1 XGBoost 0.798641 0.796034 0.911125 0.181311
2 Ridge Regression 0.598733 0.593537 0.604317 0.255950
3 Random Forest 0.787120 0.784363 0.876525 0.186426
4 Linear Regression 0.598793 0.593598 0.604281 0.255931

 

Boston dataset

predictions2 = pd.concat([rfe_score2, XGBR_score2, rr_score2, rf_score2, lm_score2], ignore_index=True, sort=False)
predictions2
Regression Model Performance Metrics
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Random Forest with RFE 0.839377 0.824246 0.821140 3.459550
1 XGBoost 0.901889 0.892646 0.845593 2.703810
2 Ridge Regression 0.678696 0.648428 0.689293 4.892991
3 Random Forest 0.838576 0.823369 0.817514 3.468169
4 Linear Regression 0.679168 0.648945 0.687535 4.889394

 

Visualizing Model Performance

f, axe = plt.subplots(1,1, figsize=(18,6))

predictions.sort_values(by=['Cross Validated R2 Score'], ascending=False, inplace=True)

sns.barplot(x='Cross Validated R2 Score', y='Model', data = predictions, ax = axe)
axe.set_xlabel('Cross Validated R2 Score', size=16)
axe.set_ylabel('Model')
axe.set_xlim(0,1.0)

axe.set(title='Model Performance for Avocado dataset')

plt.show()
Horizontal bar chart comparing the performance of 5 models on an Avocado dataset; XGBoost highest, Linear Regression lowest.

f, axe = plt.subplots(1,1, figsize=(18,6))

predictions2.sort_values(by=['Cross Validated R2 Score'], ascending=False, inplace=True)

sns.barplot(x='Cross Validated R2 Score', y='Model', data = predictions2, ax = axe)
axe.set_xlabel('Cross Validated R2 Score', size=16)
axe.set_ylabel('Model')
axe.set_xlim(0,1.0)

axe.set(title='Model Performance for Boston dataset')
plt.show()
Horizontal bar chart showing cross-validated R2 scores of 5 models. XGBoost performs best, Linear Regression performs worse

Bonus: hyperparameter Tuning Using GridSearchCV

Hyperparameter tuning is the process of tuning the parameters present as the tuples while we build machine learning models. These parameters are defined by us. Machine learning algorithms never learn these parameters. These can be tuned in different step.

GridSearchCV is a technique for finding the optimal hyperparameter values from a given set of parameters in a grid. It's essentially a cross-validation technique. The model as well as the parameters must be entered. After extracting the best parameter values, predictions are made.

The "best" parameters that GridSearchCV identifies are technically the best that could be produced, but only by the parameters that you included in your parameter grid.


Tuned Ridge Regression

from sklearn.preprocessing import PolynomialFeatures

# Polynomial features are those features created by raising existing features to an exponent. 
# For example, if a dataset had one input feature X, 
# then a polynomial feature would be the addition of a new feature (column) where values were calculated by squaring the values in X, e.g. X^2.

steps = [
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge(alpha=3.8, fit_intercept=True))
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = ridge_pipe.predict(X_test)
from sklearn.model_selection import GridSearchCV

alpha_params = {'model__alpha': list(range(1, 15))}

clf = GridSearchCV(ridge_pipe, alpha_params, cv = 10)


Tuned Ridge Regression performance for Avocado dataset

# Fit and tune model
clf.fit(X_train, y_train)
# Model making a prediction on test data
y_pred = ridge_pipe.predict(X_test)
# The combination of hyperparameters along with values that give the best performance of our estimate specified
print(clf.best_params_)
{'model__alpha': 1}
ndf = [Reg_Models_Evaluation_Metrics(clf,X_train,y_train,X_test,y_test,y_pred)]

clf_score = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
clf_score.insert(0, 'Model', 'Tuned Ridge Regression')
clf_score
Tuned Ridge Regression Model Performance Metrics
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Tuned Ridge Regression 0.736622 0.733212 0.739008 0.210438


Tuned Ridge Regression performance for Boston dataset

steps = [
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge(alpha=3.8, fit_intercept=True))
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train2, y_train2)

# Model making a prediction on test data
y_pred = ridge_pipe.predict(X_test2)

alpha_params = {'model__alpha': list(range(1, 15))}

clf = GridSearchCV(ridge_pipe, alpha_params, cv = 10)
# Fit and tune model
clf.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = ridge_pipe.predict(X_test2)
# The combination of hyperparameters along with values that give the best performance of our estimate specified
print(clf.best_params_)
{'model__alpha': 12}
ndf = [Reg_Models_Evaluation_Metrics(clf,X_train2,y_train2,X_test2,y_test2,y_pred)]

clf_score2 = pd.DataFrame(data = ndf, columns=['R2 Score','Adjusted R2 Score','Cross Validated R2 Score','RMSE'])
clf_score2.insert(0, 'Model', 'Tuned Ridge Regression')
clf_score2
Tuned Ridge Regression: Final Model Performance Metrics
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Tuned Ridge Regression 0.793267 0.773792 0.844628 3.965999

Final performance comparison

Avocado data set

result = pd.concat([clf_score, predictions], ignore_index=True, sort=False)
result
Regression Model Performance Comparison
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Tuned Ridge Regression 0.736622 0.733212 0.739008 0.210438
1 XGBoost 0.798641 0.796034 0.911125 0.181311
2 Random Forest with RFE 0.800169 0.797581 0.889159 0.180622
3 Random Forest 0.787120 0.784363 0.876525 0.186426
4 Ridge Regression 0.598733 0.593537 0.604317 0.255950
5 Linear Regression 0.598793 0.593598 0.604281 0.255931

 

f, axe = plt.subplots(1,1, figsize=(18,6))

result.sort_values(by=['Cross Validated R2 Score'], ascending=False, inplace=True)

sns.barplot(x='Cross Validated R2 Score', y='Model', data = result, ax = axe)
#axes[0].set(xlabel='Region', ylabel='Charges')
axe.set_xlabel('Cross Validated R2 Score', size=16)
axe.set_ylabel('Model')
axe.set_xlim(0,1.0)
axe.set(title='Model Performance for Avocado dataset')

plt.show()

 

Model performance of avocado data. Models are ranked by cross-validated R2 scores: XGBoost and Random Forest with RFE are top

 

Boston data set

result = pd.concat([clf_score2, predictions2], ignore_index=True, sort=False)
result
Regression Model Performance Metrics (Final Comparison)
Model R2 Score Adjusted R2 Score Cross Validated R2 Score RMSE
0 Tuned Ridge Regression 0.793267 0.773792 0.844628 3.965999
1 XGBoost 0.901889 0.892646 0.845593 2.703810
2 Random Forest with RFE 0.839377 0.824246 0.821140 3.459550
3 Random Forest 0.838576 0.823369 0.817514 3.468169
4 Ridge Regression 0.678696 0.648428 0.689293 4.892991
5 Linear Regression 0.679168 0.648945 0.687535 4.889394

 

f, axe = plt.subplots(1,1, figsize=(18,6))

result.sort_values(by=['Cross Validated R2 Score'], ascending=False, inplace=True)

sns.barplot(x='Cross Validated R2 Score', y='Model', data = result, ax = axe)
#axes[0].set(xlabel='Region', ylabel='Charges')
axe.set_xlabel('Cross Validated R2 Score', size=16)
axe.set_ylabel('Model')
axe.set_xlim(0,1.0)
axe.set(title='Model Performance for Boston dataset')

plt.show()

 

Boston dataset model performance. Horizontal bar chart shows XGBoost with highest score, and Linear Regression with lowest