Comparing different models
Linear Regression
from sklearn.linear_model import LinearRegression

# Creating and training model
lm = LinearRegression()
lm.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(X_test)
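The helper `Reg_Models_Evaluation_Metrics` used throughout this page is defined elsewhere in the notebook. Judging from the score-table columns, it returns the test R2, adjusted R2, cross-validated R2, and RMSE. A plausible reconstruction is sketched below on synthetic data; the function body and the `make_regression` setup are assumptions, not the author's exact code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

def reg_models_evaluation_metrics(model, X_train, y_train, X_test, y_test, y_pred):
    """Sketch of the notebook's helper: [R2, adjusted R2, CV R2, RMSE]."""
    r2 = r2_score(y_test, y_pred)
    n, p = X_test.shape
    # Adjusted R2 penalises R2 for the number of predictors p
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    # 5-fold cross-validated R2 on the training split
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring='r2').mean()
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return [r2, adj_r2, cv_r2, rmse]

# Synthetic stand-in for the Avocado/Boston data
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lm = LinearRegression().fit(X_train, y_train)
metrics = reg_models_evaluation_metrics(lm, X_train, y_train,
                                        X_test, y_test, lm.predict(X_test))
```

Wrapping the returned list in a one-row `pd.DataFrame`, as the cells below do, then gives the score tables shown on this page.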
Linear Regression performance for Avocado dataset
ndf = [Reg_Models_Evaluation_Metrics(lm, X_train, y_train, X_test, y_test, y_pred)]
lm_score = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
lm_score.insert(0, 'Model', 'Linear Regression')
lm_score

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.598793 | 0.593598 | 0.604281 | 0.255931 |
plt.figure(figsize = (10,5))
sns.regplot(x=y_test,y=y_pred)
plt.title('Linear regression for Avocado dataset', fontsize = 20)

Linear Regression performance for Boston dataset
lm.fit(X_train2, y_train2)

# Model making a prediction on test data
y_pred = lm.predict(X_test2)

ndf = [Reg_Models_Evaluation_Metrics(lm, X_train2, y_train2, X_test2, y_test2, y_pred)]
lm_score2 = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
lm_score2.insert(0, 'Model', 'Linear Regression')
lm_score2

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.679168 | 0.648945 | 0.687535 | 4.889394 |
Random Forest
from sklearn.ensemble import RandomForestRegressor

# Creating and training model
RandomForest_reg = RandomForestRegressor(n_estimators=10, random_state=0)
Random Forest performance for Avocado dataset
RandomForest_reg.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = RandomForest_reg.predict(X_test)

ndf = [Reg_Models_Evaluation_Metrics(RandomForest_reg, X_train, y_train, X_test, y_test, y_pred)]
rf_score = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rf_score.insert(0, 'Model', 'Random Forest')
rf_score

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Random Forest | 0.78712 | 0.784363 | 0.876525 | 0.186426 |
Random Forest performance for Boston dataset
RandomForest_reg.fit(X_train2, y_train2)

# Model making a prediction on test data
y_pred = RandomForest_reg.predict(X_test2)

ndf = [Reg_Models_Evaluation_Metrics(RandomForest_reg, X_train2, y_train2, X_test2, y_test2, y_pred)]
rf_score2 = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rf_score2.insert(0, 'Model', 'Random Forest')
rf_score2

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Random Forest | 0.838576 | 0.823369 | 0.817514 | 3.468169 |
Ridge Regression
from sklearn.linear_model import Ridge

# Creating and training model
ridge_reg = Ridge(alpha=3, solver="cholesky")
Ridge Regression performance for Avocado dataset
ridge_reg.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = ridge_reg.predict(X_test)

ndf = [Reg_Models_Evaluation_Metrics(ridge_reg, X_train, y_train, X_test, y_test, y_pred)]
rr_score = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rr_score.insert(0, 'Model', 'Ridge Regression')
rr_score

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Ridge Regression | 0.598733 | 0.593537 | 0.604317 | 0.25595 |
Ridge Regression performance for Boston dataset
ridge_reg.fit(X_train2, y_train2)

# Model making a prediction on test data
y_pred = ridge_reg.predict(X_test2)

ndf = [Reg_Models_Evaluation_Metrics(ridge_reg, X_train2, y_train2, X_test2, y_test2, y_pred)]
rr_score2 = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rr_score2.insert(0, 'Model', 'Ridge Regression')
rr_score2

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Ridge Regression | 0.678696 | 0.648428 | 0.689293 | 4.892991 |
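The cells above fix the regularisation strength at `alpha=3`; in practice one would usually tune it by cross-validation. A minimal sketch using scikit-learn's `RidgeCV` on synthetic data (the alpha grid and data setup are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try regularisation strengths from 1e-3 to 1e3 and keep the best by CV
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge_cv.fit(X_train, y_train)

best_alpha = ridge_cv.alpha_  # the alpha that scored best in cross-validation
```

The fitted `ridge_cv` can then be evaluated on the test split exactly like `ridge_reg` above.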
XGBoost
from xgboost import XGBRegressor

# Create an XGBoost regression model
XGBR = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.8, colsample_bytree=0.8)
XGBoost performance for Avocado dataset
XGBR.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = XGBR.predict(X_test)

ndf = [Reg_Models_Evaluation_Metrics(XGBR, X_train, y_train, X_test, y_test, y_pred)]
XGBR_score = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
XGBR_score.insert(0, 'Model', 'XGBoost')
XGBR_score

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.798641 | 0.796034 | 0.911125 | 0.181311 |
XGBoost performance for Boston dataset
XGBR.fit(X_train2, y_train2)

# Model making a prediction on test data
y_pred = XGBR.predict(X_test2)

ndf = [Reg_Models_Evaluation_Metrics(XGBR, X_train2, y_train2, X_test2, y_test2, y_pred)]
XGBR_score2 = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
XGBR_score2.insert(0, 'Model', 'XGBoost')
XGBR_score2

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.901889 | 0.892646 | 0.845593 | 2.70381 |
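With one score DataFrame per model, the comparison is easiest to read as a single table. A sketch of how the per-model frames built above could be combined with `pd.concat` (the inlined R2 values here are just the two Avocado scores from the tables above, used as stand-ins for the full frames):

```python
import pandas as pd

# Stand-ins for lm_score, rf_score, ... produced by the cells above
lm_score = pd.DataFrame([{'Model': 'Linear Regression', 'R2 Score': 0.598793}])
rf_score = pd.DataFrame([{'Model': 'Random Forest', 'R2 Score': 0.787120}])

# Stack the one-row frames into a single comparison table
results = pd.concat([lm_score, rf_score], ignore_index=True)

# Pick out the best-scoring model by test R2
best = results.loc[results['R2 Score'].idxmax(), 'Model']  # 'Random Forest'
```

Sorting `results` by any of the four metric columns gives a quick ranking across all models and both datasets.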
Recursive Feature Elimination (RFE)
RFE is a wrapper-type feature selection algorithm. This means that a separate machine learning algorithm sits at the core of the method: RFE wraps it and uses it to rank features, repeatedly discarding the least important ones until the desired number remains.
Random Forest usually performs well when combined with RFE.
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
# create pipeline
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=60)
model = RandomForestRegressor()
rf_pipeline = Pipeline(steps=[('s',rfe),('m',model)])
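Once the pipeline is fitted, the RFE step records which features survived the elimination. A minimal self-contained sketch on synthetic data (feature counts and estimator sizes are illustrative, not the notebook's settings):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, of which only 3 are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

rfe = RFE(estimator=RandomForestRegressor(n_estimators=20, random_state=0),
          n_features_to_select=3)
pipe = Pipeline(steps=[('s', rfe),
                       ('m', RandomForestRegressor(n_estimators=20, random_state=0))])
pipe.fit(X, y)

# Boolean mask over the input columns: True = kept by RFE
kept = pipe.named_steps['s'].support_
```

The companion attribute `ranking_` gives each feature's elimination rank (1 for every kept feature), which is useful when deciding whether `n_features_to_select` was set too aggressively.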
Random Forest RFE performance for Avocado dataset
rf_pipeline.fit(X_train, y_train)

# Model making a prediction on test data
y_pred = rf_pipeline.predict(X_test)

ndf = [Reg_Models_Evaluation_Metrics(rf_pipeline, X_train, y_train, X_test, y_test, y_pred)]
rfe_score = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rfe_score.insert(0, 'Model', 'Random Forest with RFE')
rfe_score

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Random Forest with RFE | 0.800169 | 0.797581 | 0.889159 | 0.180622 |
Random Forest RFE performance for Boston dataset
# create pipeline
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=8)
model = RandomForestRegressor()
rf_pipeline = Pipeline(steps=[('s',rfe),('m',model)])
rf_pipeline.fit(X_train2, y_train2)
# Model making a prediction on test data
y_pred = rf_pipeline.predict(X_test2)
ndf = [Reg_Models_Evaluation_Metrics(rf_pipeline, X_train2, y_train2, X_test2, y_test2, y_pred)]
rfe_score2 = pd.DataFrame(data=ndf, columns=['R2 Score', 'Adjusted R2 Score', 'Cross Validated R2 Score', 'RMSE'])
rfe_score2.insert(0, 'Model', 'Random Forest with RFE')
rfe_score2

| | Model | R2 Score | Adjusted R2 Score | Cross Validated R2 Score | RMSE |
|---|---|---|---|---|---|
| 0 | Random Forest with RFE | 0.839377 | 0.824246 | 0.82114 | 3.45955 |