Introduction

How can we derive new features from existing ones in datasets such as customer transactions or medical records? Why does selecting the right features impact both accuracy and efficiency?

The resource covers feature selection techniques, which help identify the most informative features while discarding redundant ones. As you study these methods, think about the balance between having more features and ensuring model simplicity and interpretability.

In this course, we have already seen several key machine learning algorithms. However, before moving on to the fancier ones, we'd like to take a small detour and talk about data preparation. The well-known principle of "garbage in, garbage out" applies fully to any task in machine learning. Any experienced professional can recall numerous times when a simple model trained on high-quality data outperformed a complicated multi-model ensemble built on data that wasn't clean.

To start, I want to review three related but distinct tasks:

feature extraction and feature engineering: transformation of raw data into features suitable for modeling;
feature transformation: transformation of data to improve the accuracy of the algorithm;
feature selection: removing unnecessary features.
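
The three tasks above can be sketched on a tiny, made-up table. This is only an illustration: the column names, the values, and the choice of techniques (parsing a timestamp, a log transform, dropping constant columns) are assumptions made for this example, not the methods applied to the real data later in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; names and values are illustrative only.
raw = pd.DataFrame(
    {
        "created": ["2016-06-24 07:54:24", "2016-06-12 12:19:27", "2016-04-17 03:26:41"],
        "description": ["Sunny loft near park", "Studio", "Spacious two bedroom with balcony"],
        "price": [3000, 1500, 5465],
        "source": ["web", "web", "web"],  # constant column, carries no information
    }
)

# 1. Feature extraction / engineering: derive model-ready columns from raw fields.
features = pd.DataFrame(
    {
        "created_hour": pd.to_datetime(raw["created"]).dt.hour,
        "description_words": raw["description"].str.split().str.len(),
        "price": raw["price"],
        "source": raw["source"],
    }
)

# 2. Feature transformation: put prices on a log scale, which is a common
#    way to tame skewed monetary values.
features["log_price"] = np.log1p(features["price"])

# 3. Feature selection: drop columns that take a single value.
constant_cols = [col for col in features.columns if features[col].nunique() <= 1]
selected = features.drop(columns=constant_cols)

print(selected)
```

Here the constant `source` column is the only one removed; the other columns, including the derived ones, are kept.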

This article will contain almost no math, but there will be a fair amount of code. Some examples will use the dataset from the company Renthop, which is used in the Two Sigma Connect: Rental Listing Inquiries Kaggle competition. The file train.json is also kept here as renthop_train.json.gz (so do unpack it first). In this task, you need to predict the popularity of a new rental listing, i.e., classify the listing into one of three classes: ['low', 'medium', 'high']. To evaluate the solutions, we will use the log loss metric (the smaller, the better). Those who do not have a Kaggle account will have to register; you will also need to accept the rules of the competition in order to download the data.
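
To make the metric concrete, here is a minimal sketch of multiclass log loss: the average, over listings, of the negative logarithm of the probability the model assigned to the true class. The function name `multiclass_log_loss`, the `CLASSES` ordering, and the clipping constant are assumptions made for this illustration; in practice a library implementation would normally be used.

```python
import numpy as np

# Assumed column order of the predicted probabilities.
CLASSES = ["low", "medium", "high"]


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    # Clip to avoid log(0), then renormalize so each row sums to 1.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Pick, for every row, the probability of its true class.
    true_idx = np.array([CLASSES.index(label) for label in y_true])
    picked = proba[np.arange(len(true_idx)), true_idx]
    return float(-np.mean(np.log(picked)))


y_true = ["low", "high"]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]

print(multiclass_log_loss(y_true, uniform))    # equals log(3) for a uniform guess
print(multiclass_log_loss(y_true, confident))  # lower, since the true classes get 0.9
```

A model that spreads its probability evenly over the three classes scores log(3), so any useful model should score below that.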

# preload dataset automatically, if not already in place.
import os
from pathlib import Path
from pprint import pprint
import numpy as np
import pandas as pd
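
With these imports in place, loading the listings might look like the sketch below. The helper name `load_listings` and the file location are assumptions for this example. pandas infers gzip compression from a `.gz` extension, so the same call should work for the compressed file and for an unpacked `.json` file.

```python
from pathlib import Path

import pandas as pd


def load_listings(path):
    """Read the rental listings JSON (optionally gzip-compressed) into a DataFrame."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; download the competition data and place it there first."
        )
    # Compression is inferred from the file extension (.gz -> gzip).
    return pd.read_json(path)


# Hypothetical usage, assuming the file sits in the working directory:
# df = load_listings("renthop_train.json.gz")
# print(df.shape)
```

Raising early on a missing file is a design choice here, so that a forgotten download shows up as a clear message instead of a parsing error.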

Source: Yury Kashnitsky, https://mlcourse.ai/book/topic06/topic6_feature_engineering_feature_selection.html
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.