Introduction

In this course, we have already seen several key machine learning algorithms. However, before moving on to the fancier ones, we'd like to take a small detour and talk about data preparation. The well-known principle of "garbage in, garbage out" applies fully to any machine learning task. Any experienced professional can recall numerous times when a simple model trained on high-quality data outperformed a complicated multi-model ensemble built on data that wasn't clean.

To start, I want to review three similar but distinct tasks:

  • feature extraction and feature engineering: transformation of raw data into features suitable for modeling;

  • feature transformation: transformation of data to improve the accuracy of the algorithm;

  • feature selection: removing unnecessary features.
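To make the distinction between the three tasks concrete, here is a minimal sketch on a toy table. The data, the column names (`description`, `price`, `country`), and the specific choices (text length as a feature, a log transform, a zero-variance filter) are assumptions made purely for illustration; they are not taken from the Renthop dataset or from the methods discussed later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; the column names are illustrative only.
raw = pd.DataFrame(
    {
        "description": ["Sunny loft", "Cozy studio near park", "Large 2BR"],
        "price": [3000.0, 1800.0, 4500.0],
        "country": ["US", "US", "US"],
    }
)

# 1. Feature extraction / engineering: turn raw fields into numeric columns.
features = pd.DataFrame(
    {
        "desc_len": raw["description"].str.len(),
        "price": raw["price"],
        "is_us": (raw["country"] == "US").astype(int),  # constant in this toy data
    }
)

# 2. Feature transformation: compress the price scale with log(1 + x).
features["log_price"] = np.log1p(features["price"])
features = features.drop(columns="price")

# 3. Feature selection: drop columns that never vary, since they cannot
#    help a model tell listings apart.
variances = features.var()
selected = features.loc[:, variances > 0]

print(selected.columns.tolist())
```

Here the constant `is_us` column is removed in the last step, while the two informative columns survive. In practice each step is usually more involved, and libraries such as scikit-learn offer ready-made tools for scaling and for selecting features.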

This article will contain almost no math, but there will be a fair amount of code. Some examples will use the dataset from the Renthop company, which is used in the Two Sigma Connect: Rental Listing Inquiries Kaggle competition. The file train.json is also available as renthop_train.json.gz (so unpack it first). In this task, you need to predict the popularity of a new rental listing, i.e. classify the listing into one of three classes: ['low', 'medium', 'high']. To evaluate the solutions, we will use the log loss metric (the smaller, the better). Those who do not have a Kaggle account will have to register; you will also need to accept the rules of the competition in order to download the data.
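Since log loss is the evaluation metric, a short sketch of how it behaves may help. The function below is a hand-written NumPy version for illustration, not the competition's scoring code; the helper name `multiclass_log_loss`, the class order, and the toy probabilities are all assumptions. In practice, `sklearn.metrics.log_loss` computes the same quantity.

```python
import numpy as np

# Assumed column order of the predicted probabilities.
CLASSES = ["low", "medium", "high"]


def multiclass_log_loss(y_true, proba, classes=CLASSES, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalise each row to sum to 1.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Column index of the true class for every row.
    idx = np.array([classes.index(label) for label in y_true])
    return float(-np.mean(np.log(proba[np.arange(len(idx)), idx])))


y_true = ["low", "high", "medium"]

# A model that puts 0.8 on the correct class every time.
confident = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
# A model that always answers "no idea".
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

loss_confident = multiclass_log_loss(y_true, confident)
loss_uniform = multiclass_log_loss(y_true, uniform)
print(loss_confident, loss_uniform)
```

The confident and correct predictions score about -ln(0.8), while the uniform guess scores ln(3); the latter is a useful baseline that any reasonable three-class model should beat.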

# preload dataset automatically, if not already in place.
import os
from pathlib import Path
from pprint import pprint
import numpy as np
import pandas as pd
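With these imports in place, loading the data might look like the sketch below. The path `renthop_train.json.gz` in the working directory and the helper name `load_listings` are assumptions; adjust the path to wherever the file was saved. pandas can read a gzip-compressed JSON file directly, so this sketch assumes the archive has not been unpacked.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location; change it to match your setup.
DATA_PATH = Path("renthop_train.json.gz")


def load_listings(path):
    """Read the listings into a DataFrame.

    With compression="infer", pandas detects gzip from the ".gz" suffix.
    """
    return pd.read_json(path, compression="infer")


if DATA_PATH.exists():
    df = load_listings(DATA_PATH)
    print(df.shape)
else:
    print(f"{DATA_PATH} not found; download it from the competition page first.")
```

If the file has already been unpacked to train.json, the same call works on that path as well, since no compression is inferred from a ".json" suffix.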

Source: Yury Kashnitsky, https://mlcourse.ai/book/topic06/topic6_feature_engineering_feature_selection.html
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.