Feature transformations
Normalization and changing distribution
Monotonic feature transformations are critical for some algorithms and have no effect on others. This is one of the reasons for the increased popularity of decision trees and all of their derivatives (random forest, gradient boosting): not everyone can or wants to tinker with transformations, and these algorithms are robust to unusual distributions.
There are also purely engineering reasons: np.log is a way of dealing with large numbers that do not fit in np.float64.
This is an exception rather than a rule; often it's driven by the desire to adapt the dataset to the requirements of the algorithm. Parametric methods usually require the data distribution to be at least symmetric and unimodal, which is not always the case. There may be more stringent requirements; recall our earlier article about linear models.
However, data requirements are imposed not only by parametric methods; K nearest neighbors will predict complete nonsense if features are not normalized, e.g. when one distribution is located in the vicinity of zero and does not go beyond (-1, 1) while the other's range is on the order of hundreds of thousands.
A simple example: suppose the task is to predict the cost of an apartment from two variables: the distance from the city center and the number of rooms. The number of rooms rarely exceeds 5, whereas the distance from the city center can easily be in the thousands of meters.
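To make this concrete, here is a minimal sketch (with made-up apartment values) of how the raw distance feature drowns out the number of rooms in a Euclidean metric, and how Standard Scaling evens out the contributions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up apartments: [distance to the city center in meters, number of rooms]
apartments = np.array([[3000.0, 2.0],
                       [3100.0, 5.0],
                       [9000.0, 2.0]])

# Without scaling, the distance feature dominates: apartment 0 looks much
# closer to apartment 1 (3 extra rooms) than to apartment 2 (6 km farther)
raw = np.linalg.norm(apartments[0] - apartments[1:], axis=1)

# After Standard Scaling, both features contribute comparably
scaled = StandardScaler().fit_transform(apartments)
z = np.linalg.norm(scaled[0] - scaled[1:], axis=1)
```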
The simplest transformation is Standard Scaling (or Z-score normalization):
\(\large z= \frac{x-\mu}{\sigma}\)
Note that Standard Scaling does not make the distribution normal in the strict sense.
import numpy as np
from scipy.stats import beta, shapiro
from sklearn.preprocessing import StandardScaler

data = beta(1, 10).rvs(1000).reshape(-1, 1)
shapiro(data)
ShapiroResult(statistic=np.float64(0.8584622308590755), pvalue=np.float64(4.840065002002682e-29))
# Value of the statistic, p-value
shapiro(StandardScaler().fit_transform(data))
# With such a p-value, we'd have to reject the null hypothesis of normality of the data
ShapiroResult(statistic=np.float64(0.8584622308590759), pvalue=np.float64(4.8400650020030644e-29))
But, to some extent, it protects against outliers:
data = np.array([1, 1, 0, -1, 2, 1, 2, 3, -2, 4, 100]).reshape(-1, 1).astype(np.float64)
StandardScaler().fit_transform(data)
array([[-0.31922662],
[-0.31922662],
[-0.35434155],
[-0.38945648],
[-0.28411169],
[-0.31922662],
[-0.28411169],
[-0.24899676],
[-0.42457141],
[-0.21388184],
[ 3.15715128]])
(data - data.mean()) / data.std()
array([[-0.31922662],
[-0.31922662],
[-0.35434155],
[-0.38945648],
[-0.28411169],
[-0.31922662],
[-0.28411169],
[-0.24899676],
[-0.42457141],
[-0.21388184],
[ 3.15715128]])
Another fairly popular option is MinMax Scaling, which brings all the points within a predetermined interval (typically (0, 1)).
\(\large X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}\)
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler().fit_transform(data)
array([[0.02941176],
[0.02941176],
[0.01960784],
[0.00980392],
[0.03921569],
[0.02941176],
[0.03921569],
[0.04901961],
[0. ],
[0.05882353],
[1. ]])
(data - data.min()) / (data.max() - data.min())
array([[0.02941176],
[0.02941176],
[0.01960784],
[0.00980392],
[0.03921569],
[0.02941176],
[0.03921569],
[0.04901961],
[0. ],
[0.05882353],
[1. ]])
Standard Scaling and MinMax Scaling have similar applications and are often more or less interchangeable. However, if the algorithm involves the calculation of distances between points or vectors, the default choice is Standard Scaling, whereas MinMax Scaling is useful for visualization by bringing features within the interval (0, 255).
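A quick sketch of the visualization use case: MinMaxScaler accepts a feature_range parameter, so the same data can be mapped onto (0, 255), e.g. to treat a feature as pixel intensity:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([1, 1, 0, -1, 2, 1, 2, 3, -2, 4, 100], dtype=np.float64).reshape(-1, 1)

# Map the feature onto (0, 255): the minimum becomes 0, the maximum 255
pixels = MinMaxScaler(feature_range=(0, 255)).fit_transform(data)
```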
If we assume that some data is not normally distributed but is described by the log-normal distribution, it can easily be transformed to a normal distribution:
from scipy.stats import lognorm

data = lognorm(s=1).rvs(1000)
shapiro(data)
ShapiroResult(statistic=np.float64(0.5374207272599946), pvalue=np.float64(3.5465593735129234e-45))
shapiro(np.log(data))
ShapiroResult(statistic=np.float64(0.9992335965405731), pvalue=np.float64(0.964095456746809))
The lognormal distribution is suitable for describing salaries, prices of securities, urban populations, the number of comments on articles on the internet, etc. However, to apply this procedure, the underlying distribution does not necessarily have to be lognormal; you can try to apply this transformation to any distribution with a heavy right tail. Furthermore, one can try other similar transformations, formulating one's own hypotheses on how to bring the available distribution closer to a normal one. Examples of such transformations are the Box-Cox transformation (the logarithm is a special case of the Box-Cox transformation) and the Yeo-Johnson transformation (which extends the range of applicability to negative numbers). In addition, you can try adding a constant to the feature before taking the logarithm: np.log(x + const).
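Both transformations are readily available; a short sketch, assuming scipy and scikit-learn are installed:

```python
import numpy as np
from scipy.stats import boxcox, lognorm
from sklearn.preprocessing import PowerTransformer

data = lognorm(s=1).rvs(1000, random_state=17)

# Box-Cox finds the power lambda that makes the result as normal as possible;
# lambda = 0 corresponds to the plain logarithm, so for lognormal data
# the estimated lambda should land near zero
transformed, lmbda = boxcox(data)

# Yeo-Johnson extends the idea to features with zeros or negative values
shifted = (data - data.mean()).reshape(-1, 1)  # now contains negatives
yj = PowerTransformer(method="yeo-johnson").fit_transform(shifted)
```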
In the examples above, we have worked with synthetic data and strictly tested normality using the Shapiro-Wilk test. Let's try looking at some real data and testing for normality with a less formal method: the Q-Q plot. For a normal distribution, it will look like a smooth diagonal line, and visual anomalies should be intuitively understandable.

Fig. 1 Q-Q plot for lognormal distribution

Fig. 2 Q-Q plot for the same distribution after taking the logarithm
# Let's draw plots!
import statsmodels.api as sm
# Let's take the price feature from the Renthop dataset and filter out the most extreme values by hand for clarity
price = df.price[(df.price <= 20000) & (df.price > 500)]
price_log = np.log(price)
# Some acrobatics so that sklearn doesn't shower us with warnings
price_mm = (
MinMaxScaler()
.fit_transform(price.values.reshape(-1, 1).astype(np.float64))
.flatten()
)
price_z = (
StandardScaler()
.fit_transform(price.values.reshape(-1, 1).astype(np.float64))
.flatten()
)
Q-Q plot of the initial feature
sm.qqplot(price, loc=price.mean(), scale=price.std())
Q-Q plot after StandardScaler. Shape doesn't change
sm.qqplot(price_z, loc=price_z.mean(), scale=price_z.std())
Q-Q plot after MinMaxScaler. Shape doesn't change
sm.qqplot(price_mm, loc=price_mm.mean(), scale=price_mm.std())
Q-Q plot after taking the logarithm. Things are getting better!
sm.qqplot(price_log, loc=price_log.mean(), scale=price_log.std())
Let's see whether transformations can somehow help the real model. There is no silver bullet here.
Interactions
If previous transformations seemed rather math-driven, this part is more about the nature of the data; it can be attributed to both feature transformations and feature creation.
Let's come back again to the Two Sigma Connect: Rental Listing Inquiries problem. Among the features in this problem are the number of rooms and the price. Logic suggests that the cost per single room is more indicative than the total cost, so we can generate such a feature.
rooms = df["bedrooms"].apply(lambda x: max(x, 0.5))
# Avoid division by zero; .5 is chosen more or less arbitrarily
df["price_per_bedroom"] = df["price"] / rooms
You should limit yourself in this process. If there are a limited number of features, it is possible to generate all the possible interactions and then weed out the unnecessary ones using the techniques described in the next section. In addition, not all interactions between features must have a physical meaning; for example, polynomial features (see sklearn.preprocessing.PolynomialFeatures) are often used in linear models and are almost impossible to interpret.
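A minimal sketch of those hard-to-interpret polynomial interactions: for two features x1 and x2, degree=2 generates the constant, the features themselves, their squares, and their product:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# degree=2 expands [x1, x2] into [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]]
```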
Filling in the missing values
Not many algorithms can work with missing values, and the real world often provides data with gaps. Fortunately, this is one of the tasks for which one doesn't need any creativity. Both key Python libraries for data analysis provide easy-to-use solutions: pandas.DataFrame.fillna and sklearn.impute.SimpleImputer (the successor of the deprecated sklearn.preprocessing.Imputer).
These solutions do not have any magic happening behind the scenes. Approaches to handling missing values are pretty straightforward:

- encode missing values with a separate blank value like "n/a" (for categorical variables);
- use the most probable value of the feature (mean or median for numerical variables, the most common value for categorical variables);
- or, conversely, encode with some extreme value (good for decision-tree models since it allows the model to make a partition between the missing and non-missing values);
- for ordered data (e.g. time series), take the adjacent value, next or previous.

Easy-to-use library solutions sometimes suggest sticking to something like df = df.fillna(0) and not sweating the gaps. But this is not the best solution: data preparation takes more time than building models, so thoughtless gap-filling may hide a bug in processing and damage the model.
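The approaches listed above can be sketched on a toy frame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"rooms": [1.0, 2.0, np.nan, 4.0],
                    "city": ["NY", np.nan, "SF", "NY"]})

# Most probable value: median for numerical, most frequent for categorical
rooms_med = SimpleImputer(strategy="median").fit_transform(toy[["rooms"]])
city_mode = SimpleImputer(strategy="most_frequent").fit_transform(toy[["city"]])

# Extreme value, handy for tree models
rooms_extreme = toy["rooms"].fillna(-999)

# Adjacent value for ordered data (forward fill)
rooms_ffill = toy["rooms"].ffill()
```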