Numerical variables¶

The Numerical handling and Missing values methods, and their related controls, specify how a numerical variable is handled.

Numerical handling¶

Keep as a regular numerical feature simply takes the numerical input as is, with optional Rescaling. In addition, post-rescaling, you can request that derived features such as sqrt(x), x^2, … be generated and considered in the model.
Datetime cyclical encoding (Python training backend only).
Replace by 0/1 flag indicating presence.
Binarize based on a threshold replaces the feature values with a 0/1 flag that indicates whether the value is above or below the specified threshold.
Quantize replaces the feature values with their quantiles in the feature’s empirical distribution. More precisely, if we set the number of quantiles to \(n\), the numerical feature is split into \(n\) intervals (quantiles), each containing one \(n\)-th of the feature values. Each numerical value is then replaced by the index (from \(0\) to \(n - 1\)) of the interval it belongs to.
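
As an illustration of the Binarize and Quantize handlings, here is a minimal sketch in plain Python. It is not the product's implementation: the function names are hypothetical, the strict "above" comparison in binarize is an assumption, and the cut points come from the standard library's statistics.quantiles, whose interpolation method may differ from the one actually used.

```python
from bisect import bisect_right
from statistics import quantiles


def binarize(values, threshold):
    """Return a 0/1 flag per value.

    Assumption: values strictly above the threshold map to 1, all others to 0.
    """
    return [1 if v > threshold else 0 for v in values]


def quantize(values, n):
    """Replace each value by the index (0 to n - 1) of its quantile interval."""
    # n - 1 cut points that split the empirical distribution into n intervals
    cuts = quantiles(values, n=n)
    # The interval index is the number of cut points at or below the value
    return [bisect_right(cuts, v) for v in values]


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(binarize(x, threshold=4))  # [0, 0, 0, 0, 1, 1, 1, 1]
print(quantize(x, n=4))          # [0, 0, 1, 1, 2, 2, 3, 3]
```

With eight distinct values and four quantiles, each interval receives two values, that is one quarter of the data.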

All numerical handlings (except 0/1 presence flag) offer the possibility to keep the original numerical feature as an extra feature that can in turn be rescaled.

Datetime cyclical encoding¶

Datetime cyclical encoding transforms datetime features (timestamps) into numerical features, while preserving the cyclical significance of date and time periods.

More specifically, for every selected time period (minute, hour, day, week, month, quarter or year), the datetime cyclical encoding converts the timestamp or date to a number of seconds \(t\) elapsed since the start of that period, and then encodes \(t\) into two numerical features using the following formulas, where \(T\) is the reference duration of the period in seconds:

\[\begin{split}\begin{cases} \sin\left(\dfrac{2\pi \cdot t}{T}\right) \\ \\ \cos\left(\dfrac{2\pi \cdot t}{T}\right) \end{cases}\end{split}\]

In order to take into account leap seconds and leap years, the timestamp is first converted to a number of seconds for each selected period. By way of example, we’ll detail the computation for the 2021-09-27T02:17:35 reference timestamp.

minute: \(t\) is defined as the number of seconds since the beginning of the same minute (i.e. 35 seconds in our example).
hour: \(t\) is defined as the number of seconds since the beginning of the same hour (i.e. 17*60 + 35 seconds in our example).
day: \(t\) is defined as the number of seconds since the beginning of the same day, at 00:00:00 (i.e. 2*3600 + 17*60 + 35 seconds in our example).
week: \(t\) is defined as the number of seconds since Monday of the same week, at 00:00:00 (i.e. since 2021-09-27T00:00:00 in our example, because 2021-09-27 is itself a Monday).
month: \(t\) is defined as the number of seconds since the first day of the same month, at 00:00:00 (i.e. since 2021-09-01T00:00:00 in our example).
quarter: \(t\) is defined as the number of seconds since the first day of the same quarter, at 00:00:00 (i.e. since 2021-07-01T00:00:00 in our example).
year: \(t\) is defined as the number of seconds since the first day of the same year, at 00:00:00 (i.e. since 2021-01-01T00:00:00 in our example).
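
The definitions of \(t\) listed here can be sketched with Python's standard datetime module. This is only an illustration under stated assumptions: the function name seconds_since_period_start is hypothetical, the timestamp is treated as a naive datetime (no time zone), and Python's datetime arithmetic ignores leap seconds.

```python
from datetime import datetime, timedelta


def seconds_since_period_start(ts, period):
    """Number of seconds elapsed between the start of `period` and `ts`."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "minute":
        start = ts.replace(second=0, microsecond=0)
    elif period == "hour":
        start = ts.replace(minute=0, second=0, microsecond=0)
    elif period == "day":
        start = midnight
    elif period == "week":
        # weekday() is 0 for Monday, so this steps back to Monday 00:00:00
        start = midnight - timedelta(days=ts.weekday())
    elif period == "month":
        start = midnight.replace(day=1)
    elif period == "quarter":
        # First month of the quarter: 1, 4, 7 or 10
        first_month = 3 * ((ts.month - 1) // 3) + 1
        start = midnight.replace(month=first_month, day=1)
    elif period == "year":
        start = midnight.replace(month=1, day=1)
    else:
        raise ValueError(f"unknown period: {period}")
    return int((ts - start).total_seconds())


ts = datetime(2021, 9, 27, 2, 17, 35)  # reference timestamp, a Monday
for period in ("minute", "hour", "day", "week", "month", "quarter", "year"):
    print(period, seconds_since_period_start(ts, period))
```

For the reference timestamp this gives 35 for minute, 1055 for hour and 8255 for day, matching the hand computations 17*60 + 35 and 2*3600 + 17*60 + 35.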

The reference period durations are:

60 seconds (\(T = 60\ s\)) for minute,
60 minutes (\(T = 3600\ s\)) for hour,
24 hours (\(T = 86400\ s\)) for day,
7 days (\(T = 604800\ s\)) for week,
31 days (\(T = 2678400\ s\)) for month,
92 days (\(T = 7948800\ s\)) for quarter,
366 days (\(T = 31622400\ s\)) for year.
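
Putting the sine and cosine formulas together with these reference durations, the encoding step itself might look like the following sketch. The names PERIOD_SECONDS and cyclical_encode are hypothetical, and the input \(t\) is assumed to be already expressed in seconds since the start of the period.

```python
import math

# Reference period durations in seconds, as listed in this section
PERIOD_SECONDS = {
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
    "month": 2678400,
    "quarter": 7948800,
    "year": 31622400,
}


def cyclical_encode(t, period):
    """Encode t (seconds since the period start) as a (sin, cos) pair."""
    T = PERIOD_SECONDS[period]
    angle = 2 * math.pi * t / T
    return math.sin(angle), math.cos(angle)


# 06:00:00 is a quarter of the way through the day: sin is 1, cos is close to 0
print(cyclical_encode(6 * 3600, "day"))

# 23:59:59 and 00:00:00 are one second apart and get nearly identical encodings
print(cyclical_encode(86399, "day"))
print(cyclical_encode(0, "day"))
```

The last two calls show why both features are used together: times that are close on the clock remain close in the encoded space, even across the boundary between two periods.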

Rescaling¶

Rescaling can be performed prior to training, which can improve model performance in some instances. We advise rescaling numerical variables when both of the following apply:

The selected algorithms are not based on decision trees (rescaling has no effect on decision trees).
There are large differences in the absolute values of the features.

Two rescaling methods are available.

Standard rescaling scales the feature to a standard deviation of one and a mean of zero (default setting).

Min-max rescaling sets the minimum value of the feature to zero and the max to one.
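
The two rescaling methods can be sketched as follows. This is a minimal illustration, not the product's code: the function names are hypothetical, the use of the population standard deviation is an assumption, and so is the handling of constant features (mapped to 0.0 to avoid a division by zero).

```python
from statistics import fmean, pstdev


def standard_rescale(values):
    """Rescale to a mean of zero and a standard deviation of one."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant feature
    return [(v - mean) / std for v in values]


def min_max_rescale(values):
    """Rescale so that the minimum maps to zero and the maximum to one."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature
    return [(v - lo) / (hi - lo) for v in values]


x = [2.0, 4.0, 6.0, 8.0]
print(standard_rescale(x))
print(min_max_rescale(x))
```

In practice the statistics (mean, standard deviation, minimum, maximum) would typically be computed on the training data and reused unchanged when scoring new data.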

Missing values¶

There are a few choices for handling missing values in numerical features.

Impute… replaces missing values with the specified value. Use this when values are missing at random, for example because of random noise.
Drop rows discards rows with missing values from model building. Avoid discarding rows unless missing data is extremely rare.
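
The two strategies can be illustrated with a short plain-Python sketch. The helper names are hypothetical, missing values are assumed to be represented as None or NaN, and filling with the mean of the observed values is only one possible choice of specified value.

```python
import math
from statistics import fmean


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute(values, fill_value):
    """Replace every missing entry with the specified fill value."""
    return [fill_value if is_missing(v) else v for v in values]


def drop_rows(rows, column):
    """Keep only the rows whose value in `column` is present."""
    return [row for row in rows if not is_missing(row.get(column))]


ages = [31.0, None, 45.0, float("nan"), 26.0]
observed = [v for v in ages if not is_missing(v)]
print(impute(ages, fill_value=fmean(observed)))  # [31.0, 34.0, 45.0, 34.0, 26.0]

rows = [{"age": 31.0}, {"age": None}, {"age": 45.0}]
print(drop_rows(rows, "age"))  # [{'age': 31.0}, {'age': 45.0}]
```

Imputation keeps all five rows available for training, whereas dropping rows shrinks the dataset, which is why it is best reserved for cases where missing values are rare.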