Numerical variables

The Numerical handling and Missing values methods, and their related controls, specify how a numerical variable is handled.

  • Keep as a regular numerical feature allows for rescaling prior to training, which can improve model performance in some instances. Standard rescaling scales the feature to a mean of zero and a standard deviation of one. Min-max rescaling maps the minimum value of the feature to zero and the maximum to one. After rescaling, you can also request that derived features such as sqrt(x), x^2, … be generated and considered in the model. Rescale numerical variables when the features differ widely in absolute magnitude.
  • Replace by 0/1 flag indicating presence replaces the feature values with a 0/1 flag that indicates whether a value is present.
  • Binarize based on a threshold replaces the feature values with a 0/1 flag that indicates whether the value is above or below the specified threshold.
  • Quantize replaces the feature values with the quantiles of the feature’s empirical distribution.
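
As a rough illustration of what the options above compute, here is a minimal sketch in plain Python. It is not the product's implementation. All function names are hypothetical, and several details are assumptions: the population standard deviation is used for standard rescaling, constant features are mapped to zero, the presence flag uses 1 for "present", the binarize flag is 1 for values strictly above the threshold, and quantize returns the index of the quantile bin.

```python
import bisect
import statistics


def standard_rescale(values):
    """Shift the feature to mean 0 and scale it to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # assumption: constant feature maps to 0
    return [(v - mean) / std for v in values]


def min_max_rescale(values):
    """Map the minimum of the feature to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant feature maps to 0
    return [(v - lo) / (hi - lo) for v in values]


def presence_flag(values):
    """1 where a value is present, 0 where it is missing (None)."""
    return [0 if v is None else 1 for v in values]


def binarize(values, threshold):
    """1 where the value is strictly above the threshold, 0 otherwise."""
    return [1 if v > threshold else 0 for v in values]


def quantize(values, n_bins):
    """Replace each value with the index of its empirical quantile bin."""
    # Cut points are taken from the feature's own empirical distribution.
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [bisect.bisect_right(cuts, v) for v in values]


feature = [2.0, 4.0, 6.0, 8.0]
print(standard_rescale(feature))
print(min_max_rescale(feature))
print(binarize(feature, 5.0))   # [0, 0, 1, 1]
print(quantize(feature, 2))     # [0, 0, 1, 1]
```

Rescaling matters most when features live on very different scales, for example an age in years next to an income in dollars, since many models are sensitive to the absolute magnitude of their inputs.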

Missing values

There are a few choices for handling missing values in categorical and numerical features.

  • Treat as a regular value (categorical features only) treats missing values as a distinct category. This should be used for structurally missing data that are impossible to measure, e.g. the US state for an address in Canada.
  • Impute… replaces missing values with the specified value. Use this for data that are missing at random, e.g. due to random noise.
  • Drop rows discards rows with missing values when building the model. Avoid discarding rows unless missing data is extremely rare.
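
The three strategies above can be sketched as follows. This is only an illustration in plain Python, not the product's implementation. The function names, the "__missing__" label, and the use of None to represent a missing value are all assumptions made for the example.

```python
import statistics


def treat_as_category(values, label="__missing__"):
    """Categorical only: turn missing values into their own category."""
    return [label if v is None else v for v in values]


def impute(values, fill_value):
    """Replace every missing value with the specified value."""
    return [fill_value if v is None else v for v in values]


def drop_rows(rows, column):
    """Keep only the rows where the given column is not missing."""
    return [row for row in rows if row[column] is not None]


# Structurally missing: a Canadian address has no US state.
us_state = ["CA", None, "NY"]
print(treat_as_category(us_state))  # ['CA', '__missing__', 'NY']

# Missing at random: fill with a chosen value, here the mean of the
# observed values (one possible choice, not the only one).
age = [30.0, None, 50.0]
observed_mean = statistics.fmean(v for v in age if v is not None)
print(impute(age, observed_mean))   # [30.0, 40.0, 50.0]

# Dropping rows shrinks the training set, so it is best kept for rare gaps.
rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
print(drop_rows(rows, "age"))       # [{'age': 30.0}, {'age': 50.0}]
```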