Categorical variables

The Category handling and Missing values methods, and their related controls, specify how a categorical variable is handled.

  • Dummy-encoding (vectorization) creates a vector of 0/1 flags of length equal to the number of categories in the categorical variable. You can choose to drop one of the dummies so that they are not linearly dependent, or let Dataiku decide (in which case the least frequently occurring category is dropped). There is a limit on the number of dummies, which can be based on a maximum number of categories, the cumulative proportion of rows accounted for by the most frequent categories, or a minimum number of samples per category.
  • Replace by 0/1 flag indicating presence
  • Impact-coding
  • Feature hashing (for high cardinality)
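
To make the dummy-encoding option concrete, here is a minimal sketch in plain Python. It is not Dataiku's implementation: the function name dummy_encode and its parameters max_categories and drop_least_frequent are hypothetical, and the order in which the drop and the limit are applied is an assumption made for illustration.

```python
from collections import Counter

def dummy_encode(values, max_categories=None, drop_least_frequent=False):
    """Return (categories, rows): retained categories and one 0/1 vector per value.

    Hypothetical helper for illustration only, not a Dataiku API.
    """
    # most_common() orders categories from most to least frequent.
    categories = [cat for cat, _ in Counter(values).most_common()]
    if drop_least_frequent and len(categories) > 1:
        # Drop one dummy so the remaining flags are not linearly dependent.
        categories = categories[:-1]
    if max_categories is not None:
        # Cap the number of dummies, keeping the most frequent categories.
        categories = categories[:max_categories]
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

colors = ["red", "blue", "red", "green", "red", "blue"]
categories, rows = dummy_encode(colors, drop_least_frequent=True)
print(categories)  # retained categories, most frequent first
print(rows)        # one 0/1 vector per input value
```

In this sketch, a value whose category was dropped or cut by the limit is encoded as a vector of zeros.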

Missing values

There are a few choices for handling missing values in categorical and numerical features.

  • Treat as a regular value (categorical features only) treats missing values as a distinct category. This should be used for structurally missing data that are impossible to measure, e.g. the US state for an address in Canada.
  • Impute… replaces missing values with the specified value. This should be used for data that are missing at random, e.g. because of random noise.
  • Drop rows excludes rows with missing values from model building. Avoid discarding rows unless missing data is extremely rare.
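
The three choices above can be sketched in plain Python as follows. This only illustrates the ideas, not how Dataiku applies them: the sample records, the "MISSING" label, and the use of the median as the imputation value are all assumptions.

```python
from statistics import median

# Hypothetical sample data: us_state is structurally missing for Canadian
# addresses, while one age is missing at random.
records = [
    {"country": "US", "us_state": "NY", "age": 34.0},
    {"country": "CA", "us_state": None, "age": None},
    {"country": "US", "us_state": "CA", "age": 29.0},
    {"country": "CA", "us_state": None, "age": 41.0},
]

# Treat as a regular value: missing becomes its own category.
states = [r["us_state"] if r["us_state"] is not None else "MISSING" for r in records]

# Impute: replace missing values with a specified value (here, the median).
known_ages = [r["age"] for r in records if r["age"] is not None]
fill_value = median(known_ages)
imputed_ages = [r["age"] if r["age"] is not None else fill_value for r in records]

# Drop rows: exclude records with a missing age from model building.
complete_records = [r for r in records if r["age"] is not None]
```

Note that dropping rows here removes a quarter of the sample, which is why that option is best reserved for cases where missing data is extremely rare.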