Self-Supervised Tabular Embeddings¶
This capability provides recipes to automatically learn and generate dense feature representations (embeddings) from tabular data using a self-supervised deep learning approach. It is provided by the Tabular Self-Supervised Learning plugin, which you need to install. Please see Installing plugins.
Overview¶
Self-supervised learning (SSL) helps discover complex patterns and relationships in your data without requiring target labels. This is achieved by training a Masked-Feature Autoencoder—a neural network designed to reconstruct its own input.
During training, the system intentionally hides (masks) a percentage of features in each row. The model is forced to predict these missing values by learning the underlying functional dependencies between all variables in the dataset. This process produces an encoder capable of translating raw, sparse data into a dense vector of “latent” features that capture the semantic essence of the record.
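To make the mechanism concrete, here is a minimal PyTorch sketch of a masked-feature autoencoder. It is illustrative only: the layer sizes, masking scheme, and loss are assumptions for this sketch, not the plugin's actual implementation.

```python
import torch
import torch.nn as nn

class MaskedFeatureAutoencoder(nn.Module):
    """Illustrative masked-feature autoencoder (not the plugin's code)."""

    def __init__(self, n_features: int, embedding_dim: int):
        super().__init__()
        # Encoder compresses a (partially masked) row into a dense latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, embedding_dim),
        )
        # Decoder tries to reconstruct every original feature, including the hidden ones.
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Zero out masked features before encoding, then reconstruct the full row.
        latent = self.encoder(x * (1 - mask))
        return self.decoder(latent)

model = MaskedFeatureAutoencoder(n_features=20, embedding_dim=8)
x = torch.randn(4, 20)                          # a small batch of preprocessed rows
mask = (torch.rand_like(x) < 0.3).float()       # 1 = feature hidden from the model
reconstruction = model(x, mask)
loss = ((reconstruction - x) ** 2 * mask).mean()  # penalize errors on masked cells only
embeddings = model.encoder(x)                   # dense latent vectors, one per row
```

After training, only the encoder is needed: it maps each raw row to its dense embedding, which is what the inference recipe later writes out.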
Tabular Embedding Pretraining recipe¶
This recipe trains the autoencoder on a representative dataset to learn the feature interactions.
Input¶
A training dataset (no labels required).
Outputs¶
A managed folder containing the trained model weights and preprocessing metadata.
A metrics dataset to monitor training convergence.
Parameters¶
Columns to ignore: Features to exclude from the learning process, such as potential future target variables.
Embedding dimension: The size of the final latent vector; smaller dimensions force higher compression.
Mask ratio: The fraction of input features hidden during each training step (e.g., 0.3).
Training process: Standard hyperparameters including Epochs, Batch size, and Learning rate (combined in the sketch after this list).
Cross-validation: Optional K-Fold validation to ensure the model generalizes to unseen data.
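As a rough illustration of how these hyperparameters fit together, the sketch below runs a masked-reconstruction training loop. The model, data, and masking logic are placeholders standing in for the autoencoder described above, not the recipe's internals.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical values for the parameters listed above.
epochs, batch_size, learning_rate, mask_ratio = 5, 256, 1e-3, 0.3

X = torch.randn(1024, 32)                        # preprocessed training rows (no labels)
loader = DataLoader(TensorDataset(X), batch_size=batch_size, shuffle=True)

# Stand-in for the masked-feature autoencoder (encoder + decoder).
autoencoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 32))
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=learning_rate)

for epoch in range(epochs):
    for (batch,) in loader:
        mask = (torch.rand_like(batch) < mask_ratio).float()   # hide ~30% of features
        reconstruction = autoencoder(batch * (1 - mask))
        loss = ((reconstruction - batch) ** 2 * mask).mean()   # score only masked cells
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The metrics dataset would typically record the loss per epoch to monitor convergence.
```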
Tabular Embedding Inference recipe¶
This recipe uses a pretrained encoder to transform datasets into dense feature sets.
Input¶
The dataset to transform.
A managed folder containing the model to use for inference.
Outputs¶
The original dataset with an additional column containing the learned vector representation (see the sketch at the end of this section).
Parameters¶
Embedding column name: The name for the new vector column (defaults to embedding).
Batch size: The number of rows processed simultaneously during inference.
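To illustrate the shape of the output, the following hedged sketch appends an embedding column to a pandas DataFrame. The encode function stands in for the pretrained encoder loaded from the managed folder; all column names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 51], "income": [42_000, 87_000]})

def encode(rows: np.ndarray) -> np.ndarray:
    # Placeholder for the pretrained encoder: returns one dense vector per input row.
    weights = np.random.default_rng(0).normal(size=(rows.shape[1], 4))
    return np.tanh(rows @ weights)

# Each row keeps its original columns and gains one vector-valued "embedding" column.
df["embedding"] = list(encode(df.to_numpy(dtype="float32")))
print(df)
```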
Feature Handling¶
The recipes include a built-in preprocessing layer to ensure data is compatible with deep learning:
Scaling & Encoding: Numerical values are mean-scaled, and categorical values are one-hot encoded.
Missing Values: Handled via “missing” tokens for categories and mean-imputation for numerical columns.
Concept-Aware Masking: The masking logic hides entire one-hot encoded groups together, forcing the model to learn the high-level semantic concept of a category.
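The snippet below sketches one plausible way such group-wise masking could work; the feature grouping, mask ratio, and array layout are assumptions for illustration, not the plugin's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns 0-1 are numeric features; columns 2-4 are the one-hot group of one categorical feature.
feature_groups = [[0], [1], [2, 3, 4]]
row = np.array([0.7, -1.2, 0.0, 1.0, 0.0])

mask = np.zeros_like(row)
for group in feature_groups:
    if rng.random() < 0.3:      # mask ratio applied per feature, not per encoded column
        mask[group] = 1.0       # hide every column of the group at once

masked_row = row * (1 - mask)   # the whole category concept is either visible or hidden
```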
Use Cases¶
Augmenting Small Labeled Datasets: Pretrain the autoencoder on a large unlabeled dataset to provide supervised models with “pre-learned” domain knowledge.
Automated Feature Engineering: Capture high-order, non-linear interactions between dozens of variables without manual intervention.
Dimensionality Reduction: Compress high-cardinality or sparse datasets into a dense, lower-dimensional space while preserving structural information.