MLLib (Spark) engine¶
MLLib is Spark’s machine learning library. DSS can use it to train prediction or clustering models on your large datasets that don’t fit into memory.
Warning
Spark’s overhead is non-negligible and its support is limited (see Limitations).
If your data fits into memory, you should use regular in-memory ML instead for faster learning and more extensive options and algorithms.
Usage¶
When you create a new machine learning in an Analysis, you can select the backend. By default it’s Python in memory, but if you have Spark correctly set up, you can also see Spark MLLib. Select it and your model will be trained on Spark, using algorithms available in MLLib or your custom MLLib-compatible models.
You can then fine-tune your model, deploy it in the Flow as a retrainable model and apply it in a scoring recipe to perform prediction on unlabelled datasets. Clustering models may also be retrained on new datasets through the cluster recipe.
In the model’s settings and the training, scoring and cluster recipes, there is an additional Spark config section, in which you can:
Change the base Spark configuration
Add / override Spark configuration options
Select the storage level of the dataset for caching once the data is loaded and prepared
Select the number of Spark RDD partitions to split non-HDFS input datasets
See DSS and Spark for more information about Spark in Data Science Studio.
Prediction Algorithms¶
DSS 12 supports the following algorithms on MLLib:
Logistic Regression (classification)
Linear Regression (regression)
Decision Trees (classification & regression)
Random Forest (classification & regression)
Gradient Boosted Trees (binary classification & regression)
Naive Bayes (multiclass classification)
Custom models
Clustering algorithms¶
DSS 12 supports the following algorithms on MLLib:
KMeans (clustering)
Gaussian Mixtures (clustering)
Custom models
Custom Models¶
Models using custom code may be trained with the MLLib backend. To train such a model,
Implement classes extending the
Estimator
andModel
classes of theorg.apache.spark.ml
package. These will be the classes used to train your model: DSS will call thefit(DataFrame)
method of yourEstimator
and thetransform(DataFrame)
method of yourModel
.Package your classes and all necessary classes in a jar, and place it in the
lib/java
folder of your data directory.In DSS, open your MLLib model settings and add a new custom algorithm in the algorithm list.
Place the initialization (scala) code for your
Estimator
into the code editor, together with any necessaryimport
statements. The initialization statement should be the last to be called. Note that declaring classes (including anonymous classes) in the editor is not recommended, as it may cause serialization errors. They should therefore be compiled and put in the jar.
Limitations¶
On top of the general Spark limitations in DSS, MLLib has specific limitations:
Gradient Boosted Trees in MLLib does not output per-class probabilities, so there is no threshold to set, and some metrics (AUC, Log loss, Lift) are not available, as are some report sections (variable importance, decision & lift charts, ROC curve).
Some feature preprocessing options are not available (although most can be achieved by other means):
Feature combinations
Numerical handling other than regular
Categorical handling other than dummy encoding
Text handling other than Tokenize, hash & count
Dimensionality-reduction for clustering
If test dataset is larger than 1 million rows, it will be subsampled to ~1M rows for performance and memory consumption reasons, since some scoring operations require sorting and collecting the whole data.
K-fold cross-test and hyperparameter optimization (grid search) are not supported.
Build partitioned models with MLLib backend is not supported
Post training computations like partial dependence plot and subpopulation analysis are not supported
Individual explanations are not supported
Containerized execution is not supported
Debugging capabilities such as Assertions and Diagnostics are not supported
Interactive scoring is not available