Machine learning with Spark (MLLib)

MLLib is Spark’s machine learning library. DSS can use it to train prediction models on large datasets that don’t fit into memory.


Spark’s overhead is non-negligible, and MLLib support in DSS is limited (see Limitations).
If your data fits into memory, use regular in-memory ML instead: training is faster, and more options and algorithms are available.


When you create a new prediction model in an Analysis, you can select the backend. The default is Python in memory, but if Spark is correctly set up, the available Spark configurations are listed as well. Select one of them, and your model will be trained on Spark.

You can then fine-tune your model, deploy it in the Flow as a retrainable model and apply it in a scoring recipe to perform prediction on unlabelled datasets.

In the model’s settings, the training recipe and the scoring recipe, there is an additional Spark config section, in which you can:

  • Change the base Spark configuration
  • Add / override Spark configuration options
  • Select the storage level of the dataset for caching once the data is loaded and prepared
  • Select the number of Spark RDD partitions to split non-HDFS input datasets
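
The first two options above amount to layering overrides on top of a named base configuration. The sketch below illustrates that idea in plain Python. It is not how DSS implements it: the `BASE_CONFIGS` table, the `default` configuration name and the `effective_spark_conf` helper are hypothetical and only serve the illustration. The property names are standard Spark configuration keys.

```python
# Hypothetical named base configurations; DSS defines its own set.
BASE_CONFIGS = {
    "default": {
        "spark.executor.memory": "2g",
        "spark.executor.cores": "2",
    },
}


def effective_spark_conf(base_name, overrides):
    """Return the base configuration with the overrides layered on top."""
    conf = dict(BASE_CONFIGS[base_name])  # copy so the base stays untouched
    conf.update(overrides)  # an existing key is overridden, a new key is added
    return conf


conf = effective_spark_conf(
    "default",
    {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "64"},
)
for key in sorted(conf):
    print(key, "=", conf[key])
```

In this sketch, `spark.executor.memory` is overridden, `spark.sql.shuffle.partitions` is added, and `spark.executor.cores` keeps its base value.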

See DSS and Spark for more information about Spark in Data Science Studio.


Limitations

On top of the general Spark limitations in DSS, MLLib has specific limitations:

  • DSS 2.3.0 supports the following algorithms on MLLib:
    • Logistic Regression (classification)
    • Linear Regression (regression)
    • Random Forest (classification & regression)
    • Gradient Boosted Trees (binary classification & regression)
  • Gradient Boosted Trees in MLLib does not output per-class probabilities, so there is no threshold to set. As a result, some metrics (AUC, Log loss, Lift) and some report sections (variable importance, decision & lift charts, ROC curve) are not available.
  • Some feature preprocessing options are not available (although most can be achieved by other means):
    • Features generation
    • Numerical handling other than regular
    • Impute with median
    • Categorical handling other than dummy encoding
    • Text handling other than Tokenize, hash & count
  • If the test dataset is larger than 1 million rows, it is subsampled to about 1 million rows for performance and memory reasons, since some scoring operations require sorting and collecting the whole dataset.
  • K-fold and grid search are not supported.
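
To give a feel for the test-set subsampling limitation above, here is a minimal sketch of how a dataset can be reduced to roughly 1 million rows. The source does not say how DSS performs the sampling. This example assumes independent per-row (Bernoulli) sampling, which would explain why the result is only approximately 1 million rows. The `sampling_fraction` and `subsample` helpers are hypothetical names.

```python
import random

TARGET_ROWS = 1_000_000  # approximate cap stated in the limitation above


def sampling_fraction(n_rows, target=TARGET_ROWS):
    """Fraction of rows to keep so that about `target` rows remain."""
    if n_rows <= target:
        return 1.0  # small enough: keep everything
    return target / n_rows


def subsample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# A 5-million-row test set would be sampled at a fraction of 0.2.
fraction = sampling_fraction(5_000_000)

# Demonstrated on a small stand-in dataset to keep the example fast.
kept = subsample(range(10_000), fraction)
print(fraction, len(kept))
```

Because each row is kept or dropped independently, the number of rows kept varies slightly around the target from one seed to another.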