H2O (Sparkling Water) engine

Sparkling Water is H2O’s support for machine learning with Spark.

DSS can train H2O algorithms by creating an H2O cluster on top of your existing Spark cluster using Sparkling Water. With this integration, you can use all the options available for the MLLib backend, along with the additional capabilities provided by H2O.

Warning

Distributed machine learning’s overhead is non-negligible.

If your data fits into memory, you should consider using regular in-memory ML instead for faster learning and more extensive options and algorithms.

Setup

To use Sparkling Water, you must have a working Spark installation, and DSS must be configured for Spark integration. See DSS and Spark for more information about Spark in Data Science Studio.

To set up Sparkling Water, run the installation script from the DSS data directory:

./bin/dssadmin install-h2o-integration

If no further arguments are supplied, the version of the Sparkling Water assembly jar corresponding to your Spark installation will be downloaded.

If your machine is not connected to the Internet, you must provide your own distribution of Sparkling Water. Specify the path of the unzipped folder using the -sparklingWaterDir /path/to/sparkling-water option.
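For an offline machine, the full invocation might look like the sketch below. The path is the placeholder used above and stands for wherever you unzipped the distribution. The `install_h2o_offline` helper and its directory check are illustrative additions, not part of the DSS installer.

```shell
#!/bin/sh
# Run from the DSS data directory.
# Placeholder path: replace with the folder where you unzipped Sparkling Water.
SPARKLING_WATER_DIR="/path/to/sparkling-water"

# Hypothetical helper: checks that the folder exists, then runs the installer.
install_h2o_offline() {
    # $1: path to the unzipped Sparkling Water distribution
    if [ ! -d "$1" ]; then
        echo "Sparkling Water directory not found: $1" >&2
        return 1
    fi
    ./bin/dssadmin install-h2o-integration -sparklingWaterDir "$1"
}

install_h2o_offline "$SPARKLING_WATER_DIR" || echo "H2O integration was not installed" >&2
```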

Using Sparkling Water

To train an H2O algorithm, simply create a new analysis, create a model, and choose the H2O backend in the backend list.

Prediction algorithms

The following H2O algorithms are supported:

  • Deep Learning (regression & classification)
  • GBM (regression & classification)
  • GLM (regression & classification)
  • Random Forest (regression & classification)
  • Naive Bayes (multiclass classification)

Clustering algorithms

The following H2O algorithm is supported:

  • KMeans (clustering)

Limitations

Limitations are the same as those for the MLLib backend.
In addition, users should note that:

  • The Naive Bayes algorithm only functions with categorical variables.
  • Due to an implementation bug, H2O’s GLM algorithm does not cope well with categorical features that are left unprocessed. Dummifying these features is recommended instead: it has the same effect on algorithm performance, without the risk of errors.
  • The H2O cluster UI (generally available on port 54321) is not accessible.

Memory requirements

Unlike MLLib, Sparkling Water requires the whole training set to fit into distributed memory (i.e., the sum of the memory of all Spark executors).

If too little memory is allocated to the Spark executors, the job can fail or hang. You might need to tune the `spark.executor.memory` Spark configuration key.
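As a rough sanity check before training, you can compare the size of the training set with the total executor memory. The sketch below uses hypothetical figures (executor count, memory per executor, in-memory data size) and a hypothetical helper name. It ignores the memory that Spark and H2O need for themselves, so treat a result close to the limit as insufficient.

```shell
#!/bin/sh
# Usage: fits_in_distributed_memory EXECUTORS MEM_PER_EXECUTOR_GB DATA_GB
# Succeeds if the sum of executor memory is at least the training set size.
# Simplification: Spark and H2O overhead is not taken into account.
fits_in_distributed_memory() {
    total_gb=$(( $1 * $2 ))
    [ "$total_gb" -ge "$3" ]
}

# Hypothetical cluster: 4 executors with spark.executor.memory set to 8g,
# and a training set that takes about 20 GB once loaded in memory.
if fits_in_distributed_memory 4 8 20; then
    echo "Training set fits in distributed memory"
else
    echo "Increase spark.executor.memory or the number of executors"
fi
```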