H2O (Sparkling Water) engine
Sparkling Water is H2O's support for machine learning with Spark.
DSS can train H2O algorithms by creating an H2O cluster on top of your existing Spark cluster using Sparkling Water. With this integration, all the options available for the traditional MLLib backend remain usable, along with the additional capabilities provided by H2O.
Distributed machine learning’s overhead is non-negligible.
If your data fits into memory, you should consider using regular in-memory ML instead for faster learning and more extensive options and algorithms.
To use Sparkling Water, you must first have a working Spark installation, and DSS must be configured for Spark integration. See DSS and Spark for more information about Spark in Data Science Studio.
To set up Sparkling Water, run the installation script from the DSS data directory.
If no further arguments are supplied, the version of the Sparkling Water assembly jar corresponding to your Spark installation will be downloaded.
If your machine is not connected to the internet, you must provide your own distribution of Sparkling Water and specify the path of the unzipped folder using the `-sparklingWaterDir /path/to/sparkling-water` option.
To train an H2O algorithm, simply create a new analysis, create a model, and choose the H2O backend in the backend list.
The following H2O algorithms are supported:
- Deep Learning (regression & classification)
- GBM (regression & classification)
- GLM (regression & classification)
- Random Forest (regression & classification)
- Naive Bayes (multiclass classification)
Limitations are the same as those for the MLLib backend.
In addition, users should note that:
- The Naive Bayes algorithm only functions with categorical variables.
- Due to an implementation bug, H2O's GLM algorithm does not cope well with categorical features that are left unhandled. Using dummification instead is recommended: it has the same effect on algorithm performance, without the risk of errors.
- The H2O cluster UI (generally accessible on port 54321) is not accessible.
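As an illustration of what dummification does to a categorical feature, here is a minimal plain-Python sketch. It is not how DSS or H2O implement it; the `dummify` helper and the sample `rows` are hypothetical and only show the shape of the transformation (one 0/1 column per category).

```python
def dummify(rows, column):
    """One-hot encode `column` in a list of dict rows (illustrative helper)."""
    # Collect the distinct categories, sorted for a stable column order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other feature unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data.
rows = [
    {"color": "red", "price": 10.0},
    {"color": "blue", "price": 12.5},
    {"color": "red", "price": 9.0},
]
encoded = dummify(rows, "color")
```

After this transformation the algorithm only sees numerical columns (`color_blue`, `color_red`), which is why it sidesteps the categorical-handling issue described above.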
Unlike MLLib, Sparkling Water requires that the whole training set fit into the distributed memory (i.e., the sum of the memory of all Spark executors).
Insufficient memory allocation to Spark executors can cause the job to fail or hang. You might need to tune the `spark.executor.memory` Spark configuration key.
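The memory constraint can be checked roughly before launching a training. The sketch below is a back-of-the-envelope helper, not part of DSS, Spark, or H2O: the function names, the simplified memory-string parser, and the `usable_fraction` safety margin are all assumptions made for illustration, and the real in-memory size of a dataset may differ from its size on disk.

```python
def parse_memory(value):
    """Convert a memory string such as '4g' or '512m' to bytes (simplified parser)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}
    value = value.strip().lower()
    if not value or value[-1] not in units:
        raise ValueError(f"expected a size with a k/m/g/t suffix, got {value!r}")
    return int(value[:-1]) * units[value[-1]]


def fits_in_distributed_memory(dataset_bytes, executor_memory, num_executors,
                               usable_fraction=0.5):
    """Return True if the dataset fits in the summed executor memory.

    `usable_fraction` is a hypothetical safety margin: only part of each
    executor's memory is realistically available for holding training data.
    """
    total_bytes = parse_memory(executor_memory) * num_executors
    return dataset_bytes <= total_bytes * usable_fraction


GIB = 1024 ** 3
# Hypothetical cluster: 10 executors with spark.executor.memory set to 4g.
small_ok = fits_in_distributed_memory(10 * GIB, "4g", 10)
large_ok = fits_in_distributed_memory(30 * GIB, "4g", 10)
```

If the check fails, either raise `spark.executor.memory`, add executors, or reduce the training set.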