Execution engines

Design of the preparation

A data preparation is always designed on an in-memory sample of the data. See Sampling for more information.
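As a rough illustration of designing on a sample, the sketch below loads only the first few rows of a source into memory. It is not how DSS implements sampling; the `head_sample` helper, the sample size, and the inline CSV data are all assumptions made for this example.

```python
import csv
import io
from itertools import islice


def head_sample(rows, max_rows):
    """Load at most max_rows rows into memory (hypothetical helper)."""
    return list(islice(rows, max_rows))


# Stand-in for a large dataset; a real source would be a file or a connection.
raw = io.StringIO("name,age\nalice,30\nbob,25\ncarol,41\n")
reader = csv.DictReader(raw)

# Only the sample is held in memory; the remaining rows are never loaded.
sample = head_sample(reader, 2)
print(sample)
```

The preparation steps are then tuned against `sample`, and only applied to the full data when the preparation is executed.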

Execution in analysis

In an analysis, the preparation is executed on the whole dataset when:

  • Exporting the prepared data
  • Running a machine learning model

In both cases, this uses a streaming engine: all data goes through the DSS server but does not need to be in memory.

Execution of the recipe

For execution of the recipe, DSS provides three execution engines:

Streaming

All data goes through the DSS server but does not need to be in memory.
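The general idea of streaming execution can be sketched as follows: each row is read, transformed, and written before the next one is read, so memory use does not grow with the size of the dataset. This is a minimal sketch and not the DSS engine itself; the `prepare` step, the `stream_prepare` function, and the column names are hypothetical.

```python
import csv
import io


def prepare(row):
    """Hypothetical preparation step: trim and upper-case the name column."""
    out = dict(row)
    out["name"] = out["name"].strip().upper()
    return out


def stream_prepare(source, sink):
    """Read rows one at a time, transform each, and write it immediately."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:  # only the current row is held in memory
        writer.writerow(prepare(row))
        count += 1
    return count


# In-memory stand-ins for the input and output datasets.
source = io.StringIO("name,age\n alice ,30\nbob,25\n")
sink = io.StringIO()
rows_written = stream_prepare(source, sink)
print(rows_written, sink.getvalue().splitlines())
```

The trade-off of this pattern is that every row still passes through a single process, which is why the distributed engines described in this section can be preferable for large datasets.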

Hadoop MapReduce

When both the input and output datasets of a data preparation recipe are supported HDFS datasets, the recipe can run entirely on Hadoop, as a MapReduce job.

To enable this behavior, go to the Settings / Build tab of the data preparation recipe and check “Run on Hadoop”. You do not need to fill in the “Split size” parameter.

Spark

When Spark is installed (see: DSS and Spark), preparation recipe jobs can run on Spark.

We recommend using the Spark engine only with HDFS or S3 datasets.