Execution engines

Design of the preparation

The design of a data preparation is always done on an in-memory sample of the data. See Sampling for more information.

Execution in analysis

Within an analysis, execution on the whole dataset happens when:

  • Exporting the prepared data
  • Running a machine learning model

In both cases, DSS uses a streaming engine: all data goes through the DSS server but does not need to be held in memory.
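Conceptually, streaming means that rows are read, transformed and written one at a time, so memory usage does not grow with the size of the dataset. The minimal Python sketch below illustrates the general idea only; it is not DSS code, and the file names and cleaning step are hypothetical.

    import csv

    def stream_prepare(in_path, out_path):
        # Illustrative only: apply a row-by-row cleaning step while streaming,
        # so only one row is held in memory at a time.
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                cleaned = {k: (v.strip() if v else v) for k, v in row.items()}
                writer.writerow(cleaned)

    # stream_prepare("input.csv", "prepared.csv")  # hypothetical file names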

Execution of the recipe

For execution of the recipe, DSS provides several execution engines:

DSS

All data goes through the DSS server but does not need to be held in memory, as it is streamed.

Spark

When Spark is installed (see: DSS and Spark), preparation recipe jobs can run on Spark.

We recommend that you use this engine only on HDFS or S3 datasets.

In-database (SQL)

A subset of the preparation processors can be translated to SQL queries. When all processors in a preparation recipe can be translated, and both input and output are tables in the same SQL connection, the recipe runs fully in-database.

Please see the warnings and limitations below.

Hadoop MapReduce (deprecated)

When both the input and output datasets of a data preparation recipe are supported HDFS datasets, the recipe can run fully on Hadoop as a MapReduce job.

Warning

This engine is deprecated and may be removed in a future release. We recommend that you use Spark instead.

Details on the in-database (SQL) engine

Only a subset of processors can be translated to SQL queries. They are documented in the processors reference. The SQL engine can only be selected if all processors are compatible with it.
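To make the idea of SQL translation concrete, the sketch below shows how a short chain of supported steps (keep columns, fill empty cells with a value, filter by numerical range) could be compiled into a single query that runs inside the database. The table name, column names and generated SQL are purely illustrative assumptions; this is not the translator that DSS actually uses.

    # Hypothetical illustration of SQL pushdown: each supported prepare step
    # contributes a SQL fragment, and the whole chain becomes a single query.
    steps = [
        ("keep_columns", ["customer_id", "amount", "country"]),
        ("fill_empty", ("country", "'unknown'")),
        ("filter_numerical_range", ("amount", 0, 10000)),
    ]

    def to_sql(table, steps):
        columns, where, filled = ["*"], [], {}
        for kind, params in steps:
            if kind == "keep_columns":
                columns = list(params)
            elif kind == "fill_empty":
                col, default = params
                filled[col] = f"COALESCE({col}, {default}) AS {col}"
            elif kind == "filter_numerical_range":
                col, low, high = params
                where.append(f"{col} BETWEEN {low} AND {high}")
        select = ", ".join(filled.get(c, c) for c in columns)
        sql = f"SELECT {select} FROM {table}"
        if where:
            sql += " WHERE " + " AND ".join(where)
        return sql

    print(to_sql("orders", steps))
    # SELECT customer_id, amount, COALESCE(country, 'unknown') AS country
    # FROM orders WHERE amount BETWEEN 0 AND 10000

If even one step in the chain has no SQL equivalent, this kind of single-query pushdown is not possible, which is why the engine can only be selected when every processor in the recipe is compatible.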

If you add an unsupported processor while the in-database engine is selected, DSS will show which processor cannot be used and why.

Note

There are some edge cases involving columns that change type where DSS may show the engine as supported, but running the recipe produces a syntax error. If that happens, you will need to disable the SQL engine and fall back to the DSS engine.

Some of these edge cases relate to type conflicts: for example, performing a find/replace operation that turns a textual column into a numerical one, and then immediately using that column in numerical operations.
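As a rough analogy in plain Python, with a hypothetical textual “price” cell: the find/replace step returns a value that is still text, so an immediate numerical operation on it fails, much like a strictly typed database rejecting arithmetic on a text column. The DSS engine re-parses each cell value instead, which is why falling back to it resolves the error.

    # Analogy only: the SQL translation keeps the column's declared (textual) type.
    cell = "1,299"                    # hypothetical textual "price" cell
    cleaned = cell.replace(",", "")   # find/replace: the result is still a string
    try:
        total = cleaned * 1.1         # immediate numerical use fails
    except TypeError as err:
        print(err)                    # can't multiply sequence by non-int of type 'float'
    total = float(cleaned) * 1.1      # re-parsing first, as the DSS engine effectively does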

Supported processors

These processors should always be available with SQL processing:

  • Keep/Delete columns
  • Reorder columns
  • Rename columns
  • Split columns
  • Filter by alphanumerical value
  • Filter by numerical range
  • Flag by alphanumerical value
  • Flag by numerical range
  • Remove rows with empty value
  • Fill empty cells with value
  • Concatenate columns
  • Copy columns
  • Unfold
  • Split and unfold

Partially supported processors

Depending on how these processors are configured, the recipe may revert to regular (non-SQL) processing. Other issues may also appear and require you to switch back to the DSS engine.

  • Formula (essentially the same support as in other visual recipes)
  • Filter by formula (see above)
  • Flag by formula (see above)
  • Find / Replace (especially around regular expressions)
  • String transformations (depends on the transformation)
  • Extract with regular expression
  • Date-handling processors (parse date, extract date components)

Details on the Spark engine

All processors are compatible with the Spark engine.

A subset of processors also has an optimized Spark version that runs up to several times faster than the default implementation.

When all processors in a prepare recipe have the optimized Spark version, the whole recipe runs with the “Spark (Optimized)” engine instead of “Spark (Regular)”.

Warning

Tier 2 support: The optimized Spark engine is covered by Tier 2 support.

It is possible to disable the optimized version and fall back to the regular implementation by going to Advanced and disabling “Use optimized implementation”. If you encounter issues, the Dataiku Support team may direct you to disable the optimized version.