One of the greatest strengths of Spark is its ability to execute long data pipelines with multiple steps without always having to write the intermediate data and re-read it at the next step.
In DSS, each recipe reads some datasets and writes some datasets. For example:
- A grouping recipe will read from the storage the input dataset, perform the grouping and write the grouped dataset to its storage.
- A PySpark recipe will direct Spark to read the input(s), perform the whole Spark computation defined by the PySpark recipe and then direct Spark to write the output(s)
With this behavior:
- When writing a coding Spark recipe (PySpark, SparkR, Spark-Scala or SparkSQL), you can write arbitrarily complex data processing pipelines with an arbitrary number of Spark steps and operations, and DSS will send this to Spark.
- However, when having a chain of visual recipes (a grouping then a prepare then a join then a grouping ...), each recipe is executed independently, and the data is materialized in each dataset.
If all of these visual recipes are Spark-enabled, it is possible to avoid the read-write-read-write cycle using Spark pipelines. When several consecutive recipes in a DSS Flow (including with branches or splits) use the Spark engine, DSS can automatically merge all of these recipes and run them as a single Spark job, called a Spark pipeline. This strongly boosts performance by avoiding needless writes and reads of intermediate datasets, and also alleviates Spark startup overheads.
A Spark pipeline covers multiple recipes, and thus one or more intermediate datasets which are part of the pipeline. You can configure the behavior of the pipeline for each of these intermediate datasets:
- Either this dataset is not meaningful nor useful by itself: it is only required as an intermediate step to feed recipes down the Flow. In that case, DSS will not at all write the data of this intermediate dataset when executing the Spark pipeline.
- Or the data in this dataset is actually useful (maybe because you have charts or notebooks using it). In that case, during the execution of the pipeline, DSS will still write the intermediate dataset.
Writing intermediate datasets reduces the performance gain of using a Spark pipeline, but does not negate it since the burden of re-reading the dataset afterwards is still alleviated. It also mutualizes startup overheads.
As of DSS 4.0, the Spark pipeline feature is considered as experimental. In case of issues, you always have the option to disable Spark pipelines on a per-project basis.
Merging of Spark pipelines is not enabled by default in DSS. It must be enabled on a per-project basis.
- Go to the project settings
- Go to “Build”
- Enable “Merge Spark pipelines”
- Save settings
You don’t need to do anything special to get Spark pipelines. Each time you run a build job, DSS will evaluate whether one or several Spark pipelines can be created and will run them automatically.
You can check whether a Spark pipeline has been created in the job’s results page. On the left part of the screen, “Spark pipeline (xx activities)” will appear and mention how many recipe or recipe executions were merged together.
The behavior of intermediate datasets can be configured by the user: write them or not (only the final datasets are written in that case).
This behavior is configured per dataset.
- Go to the dataset’s settings page
- Go to Advanced
- Check “Virtualizable in build” if you don’t require this dataset to be built when a Spark pipeline includes it.