DSS and Spark
Spark is a general engine for distributed computation. You can think of Spark as a faster and more convenient replacement for MapReduce. Once Spark integration is set up, Data Science Studio will offer settings to choose Spark as a job’s execution engine in various components.
SparkSQL recipes work much like SQL recipes but are not limited to SQL datasets. Data Science Studio fetches the data and passes it on to Spark. You can set the Spark configuration in the Advanced tab.
- For the Prepare recipe, select the engine and its configuration in the Advanced settings.
- For the Join, Group (Aggregate) and Stack recipes, select the engine under the Run button in the recipe’s main tab and set its configuration in the Advanced tab.
Interaction with DSS datasets is provided through a dedicated DSS Spark API, which makes it easy to create SparkSQL DataFrames from datasets.
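As a sketch of what a PySpark recipe using this API can look like (the dataset names `my_input` and `my_output` are examples, not part of any project, and this code only runs inside a DSS PySpark recipe, where the `dataiku` package is available):

```python
# Hypothetical PySpark recipe; dataset names are examples only.
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)

# Read a DSS dataset as a SparkSQL DataFrame
input_dataset = dataiku.Dataset("my_input")    # assumed dataset name
df = dkuspark.get_dataframe(sql_context, input_dataset)

# ... transform df using the SparkSQL API ...

# Write the result back to a DSS dataset
output_dataset = dataiku.Dataset("my_output")  # assumed dataset name
dkuspark.write_with_schema(output_dataset, df)
```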
The Jupyter notebook built into DSS supports both PySpark and SparkR.
Spark’s overhead is non-negligible and its support has some limitations (see Limitations). If your data fits in the memory of a single machine, other execution engines might be faster and easier to tune. It is recommended that you only use Spark for data that does not fit in the memory of a single machine.
DSS can read and write all datasets using Spark.
If you have a Spark engine (which does not require a cluster installation), using Spark on local datasets can still bring performance improvements for the Grouping and Join recipes.
However, as of DSS 2.1, only HDFS datasets fully benefit from the Spark distributed nature.
Non-HDFS datasets are read and written using a single Spark partition (not to be confused with DSS partitions). A single Spark partition is limited to 2 GB of data, so if your dataset is larger, you will need to repartition it.
- In PySpark and SparkR recipes, you need to use the SparkSQL API to repartition a DataFrame.
- In SparkSQL, Visual preparation, MLlib and VisualSQL recipes, repartitioning is automatic. You can configure it in the various Advanced tabs.
Spark has many configuration options, and you will probably need several configurations depending on what you do, which data you use, and so on. For instance, you may want to run Spark “locally” (on the DSS server) for some jobs and on YARN on your Hadoop cluster for others, or set the memory allocated to each worker.
- As an administrator, in the general settings (from the Administration menu), in the Spark section, you can add, remove or edit named “template” configurations, in which you set Spark options as key/value pairs. See the Spark configuration documentation.
- Everywhere you can set up a Spark job, you choose the base template configuration to use, and can optionally add or override configuration options for that specific job.
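For illustration, the key/value pairs entered in a template configuration are standard Spark properties. A hypothetical “YARN” template might contain something like the following (property names are real Spark options, but the values are examples, not defaults):

```
spark.master              yarn-client
spark.executor.memory     4g
spark.executor.instances  4
spark.default.parallelism 16
```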
- In most recipes that can load non-HDFS datasets (or sampled HDFS datasets), datasets are loaded as a single partition. They must be repartitioned so that every partition fits in a Spark worker’s RAM. The Repartition non-HDFS inputs setting specifies how many partitions they should be split into.
Spark’s additional possibilities come with a few limitations:
- Sampling with a filter is not supported for input datasets; use a filtering recipe instead.
- As of DSS 2.1, HDFS datasets perform much better on Spark than other datasets, for both reading and writing.
- Sampling an HDFS dataset (except with a fixed ratio) can be slower than loading it unsampled.